
Introduction

Regression analysis is one of the most powerful and widely used statistical techniques in data science. While correlation analysis (covered in Week 08) tells us whether two variables are related, regression analysis goes further: it allows us to predict one variable from one or more other variables, and to understand the strength of those relationships.

In this practical tutorial, we’ll learn how to conduct regression analysis using R, working with real data about mental health outcomes. By the end of this session, you should be able to:

Learning Objectives

  1. Understand the conceptual foundations of simple linear regression
  2. Fit a simple linear regression model and interpret model output
  3. Extend to multiple regression with several predictors
  4. Test regression assumptions (linearity, normality, homoscedasticity, independence)
  5. Identify outliers and influential cases using diagnostic statistics
  6. Compare regression models using hierarchical regression
  7. Work with categorical predictors using dummy coding
  8. Calculate and interpret effect sizes in regression (R², semi-partial correlations)
  9. Produce APA-formatted regression tables
  10. Understand when regression assumptions are violated and potential solutions

Required Packages

We’ll use several R packages for this analysis. Install them if needed:

install.packages("tidyverse")      # Data manipulation and ggplot2
install.packages("rio")            # Easy data import (CSV, SPSS, Excel)
install.packages("car")            # Diagnostics (VIF, leverage, Cook's D)
install.packages("corrplot")       # Correlation visualization
install.packages("psych")          # Descriptive statistics
install.packages("ppcor")          # Partial & semi-partial correlations
install.packages("knitr")          # Markdown tables
install.packages("kableExtra")     # Table formatting

Load libraries:

library(tidyverse)      # ggplot2, dplyr
library(rio)            # import() function
library(car)            # vif(), influencePlot()
library(corrplot)       # Visual correlation matrices
library(psych)          # Descriptive statistics
library(ppcor)          # pcor.test(), spcor.test() for partial & semi-partial correlations
library(knitr)          # Markdown tables
library(kableExtra)     # Table formatting

Part 1: Understanding Regression (Conceptual Foundations)

What is Regression?

Regression is a statistical technique for modeling the relationship between variables. In simple linear regression, we have:

  • One independent variable (predictor, X)
  • One dependent variable (outcome, Y)

We fit a straight line through the data to predict Y from X. The line is described by the equation:

\[Y_i = b_0 + b_1X_i + \varepsilon_i\]

Where:

  • \(Y_i\) = observed value for person i
  • \(b_0\) = intercept (where the line crosses the y-axis, i.e., the predicted value when X = 0)
  • \(b_1\) = slope (how much Y changes for each one-unit increase in X)
  • \(\varepsilon_i\) = error (residual) for person i

The predicted value itself contains no error term: \(\hat{Y}_i = b_0 + b_1X_i\). The residual \(\varepsilon_i\) is the gap between the observed value \(Y_i\) and the prediction \(\hat{Y}_i\).
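The slope and intercept have closed-form least-squares solutions. As a minimal sketch on made-up toy numbers (not the tutorial dataset), we can compute them by hand and confirm that lm() agrees:

```r
# Toy data (hypothetical numbers, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Closed-form least-squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# lm() produces the same estimates
coef(lm(y ~ x))
```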

Why Do We Use Regression?

  1. Prediction: Estimate outcomes for new individuals
  2. Understanding relationships: Quantify how X affects Y
  3. Hypothesis testing: Test whether relationships are statistically significant
  4. Controlling for confounds: In multiple regression, examine one predictor while holding others constant

The Logic of Least Squares

Regression finds the “best fit” line by minimizing the residual sum of squares (RSS), i.e., the sum of the squared residuals:

\[RSS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2\]

This means we want predictions that are as close as possible to the actual data.

Plain English: Think of a residual as the “miss” or “error” of our prediction. If we predict a depression score of 20 but the person actually scored 25, the residual is 5. These residuals represent the error present in the model. If a model fits the data well, all residuals will be small. Regression tries to find the line that makes these “misses” as small as possible for everyone.
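As a quick sketch of this idea on toy numbers (again, not our data): residuals are simply observed minus predicted values, and the least-squares residuals always sum to approximately zero:

```r
# Toy data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(3, 5, 4, 8, 7)
fit <- lm(y ~ x)

manual_resid <- y - fitted(fit)   # the "miss" for each observation
all.equal(unname(manual_resid), unname(residuals(fit)))  # same thing

sum(residuals(fit))  # least-squares residuals sum to ~0
```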


Part 2: Data Import & Exploration

Loading Our Regression Data

We’ll work with mental health data examining predictors of depression scores. The dataset contains:

  • CESD_total: Center for Epidemiologic Studies Depression Scale (our outcome)
  • PIL_total: Purpose in Life subscale (meaning)
  • AUDIT_TOTAL_NEW: Alcohol Use Disorders Identification Test
  • DAST_TOTAL_NEW: Drug Abuse Screening Test

# Import the regression data
regression_data <- import("data/regression_data.sav")

# Quick look at the data
head(regression_data)
dim(regression_data)
## [1] 267  11
str(regression_data)
## 'data.frame':    267 obs. of  11 variables:
##  $ age            : num  18 18 18 20 32 19 19 18 19 19 ...
##   ..- attr(*, "format.spss")= chr "F8.2"
##  $ gender         : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##   ..- attr(*, "format.spss")= chr "F8.0"
##   ..- attr(*, "labels")= Named num [1:2] -1 1
##   .. ..- attr(*, "names")= chr [1:2] "male" "female"
##  $ ethnicit       : num  2 1 1 3 1 1 2 2 1 3 ...
##   ..- attr(*, "format.spss")= chr "F8.0"
##   ..- attr(*, "labels")= Named num [1:6] 1 2 3 4 5 6
##   .. ..- attr(*, "names")= chr [1:6] "caucasian" "african amer" "hispanic" "asian/pacific islander" ...
##  $ major          : chr  "political sci" "pre-pharm" "pre-med" "psychology" ...
##   ..- attr(*, "format.spss")= chr "A39"
##  $ minor          : chr  "" "" "undecided" "Spanish" ...
##   ..- attr(*, "format.spss")= chr "A39"
##  $ gpa            : num  3 NA 3.5 3 3.5 3.8 3.2 4 NA 3.3 ...
##   ..- attr(*, "format.spss")= chr "F8.2"
##  $ class          : num  1 1 1 3 4 2 2 1 2 1 ...
##   ..- attr(*, "format.spss")= chr "F8.0"
##   ..- attr(*, "labels")= Named num [1:4] 1 2 3 4
##   .. ..- attr(*, "names")= chr [1:4] "freshman" "sophomore" "junior" "senior"
##  $ PIL_total      : num  121 76 98 122 99 134 102 124 126 112 ...
##   ..- attr(*, "format.spss")= chr "F8.2"
##   ..- attr(*, "display_width")= int 11
##  $ CESD_total     : num  28 37 20 15 7 7 27 10 9 8 ...
##   ..- attr(*, "format.spss")= chr "F8.2"
##   ..- attr(*, "display_width")= int 12
##  $ AUDIT_TOTAL_NEW: num  1 5 3 3 2 3 2 1 1 7 ...
##   ..- attr(*, "format.spss")= chr "F8.2"
##   ..- attr(*, "display_width")= int 17
##  $ DAST_TOTAL_NEW : num  0 0 0 1 0 0 1 0 0 1 ...
##   ..- attr(*, "format.spss")= chr "F8.2"
##   ..- attr(*, "display_width")= int 16

Descriptive Statistics

# Summary statistics
summary(regression_data)
##       age            gender           ethnicit        major          
##  Min.   :18.00   Min.   :-1.0000   Min.   :1.000   Length:267        
##  1st Qu.:18.00   1st Qu.: 1.0000   1st Qu.:1.000   Class :character  
##  Median :19.00   Median : 1.0000   Median :1.000   Mode  :character  
##  Mean   :19.14   Mean   : 0.5131   Mean   :1.406                     
##  3rd Qu.:19.00   3rd Qu.: 1.0000   3rd Qu.:1.000                     
##  Max.   :41.00   Max.   : 1.0000   Max.   :6.000                     
##  NA's   :1                         NA's   :1                         
##     minor                gpa            class         PIL_total    
##  Length:267         Min.   :1.000   Min.   :1.000   Min.   : 60.0  
##  Class :character   1st Qu.:2.900   1st Qu.:1.000   1st Qu.:103.0  
##  Mode  :character   Median :3.200   Median :1.000   Median :111.0  
##                     Mean   :3.211   Mean   :1.584   Mean   :110.7  
##                     3rd Qu.:3.600   3rd Qu.:2.000   3rd Qu.:121.0  
##                     Max.   :4.600   Max.   :4.000   Max.   :138.0  
##                     NA's   :46                                     
##    CESD_total   AUDIT_TOTAL_NEW  DAST_TOTAL_NEW 
##  Min.   : 0.0   Min.   : 0.000   Min.   :0.000  
##  1st Qu.: 7.0   1st Qu.: 2.000   1st Qu.:0.000  
##  Median :11.0   Median : 5.000   Median :0.000  
##  Mean   :13.2   Mean   : 6.807   Mean   :0.906  
##  3rd Qu.:17.0   3rd Qu.:11.000   3rd Qu.:1.000  
##  Max.   :47.0   Max.   :31.000   Max.   :9.000  
##                                  NA's   :1
# More detailed descriptives using psych package
psych::describe(regression_data)

Visualizing Distributions

# Create histograms for each numeric variable
regression_data[, c("CESD_total", "PIL_total", "AUDIT_TOTAL_NEW", "DAST_TOTAL_NEW")] %>%
  as.data.frame() %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black", alpha = 0.7) +
  facet_wrap(~name, scales = "free") +
  theme_bw() +
  labs(title = "Distributions of Regression Variables",
       x = "Score", y = "Frequency")

Check for Missing Data

# Count missing values
colSums(is.na(regression_data))
##             age          gender        ethnicit           major           minor 
##               1               0               1               0               0 
##             gpa           class       PIL_total      CESD_total AUDIT_TOTAL_NEW 
##              46               0               0               0               0 
##  DAST_TOTAL_NEW 
##               1
# Percentage missing
colMeans(is.na(regression_data)) * 100
##             age          gender        ethnicit           major           minor 
##       0.3745318       0.0000000       0.3745318       0.0000000       0.0000000 
##             gpa           class       PIL_total      CESD_total AUDIT_TOTAL_NEW 
##      17.2284644       0.0000000       0.0000000       0.0000000       0.0000000 
##  DAST_TOTAL_NEW 
##       0.3745318

Part 3: Simple Linear Regression

Correlation as a Starting Point

Before fitting regression, let’s examine correlations between depression (CESD) and our predictors:

# Pearson correlations with p-values (base R)
# Create correlation matrix
numeric_vars <- c("CESD_total", "PIL_total", "AUDIT_TOTAL_NEW", "DAST_TOTAL_NEW")
corr_matrix <- cor(regression_data[, numeric_vars], use = "complete.obs")
print(corr_matrix)
##                  CESD_total  PIL_total AUDIT_TOTAL_NEW DAST_TOTAL_NEW
## CESD_total       1.00000000 -0.5846550      0.09594426      0.2169417
## PIL_total       -0.58465498  1.0000000     -0.13103854     -0.1680770
## AUDIT_TOTAL_NEW  0.09594426 -0.1310385      1.00000000      0.5194087
## DAST_TOTAL_NEW   0.21694168 -0.1680770      0.51940873      1.0000000
# For p-values, use a simple approach with cor.test
cat("\nPearson Correlations with Depression (CESD_total):\n")
## 
## Pearson Correlations with Depression (CESD_total):
for (var in numeric_vars[-1]) {  # -1 excludes CESD
  test <- cor.test(regression_data$CESD_total, regression_data[[var]])
  cat(var, ": r =", round(test$estimate, 3), ", p =", round(test$p.value, 4), "\n")
}
## PIL_total : r = -0.58 , p = 0 
## AUDIT_TOTAL_NEW : r = 0.093 , p = 0.1295 
## DAST_TOTAL_NEW : r = 0.217 , p = 4e-04

Fitting a Simple Regression Model

Let’s predict depression (CESD_total) from purpose in life (PIL_total):

# Create a clean dataset with no missing values for regression comparison
regression_data_clean <- regression_data %>%
  dplyr::select(CESD_total, PIL_total, AUDIT_TOTAL_NEW, DAST_TOTAL_NEW) %>%
  drop_na()

# Fit the simple regression model on clean data
model_simple <- lm(CESD_total ~ PIL_total, data = regression_data_clean)

# View results
summary(model_simple)
## 
## Call:
## lm(formula = CESD_total ~ PIL_total, data = regression_data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.928  -5.385  -1.522   3.213  29.521 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 56.81537    3.75199   15.14   <2e-16 ***
## PIL_total   -0.39368    0.03362  -11.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.655 on 264 degrees of freedom
## Multiple R-squared:  0.3418, Adjusted R-squared:  0.3393 
## F-statistic: 137.1 on 1 and 264 DF,  p-value: < 2.2e-16

Interpreting the Output

The key elements of the summary are:

  • Intercept (b₀): The predicted depression score when PIL_total = 0 (an extrapolation here, since observed PIL scores range from 60 to 138)
  • PIL_total coefficient (b₁): For each 1-unit increase in purpose in life, depression decreases by this amount
  • Standard Error: Uncertainty in the coefficient estimate
  • t-value & Pr(>|t|): Tests whether coefficient differs from zero
  • R-squared (R²): Proportion of variance in depression explained by purpose in life
  • F-statistic: Tests overall model significance

Plain English (F-statistic): The F-ratio compares the variance explained by the model with the variance left unexplained by it (error). In other words, it is the ratio of how good the model is to how much error it contains. If the F-ratio is less than 1, there is more error than explained variance.

Plain English (t-statistic): The t-statistic tests whether a specific predictor contributes significantly to the model. Like F, it is based on the ratio of explained variance against unexplained variance, but for a single variable. It tests whether the coefficient (b-value) is significantly different from 0.

Plain English (R-squared): \(R^2\) is a measure of how much of the variability in the outcome is accounted for by the predictors. Think of it as the “percentage of the story” that our model explains. If \(R^2 = 0.33\), it means our model explains 33% of the differences in depression scores between people. The remaining 67% is unexplained variance (due to factors we didn’t measure).

Plain English (Adjusted R-squared): \(R^2\) tends to get bigger just by adding more variables, even if they are useless. Adjusted \(R^2\) corrects for this. It estimates how well your model would work in the “real world” (the population) rather than just your specific sample. It’s usually a bit lower than regular \(R^2\).
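The adjustment has a simple formula: \(R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}\), where n is the sample size and p the number of predictors. A minimal check using R's built-in mtcars data (not our dataset):

```r
# Verify the adjusted R^2 formula on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

n <- nrow(mtcars)
p <- 2  # number of predictors
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

all.equal(adj_manual, s$adj.r.squared)  # TRUE
```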

Plain English (Standard Error): The Standard Error (SE) tells us how much the coefficient (b-value) might vary if we took a different sample of people. A small SE means we are very confident in our estimate of the relationship. A large SE means our estimate is wobbly and uncertain.

Plain English (p-value): In regression, the p-value tells us the probability of seeing a relationship this strong if there really was no relationship in the real world. If \(p < .05\), we conclude the relationship is “statistically significant” (real enough to publish).

# Extract key values for interpretation
cat("Simple Regression: CESD ~ PIL\n")
## Simple Regression: CESD ~ PIL
cat("Intercept:", coef(model_simple)[1], "\n")
## Intercept: 56.81537
cat("PIL_total slope:", coef(model_simple)[2], "\n")
## PIL_total slope: -0.3936757
cat("R-squared:", summary(model_simple)$r.squared, "\n")
## R-squared: 0.3418214
cat("Adjusted R-squared:", summary(model_simple)$adj.r.squared, "\n")
## Adjusted R-squared: 0.3393283

Visualizing the Regression Line

# Create cleanup theme
cleanup <- theme(panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
                panel.background = element_blank(),
                axis.line.x = element_line(color = 'black'),
                axis.line.y = element_line(color = 'black'),
                legend.key = element_rect(fill = 'white'),
                text = element_text(size = 12))

# Plot with regression line
ggplot(regression_data, aes(x = PIL_total, y = CESD_total)) +
  geom_point(alpha = 0.5, size = 3) +
  geom_smooth(method = "lm", se = TRUE, color = "black", fill = "grey80") +
  xlab("Purpose in Life (PIL)") +
  ylab("Depression Score (CESD)") +
  ggtitle("Simple Regression: Depression Predicted by Purpose in Life") +
  cleanup


Part 4: Multiple Regression

Adding More Predictors

Depression likely has multiple causes. Let’s add alcohol and drug use as predictors:

# Fit multiple regression model on the same clean data
model_multiple <- lm(CESD_total ~ PIL_total + AUDIT_TOTAL_NEW + DAST_TOTAL_NEW, 
                     data = regression_data_clean)

# View results
summary(model_multiple)
## 
## Call:
## lm(formula = CESD_total ~ PIL_total + AUDIT_TOTAL_NEW + DAST_TOTAL_NEW, 
##     data = regression_data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.619  -5.172  -1.256   3.420  28.847 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     55.16374    3.88905  14.184   <2e-16 ***
## PIL_total       -0.38158    0.03384 -11.276   <2e-16 ***
## AUDIT_TOTAL_NEW -0.09211    0.09391  -0.981    0.328    
## DAST_TOTAL_NEW   1.03415    0.39871   2.594    0.010 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.585 on 262 degrees of freedom
## Multiple R-squared:  0.3587, Adjusted R-squared:  0.3513 
## F-statistic: 48.84 on 3 and 262 DF,  p-value: < 2.2e-16

Comparing Models

We can compare the simple and multiple models:

# ANOVA comparison
anova(model_simple, model_multiple)
# Or use R² to compare
cat("Simple Model R²:", summary(model_simple)$r.squared, "\n")
## Simple Model R²: 0.3418214
cat("Multiple Model R²:", summary(model_multiple)$r.squared, "\n")
## Multiple Model R²: 0.3586693
cat("Improvement:", 
    summary(model_multiple)$r.squared - summary(model_simple)$r.squared, "\n")
## Improvement: 0.01684788

Standardized Coefficients

The regression coefficients above are unstandardized (also called raw coefficients). They depend on the scale of measurement. Often we want standardized coefficients (beta weights) which are on a common scale:

# Standardize the variables manually
regression_data_std <- regression_data_clean %>%
  dplyr::select(CESD_total, PIL_total, AUDIT_TOTAL_NEW, DAST_TOTAL_NEW) %>%
  mutate(across(everything(), ~ as.numeric(scale(.x))))  # scale() returns a 1-column matrix; coerce to numeric

# Fit model with standardized variables
model_multiple_std <- lm(CESD_total ~ PIL_total + AUDIT_TOTAL_NEW + DAST_TOTAL_NEW,
                         data = regression_data_std)

# Extract standardized coefficients
coef(model_multiple_std)
##     (Intercept)       PIL_total AUDIT_TOTAL_NEW  DAST_TOTAL_NEW 
##    7.461120e-17   -5.666880e-01   -5.686394e-02    1.512301e-01
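Standardized and unstandardized coefficients are related by \(\beta = b \times SD(X) / SD(Y)\), so we never strictly need to refit the model. A sketch of this equivalence using the built-in mtcars data:

```r
# beta = b * SD(x) / SD(y); illustrated on built-in mtcars data
fit <- lm(mpg ~ wt, data = mtcars)
b <- coef(fit)["wt"]

beta_manual <- unname(b * sd(mtcars$wt) / sd(mtcars$mpg))

# Same result as fitting the model on z-scored variables
fit_std <- lm(scale(mpg) ~ scale(wt), data = mtcars)
c(beta_manual, unname(coef(fit_std)[2]))
```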

Part 5: Regression Assumptions

Plain English (Assumptions):

  • Linearity: The relationship between predictors and the outcome is a straight line.
  • Homoscedasticity: At each level of the predictor, the spread (variance) of the “misses” (residuals) should be roughly the same. If the points fan out like a funnel, that’s bad (heteroscedasticity).
  • Independence: The errors for one person shouldn’t be related to the errors for another person.
  • Normality: The errors should be normally distributed (bell curve).

Assumption 1: Linearity

par(mfrow = c(1, 3))
plot(regression_data_clean$PIL_total, regression_data_clean$CESD_total,
     main = "PIL vs CESD", xlab = "PIL", ylab = "CESD")
abline(lm(CESD_total ~ PIL_total, regression_data_clean), col = "red")

plot(regression_data_clean$AUDIT_TOTAL_NEW, regression_data_clean$CESD_total,
     main = "AUDIT vs CESD", xlab = "AUDIT", ylab = "CESD")
abline(lm(CESD_total ~ AUDIT_TOTAL_NEW, regression_data_clean), col = "red")

plot(regression_data_clean$DAST_TOTAL_NEW, regression_data_clean$CESD_total,
     main = "DAST vs CESD", xlab = "DAST", ylab = "CESD")
abline(lm(CESD_total ~ DAST_TOTAL_NEW, regression_data_clean), col = "red")

par(mfrow = c(1, 1))

Assumption 2: Normality of Residuals

Residuals (errors) should be normally distributed:

# Extract residuals
residuals_model <- residuals(model_multiple)

# Histogram of residuals
hist(residuals_model, breaks = 20, main = "Distribution of Residuals",
     xlab = "Residuals", ylab = "Frequency", col = "steelblue")

# Q-Q plot (quantile-quantile plot)
qqnorm(residuals_model, main = "Q-Q Plot of Residuals")
qqline(residuals_model, col = "red")

# Shapiro-Wilk test for normality
shapiro.test(residuals_model)
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals_model
## W = 0.93583, p-value = 2.383e-09

Assumption 3: Homoscedasticity (Equal Variance)

The spread of residuals should be consistent across fitted values:

# Residuals vs Fitted Values plot
plot(model_multiple, which = 1, 
     main = "Residuals vs Fitted Values\n(should show random scatter)")

# Scale-Location plot (which = 3; which = 2 would give a Q-Q plot)
plot(model_multiple, which = 3,
     main = "Scale-Location Plot\n(check if points form a horizontal band)")

Assumption 4: Independence

Observations should be independent (especially important in time-series or repeated-measures data). We can check for autocorrelation with the Durbin-Watson test:

# Durbin-Watson test (if applicable)
if(!require(lmtest)) install.packages("lmtest")
library(lmtest)
dwtest(model_multiple)
## 
##  Durbin-Watson test
## 
## data:  model_multiple
## DW = 1.5076, p-value = 2.291e-05
## alternative hypothesis: true autocorrelation is greater than 0
# Values close to 2 suggest independence
# < 2 suggests positive autocorrelation
# > 2 suggests negative autocorrelation
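The statistic itself is easy to compute by hand: \(DW = \sum_t (e_t - e_{t-1})^2 / \sum_t e_t^2\), calculated on the residuals in row order. A minimal sketch on built-in data (not our model):

```r
# Durbin-Watson statistic by hand, on built-in mtcars data
fit <- lm(mpg ~ wt, data = mtcars)
e <- residuals(fit)

dw_manual <- sum(diff(e)^2) / sum(e^2)
dw_manual  # values near 2 suggest little autocorrelation
```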

Part 6: Detecting Outliers & Influential Cases

Outliers

Outliers are extreme observations that deviate from the pattern:

Plain English: An outlier is a case that differs substantially from the main trend of the data. Imagine a regression predicting weight from height; a person who is very short but extremely heavy would be an outlier. These “weird” cases can bias the model, pulling the regression line towards them and making it less accurate for everyone else.

# Standardized residuals (residuals divided by SD)
std_residuals <- rstandard(model_multiple)

# Cases with |std residual| > 2.5 are potentially problematic
outlier_cases <- which(abs(std_residuals) > 2.5)
cat("Potential outliers (std residuals > 2.5):\n")
## Potential outliers (std residuals > 2.5):
print(outlier_cases)
##   1  34  49  66 164 167 195 197 244 
##   1  34  49  66 164 167 195 197 244
# Plot standardized residuals
plot(std_residuals, 
     main = "Standardized Residuals",
     ylab = "Standardized Residuals",
     xlab = "Observation")
abline(h = c(-2.5, 2.5), col = "red", lty = 2)

Leverage

Leverage measures how extreme a case is on the predictor variables:

# Leverage (hat values)
leverage <- hatvalues(model_multiple)

# Average leverage = (p+1)/n where p = number of predictors
p <- 3  # number of predictors
n <- nrow(regression_data_clean)
avg_leverage <- (p + 1) / n

cat("Average leverage:", avg_leverage, "\n")
## Average leverage: 0.01503759
cat("Cases with high leverage (>", 2*avg_leverage, "):\n")
## Cases with high leverage (> 0.03007519 ):
high_leverage <- which(leverage > 2*avg_leverage)
print(high_leverage)
##   2  34  39  46  50  63  65  68  70  78  84  89 176 201 203 215 231 244 245 
##   2  34  39  46  50  63  65  68  70  78  84  89 176 201 203 215 231 244 245
# Plot leverage
plot(leverage, main = "Leverage Values", ylab = "Leverage", xlab = "Observation")
abline(h = 2*avg_leverage, col = "red", lty = 2)
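Hat values come from the “hat matrix” \(H = X(X^{\top}X)^{-1}X^{\top}\); the leverage of case i is the i-th diagonal element. A sketch verifying this against hatvalues(), using built-in data:

```r
# Leverage = diagonal of the hat matrix, H = X (X'X)^{-1} X'
fit <- lm(mpg ~ wt + hp, data = mtcars)  # built-in data for illustration
X <- model.matrix(fit)

H <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(unname(diag(H)), unname(hatvalues(fit)))  # TRUE
```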

Cook’s Distance

Cook’s Distance measures influential cases (high leverage AND high residuals):

# Cook's D
cooks_d <- cooks.distance(model_multiple)

# Threshold is often 4/(n-p-1)
threshold <- 4 / (n - p - 1)

cat("Threshold for Cook's D:", threshold, "\n")
## Threshold for Cook's D: 0.01526718
cat("Influential cases:\n")
## Influential cases:
influential <- which(cooks_d > threshold)
print(influential)
##   2  34  36  49  65  66  68  78 164 167 195 197 199 244 245 
##   2  34  36  49  65  66  68  78 164 167 195 197 199 244 245
# Plot Cook's D
plot(cooks_d, main = "Cook's Distance", ylab = "Cook's Distance", xlab = "Observation")
abline(h = threshold, col = "red", lty = 2)
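Cook's D combines the two ingredients directly: \(D_i = \frac{r_i^2}{k} \cdot \frac{h_i}{1 - h_i}\), where \(r_i\) is the standardized residual, \(h_i\) the leverage, and k the number of model parameters (including the intercept). A check on built-in data:

```r
# Cook's distance from standardized residuals and leverage (built-in data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
k <- length(coef(fit))  # parameters, including the intercept

d_manual <- rstandard(fit)^2 / k * hatvalues(fit) / (1 - hatvalues(fit))
all.equal(unname(d_manual), unname(cooks.distance(fit)))  # TRUE
```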

All Diagnostics Together

# Comprehensive diagnostic plots
par(mfrow = c(2, 2))
plot(model_multiple)

par(mfrow = c(1, 1))

Part 7: Multicollinearity

When predictors are highly correlated with each other, it creates multicollinearity, which inflates standard errors and reduces precision:

Plain English (Multicollinearity): Multicollinearity means that the predictor variables are too closely related to each other. Crudely put, it means the b-values (coefficients) are less trustworthy. As Andy Field says: “Don’t lend them money and don’t let them go for dinner with your boy- or girlfriend.” If variables are too similar, the model can’t tell which one is actually doing the predicting.

# Variance Inflation Factor (VIF)
# VIF > 10 indicates problematic multicollinearity
# VIF > 5 is concerning
vif(model_multiple)
##       PIL_total AUDIT_TOTAL_NEW  DAST_TOTAL_NEW 
##        1.031853        1.373162        1.388818
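VIF has a direct interpretation: regress each predictor on all the other predictors, and VIF = 1/(1 − R²) from that auxiliary regression. A sketch with built-in data:

```r
# VIF_j = 1 / (1 - R^2_j), where R^2_j regresses predictor j on the others
# (illustrated with built-in mtcars data, not our model)
r2_wt <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)
vif_wt  # the value car::vif() would report for wt in lm(mpg ~ wt + hp + disp)
```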
# Correlation matrix of predictors
predictor_corr <- regression_data_clean %>%
  dplyr::select(PIL_total, AUDIT_TOTAL_NEW, DAST_TOTAL_NEW) %>%
  cor()

print(predictor_corr)
##                  PIL_total AUDIT_TOTAL_NEW DAST_TOTAL_NEW
## PIL_total        1.0000000      -0.1310385     -0.1680770
## AUDIT_TOTAL_NEW -0.1310385       1.0000000      0.5194087
## DAST_TOTAL_NEW  -0.1680770       0.5194087      1.0000000
# Visualize correlations
corrplot(predictor_corr, method = "circle", 
         addCoef.col = "black", diag = FALSE,
         title = "Correlations Among Predictors")


Part 8: Hierarchical Regression

Hierarchical regression allows us to test whether adding predictors significantly improves model fit:

# Model 1: PIL only
model_h1 <- lm(CESD_total ~ PIL_total, data = regression_data_clean)

# Model 2: Add substance use
model_h2 <- lm(CESD_total ~ PIL_total + AUDIT_TOTAL_NEW + DAST_TOTAL_NEW, 
               data = regression_data_clean)

# Compare models
cat("Model 1 R²:", summary(model_h1)$r.squared, "\n")
## Model 1 R²: 0.3418214
cat("Model 2 R²:", summary(model_h2)$r.squared, "\n")
## Model 2 R²: 0.3586693
# F-test for improvement
anova(model_h1, model_h2)
# The p-value tells us if Model 2 is significantly better than Model 1
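Under the hood, the anova() comparison is an F-test on the R² change: \(F = \frac{\Delta R^2 / df_1}{(1 - R^2_{full}) / df_2}\), with df₁ = number of predictors added and df₂ = n − k_full − 1. A check on built-in data:

```r
# F-test for the R^2 change between nested models (built-in mtcars data)
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

r2_1 <- summary(m1)$r.squared
r2_2 <- summary(m2)$r.squared
df1 <- 1                      # predictors added in m2
df2 <- nrow(mtcars) - 2 - 1   # n - (predictors in m2) - 1

F_change <- ((r2_2 - r2_1) / df1) / ((1 - r2_2) / df2)
c(F_change, anova(m1, m2)$F[2])  # identical
```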

Part 9: Effect Sizes in Regression

R-squared

R² tells us the proportion of variance in Y explained by the model:

r2 <- summary(model_multiple)$r.squared
cat("R-squared:", r2, "\n")
## R-squared: 0.3586693
cat("This means", r2*100, "% of variance in depression is explained by the model\n")
## This means 35.86693 % of variance in depression is explained by the model
# Adjusted R² penalizes for number of predictors
adj_r2 <- summary(model_multiple)$adj.r.squared
cat("Adjusted R-squared:", adj_r2, "\n")
## Adjusted R-squared: 0.3513258

Semi-partial (Part) Correlations

These correlations show the unique contribution of each predictor, removing the shared variance with other predictors:

# Semi-partial correlations
spcor_results <- spcor.test(regression_data_clean$CESD_total, 
                            regression_data_clean$PIL_total,
                            as.matrix(regression_data_clean[, c("AUDIT_TOTAL_NEW", "DAST_TOTAL_NEW")]))
cat("Semi-partial correlation for PIL_total:", spcor_results$estimate, "\n")
## Semi-partial correlation for PIL_total: -0.5578727
cat("p-value:", spcor_results$p.value, "\n")
## p-value: 5.378433e-23
# Semi-partial correlation squared = unique variance explained
cat("Unique variance explained by PIL_total:", spcor_results$estimate^2 * 100, "%\n")
## Unique variance explained by PIL_total: 31.1222 %
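The mechanics behind spcor.test(): the semi-partial correlation of Y with X₁ (removing X₂ from X₁ only) is the correlation between Y and the residuals of the regression X₁ ~ X₂. A sketch on built-in data:

```r
# Semi-partial correlation by hand (built-in mtcars data):
# correlate the outcome with the part of wt not shared with hp
wt_unique <- residuals(lm(wt ~ hp, data = mtcars))

sp <- cor(mtcars$mpg, wt_unique)
sp^2 * 100  # unique variance in mpg attributable to wt (percent)
```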

Part 10: APA-Formatted Output

Regression Table

# Extract model summary
model_summary <- summary(model_multiple)

# Create APA table
regression_table <- data.frame(
  Predictor = c("Intercept", "PIL_total", "AUDIT_TOTAL_NEW", "DAST_TOTAL_NEW"),
  b = coef(model_multiple),
  SE = model_summary$coefficients[, 2],
  t = model_summary$coefficients[, 3],
  p = model_summary$coefficients[, 4]
)

# Format for APA
kable(regression_table, 
      digits = 3,
      caption = "Table 1. Multiple Regression Predicting Depression from Purpose in Life, Alcohol Use, and Drug Use",
      format = "html") %>%
  kable_styling("striped")
Table 1. Multiple Regression Predicting Depression from Purpose in Life, Alcohol Use, and Drug Use

Predictor         b        SE      t         p
Intercept         55.164   3.889   14.184    < .001
PIL_total         -0.382   0.034   -11.276   < .001
AUDIT_TOTAL_NEW   -0.092   0.094   -0.981    0.328
DAST_TOTAL_NEW    1.034    0.399   2.594     0.010
cat("\nNote. R² =", round(model_summary$r.squared, 3), 
    ", adjusted R² =", round(model_summary$adj.r.squared, 3), "\n")
## 
## Note. R² = 0.359 , adjusted R² = 0.351

Model Summary Table

# Compare models
models_comparison <- data.frame(
  Model = c("1. PIL only", "2. PIL + Substance Use"),
  R2 = c(summary(model_h1)$r.squared, summary(model_h2)$r.squared),
  Adj_R2 = c(summary(model_h1)$adj.r.squared, summary(model_h2)$adj.r.squared),
  F = c(summary(model_h1)$fstatistic[1], summary(model_h2)$fstatistic[1])
)

kable(models_comparison,
      digits = 3,
      caption = "Table 2. Comparison of Regression Models") %>%
  kable_styling("striped")
Table 2. Comparison of Regression Models

Model                     R2      Adj_R2   F
1. PIL only               0.342   0.339    137.107
2. PIL + Substance Use    0.359   0.351    48.842

Part 11: When Assumptions Are Violated

What If Data Are Non-Normal?

If residuals are non-normally distributed:

  • Consider data transformations (log, square root)
  • Use robust regression methods
  • Report bootstrapped confidence intervals

What If We Have Heteroscedasticity?

If the variance of residuals is unequal:

  • Use weighted least squares regression
  • Report heteroscedasticity-consistent standard errors
  • Consider transforming the outcome variable
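Heteroscedasticity-consistent standard errors are a one-line fix in R. A minimal sketch, assuming the 'sandwich' and 'lmtest' packages are installed, illustrated on built-in data:

```r
# Heteroscedasticity-consistent ("robust") standard errors -- a sketch
# assuming the 'sandwich' and 'lmtest' packages are installed
library(lmtest)
library(sandwich)

fit <- lm(mpg ~ wt + hp, data = mtcars)  # built-in data for illustration
coeftest(fit, vcov = vcovHC(fit, type = "HC3"))  # same b-values, robust SEs
```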

What If We Have Outliers/Influential Cases?

  • Examine why the case is unusual
  • Check for data entry errors
  • Consider robust regression that downweights influential cases
  • Report results with and without outliers for comparison
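One option from the list above, sketched with MASS::rlm() (MASS ships with R), again on built-in data rather than our dataset:

```r
# Robust regression downweights extreme cases instead of deleting them
library(MASS)

fit_ols <- lm(mpg ~ wt, data = mtcars)
fit_rob <- rlm(mpg ~ wt, data = mtcars)   # M-estimation (Huber weights)

# If the slopes differ a lot, influential cases are driving the OLS fit
rbind(OLS = coef(fit_ols), Robust = coef(fit_rob))
```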

Summary & Key Points

Regression extends correlation by allowing prediction and testing directional relationships

Simple regression involves one predictor; multiple regression involves several

Interpretation depends on scaling: Unstandardized coefficients (b) vs. standardized (β)

R² measures model fit: The proportion of outcome variance explained

Always check assumptions: Linearity, normality, homoscedasticity, independence

Diagnose problems: Use plots, tests (Shapiro-Wilk, VIF), and distance measures (Cook’s D, leverage)

Hierarchical regression tests whether added predictors improve model fit

Effect sizes (R², semi-partial correlations) quantify the practical importance of relationships


References

[1] Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage Publications.


Further Resources

  • Textbook: Chapter 7 in Field, Miles, & Field (2012)
  • Visualization: Use plot(model) for quick diagnostic checks
  • Robust methods: See rlm() in the MASS package for robust regression
  • Cross-validation: Use caret package for evaluating model generalization

Tutorial created for ANLY500: Analytics I | Date: 2026-01-29