Machine: MEL_DEL - Windows 10 x64
Random Seed Set To: 3284
Regression analysis is one of the most powerful and widely used statistical techniques in data science. While correlation analysis (covered in Week 08) tells us whether two variables are related, regression analysis goes further: it allows us to predict one variable from one or more other variables, and to understand the strength of those relationships.
In this practical tutorial, we’ll learn how to conduct regression analysis using R, working with real data about mental health outcomes. By the end of this session, you should be able to:
We’ll use several R packages for this analysis. Install them if needed:
install.packages("tidyverse") # Data manipulation and ggplot2
install.packages("rio") # Easy data import (CSV, SPSS, Excel)
install.packages("car") # Diagnostics (VIF, leverage, Cook's D)
install.packages("corrplot") # Correlation visualization
install.packages("psych") # Descriptive statistics
install.packages("ppcor") # Partial & semi-partial correlations
install.packages("knitr") # Markdown tables
install.packages("kableExtra") # Table formatting
Load libraries:
library(tidyverse) # ggplot2, dplyr
library(rio) # import() function
library(car) # vif(), influencePlot()
library(corrplot) # Visual correlation matrices
library(psych) # Descriptive statistics
library(ppcor) # spcor(), ppcor() for partial correlations
library(knitr) # Markdown tables
library(kableExtra) # Table formatting
Regression is a statistical technique for modeling the relationship between variables. In simple linear regression, we have: - One independent variable (predictor, X) - One dependent variable (outcome, Y)
We fit a straight line through the data to predict Y from X. The line is described by the equation:
\[Y_i = b_0 + b_1X_i + \varepsilon_i\]
Where: - \(Y_i\) = observed value for person i - \(b_0\) = intercept (the predicted value of Y when X = 0) - \(b_1\) = slope (how much Y changes for each one-unit increase in X) - \(\varepsilon_i\) = error (residual) for person i. The predicted value is the point on the line itself: \(\hat{Y}_i = b_0 + b_1X_i\).
Regression finds the “best fit” line by minimizing the sum of squared residuals (SSR):
\[SSR = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2\]
This means we want predictions that are as close as possible to the actual data.
Plain English: Think of a residual as the “miss” or “error” of our prediction. If we predict a depression score of 20 but the person actually scored 25, the residual is 5. These residuals represent the error present in the model. If a model fits the data well, all residuals will be small. Regression tries to find the line that makes these “misses” as small as possible for everyone.
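The "best fit" idea can be checked numerically. A quick sketch with simulated data (not the study dataset): whatever line you draw by hand, the least-squares line produced by `lm()` always has a smaller sum of squared misses.

```r
# Toy illustration (simulated data): the least-squares line always has a
# smaller SSR than any other candidate line.
set.seed(7)
x <- rnorm(50, mean = 110, sd = 14)        # predictor-like scores
y <- 55 - 0.4 * x + rnorm(50, sd = 7)      # outcome with random error

fit <- lm(y ~ x)                           # least-squares fit

ssr_fit   <- sum((y - fitted(fit))^2)      # SSR of the fitted line
ssr_other <- sum((y - (60 - 0.5 * x))^2)   # SSR of an arbitrary other line

ssr_fit < ssr_other                        # TRUE: least squares wins
```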
We’ll work with mental health data examining predictors of depression scores. The dataset contains: - CESD_total: Center for Epidemiologic Studies Depression Scale (our outcome) - PIL_total: Purpose in Life subscale (meaning) - AUDIT_TOTAL_NEW: Alcohol Use Disorder Identification Test - DAST_TOTAL_NEW: Drug Abuse Screening Test
# Import the regression data
regression_data <- import("data/regression_data.sav")
# Quick look at the data
head(regression_data)
dim(regression_data)
## [1] 267 11
str(regression_data)
## 'data.frame': 267 obs. of 11 variables:
## $ age : num 18 18 18 20 32 19 19 18 19 19 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ gender : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## ..- attr(*, "format.spss")= chr "F8.0"
## ..- attr(*, "labels")= Named num [1:2] -1 1
## .. ..- attr(*, "names")= chr [1:2] "male" "female"
## $ ethnicit : num 2 1 1 3 1 1 2 2 1 3 ...
## ..- attr(*, "format.spss")= chr "F8.0"
## ..- attr(*, "labels")= Named num [1:6] 1 2 3 4 5 6
## .. ..- attr(*, "names")= chr [1:6] "caucasian" "african amer" "hispanic" "asian/pacific islander" ...
## $ major : chr "political sci" "pre-pharm" "pre-med" "psychology" ...
## ..- attr(*, "format.spss")= chr "A39"
## $ minor : chr "" "" "undecided" "Spanish" ...
## ..- attr(*, "format.spss")= chr "A39"
## $ gpa : num 3 NA 3.5 3 3.5 3.8 3.2 4 NA 3.3 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## $ class : num 1 1 1 3 4 2 2 1 2 1 ...
## ..- attr(*, "format.spss")= chr "F8.0"
## ..- attr(*, "labels")= Named num [1:4] 1 2 3 4
## .. ..- attr(*, "names")= chr [1:4] "freshman" "sophomore" "junior" "senior"
## $ PIL_total : num 121 76 98 122 99 134 102 124 126 112 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## ..- attr(*, "display_width")= int 11
## $ CESD_total : num 28 37 20 15 7 7 27 10 9 8 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## ..- attr(*, "display_width")= int 12
## $ AUDIT_TOTAL_NEW: num 1 5 3 3 2 3 2 1 1 7 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## ..- attr(*, "display_width")= int 17
## $ DAST_TOTAL_NEW : num 0 0 0 1 0 0 1 0 0 1 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## ..- attr(*, "display_width")= int 16
# Summary statistics
summary(regression_data)
## age gender ethnicit major
## Min. :18.00 Min. :-1.0000 Min. :1.000 Length:267
## 1st Qu.:18.00 1st Qu.: 1.0000 1st Qu.:1.000 Class :character
## Median :19.00 Median : 1.0000 Median :1.000 Mode :character
## Mean :19.14 Mean : 0.5131 Mean :1.406
## 3rd Qu.:19.00 3rd Qu.: 1.0000 3rd Qu.:1.000
## Max. :41.00 Max. : 1.0000 Max. :6.000
## NA's :1 NA's :1
## minor gpa class PIL_total
## Length:267 Min. :1.000 Min. :1.000 Min. : 60.0
## Class :character 1st Qu.:2.900 1st Qu.:1.000 1st Qu.:103.0
## Mode :character Median :3.200 Median :1.000 Median :111.0
## Mean :3.211 Mean :1.584 Mean :110.7
## 3rd Qu.:3.600 3rd Qu.:2.000 3rd Qu.:121.0
## Max. :4.600 Max. :4.000 Max. :138.0
## NA's :46
## CESD_total AUDIT_TOTAL_NEW DAST_TOTAL_NEW
## Min. : 0.0 Min. : 0.000 Min. :0.000
## 1st Qu.: 7.0 1st Qu.: 2.000 1st Qu.:0.000
## Median :11.0 Median : 5.000 Median :0.000
## Mean :13.2 Mean : 6.807 Mean :0.906
## 3rd Qu.:17.0 3rd Qu.:11.000 3rd Qu.:1.000
## Max. :47.0 Max. :31.000 Max. :9.000
## NA's :1
# More detailed descriptives using psych package
psych::describe(regression_data)
# Create histograms for each numeric variable
regression_data[, c("CESD_total", "PIL_total", "AUDIT_TOTAL_NEW", "DAST_TOTAL_NEW")] %>%
as.data.frame() %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30, fill = "steelblue", color = "black", alpha = 0.7) +
facet_wrap(~name, scales = "free") +
theme_bw() +
labs(title = "Distributions of Regression Variables",
x = "Score", y = "Frequency")
# Count missing values
colSums(is.na(regression_data))
## age gender ethnicit major minor
## 1 0 1 0 0
## gpa class PIL_total CESD_total AUDIT_TOTAL_NEW
## 46 0 0 0 0
## DAST_TOTAL_NEW
## 1
# Percentage missing
colMeans(is.na(regression_data)) * 100
## age gender ethnicit major minor
## 0.3745318 0.0000000 0.3745318 0.0000000 0.0000000
## gpa class PIL_total CESD_total AUDIT_TOTAL_NEW
## 17.2284644 0.0000000 0.0000000 0.0000000 0.0000000
## DAST_TOTAL_NEW
## 0.3745318
Before fitting regression, let’s examine correlations between depression (CESD) and our predictors:
# Pearson correlations with p-values using Hmisc
# Create correlation matrix
numeric_vars <- c("CESD_total", "PIL_total", "AUDIT_TOTAL_NEW", "DAST_TOTAL_NEW")
corr_matrix <- cor(regression_data[, numeric_vars], use = "complete.obs")
print(corr_matrix)
## CESD_total PIL_total AUDIT_TOTAL_NEW DAST_TOTAL_NEW
## CESD_total 1.00000000 -0.5846550 0.09594426 0.2169417
## PIL_total -0.58465498 1.0000000 -0.13103854 -0.1680770
## AUDIT_TOTAL_NEW 0.09594426 -0.1310385 1.00000000 0.5194087
## DAST_TOTAL_NEW 0.21694168 -0.1680770 0.51940873 1.0000000
# For p-values, use a simple approach with cor.test
cat("\nPearson Correlations with Depression (CESD_total):\n")
##
## Pearson Correlations with Depression (CESD_total):
for (var in numeric_vars[-1]) { # -1 excludes CESD
test <- cor.test(regression_data$CESD_total, regression_data[[var]])
cat(var, ": r =", round(test$estimate, 3), ", p =", round(test$p.value, 4), "\n")
}
## PIL_total : r = -0.58 , p = 0
## AUDIT_TOTAL_NEW : r = 0.093 , p = 0.1295
## DAST_TOTAL_NEW : r = 0.217 , p = 4e-04
Let’s predict depression (CESD_total) from purpose in life (PIL_total):
# Create a clean dataset with no missing values for regression comparison
regression_data_clean <- regression_data %>%
dplyr::select(CESD_total, PIL_total, AUDIT_TOTAL_NEW, DAST_TOTAL_NEW) %>%
drop_na()
# Fit the simple regression model on clean data
model_simple <- lm(CESD_total ~ PIL_total, data = regression_data_clean)
# View results
summary(model_simple)
##
## Call:
## lm(formula = CESD_total ~ PIL_total, data = regression_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.928 -5.385 -1.522 3.213 29.521
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.81537 3.75199 15.14 <2e-16 ***
## PIL_total -0.39368 0.03362 -11.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.655 on 264 degrees of freedom
## Multiple R-squared: 0.3418, Adjusted R-squared: 0.3393
## F-statistic: 137.1 on 1 and 264 DF, p-value: < 2.2e-16
The key elements of the summary are:
Plain English (F-statistic): The F-ratio is the ratio of the variation explained by the model to the variation left to unsystematic factors. In other words, it compares how good the model is against how bad it is (how much error there is). If the F-ratio is less than 1, there is more error than explained variance.
Plain English (t-statistic): The t-statistic tests whether a specific predictor contributes significantly to the model. Like F, it is based on the ratio of explained variance against unexplained variance, but for a single variable. It tests whether the coefficient (b-value) is significantly different from 0.
Plain English (R-squared): \(R^2\) is a measure of how much of the variability in the outcome is accounted for by the predictors. Think of it as the “percentage of the story” that our model explains. If \(R^2 = 0.33\), it means our model explains 33% of the differences in depression scores between people. The remaining 67% is unexplained variance (due to factors we didn’t measure).
Plain English (Adjusted R-squared): \(R^2\) tends to get bigger just by adding more variables, even if they are useless. Adjusted \(R^2\) corrects for this. It estimates how well your model would work in the “real world” (the population) rather than just your specific sample. It’s usually a bit lower than regular \(R^2\).
Plain English (Standard Error): The Standard Error (SE) tells us how much the coefficient (b-value) might vary if we took a different sample of people. A small SE means we are very confident in our estimate of the relationship. A large SE means our estimate is wobbly and uncertain.
Plain English (p-value): In regression, the p-value tells us the probability of seeing a relationship this strong if there really was no relationship in the real world. If \(p < .05\), we conclude the relationship is “statistically significant” (real enough to publish).
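These quantities are easy to compute by hand, which makes their definitions concrete. A sketch on simulated data (the formulas below are exactly what `summary.lm()` reports):

```r
# Computing R-squared and adjusted R-squared "by hand" (simulated data)
set.seed(10)
x <- rnorm(100)
y <- 5 - 0.4 * x + rnorm(100)
fit <- lm(y ~ x)

ss_res <- sum(resid(fit)^2)        # unexplained (residual) variation
ss_tot <- sum((y - mean(y))^2)     # total variation in the outcome
r2     <- 1 - ss_res / ss_tot      # proportion of variance explained

n <- length(y); p <- 1             # sample size and number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

all.equal(r2,     summary(fit)$r.squared)      # TRUE
all.equal(adj_r2, summary(fit)$adj.r.squared)  # TRUE
```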
# Extract key values for interpretation
cat("Simple Regression: CESD ~ PIL\n")
## Simple Regression: CESD ~ PIL
cat("Intercept:", coef(model_simple)[1], "\n")
## Intercept: 56.81537
cat("PIL_total slope:", coef(model_simple)[2], "\n")
## PIL_total slope: -0.3936757
cat("R-squared:", summary(model_simple)$r.squared, "\n")
## R-squared: 0.3418214
cat("Adjusted R-squared:", summary(model_simple)$adj.r.squared, "\n")
## Adjusted R-squared: 0.3393283
# Create cleanup theme
cleanup <- theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line.x = element_line(color = 'black'),
axis.line.y = element_line(color = 'black'),
legend.key = element_rect(fill = 'white'),
text = element_text(size = 12))
# Plot with regression line
ggplot(regression_data, aes(x = PIL_total, y = CESD_total)) +
geom_point(alpha = 0.5, size = 3) +
geom_smooth(method = "lm", se = TRUE, color = "black", fill = "grey80") +
xlab("Purpose in Life (PIL)") +
ylab("Depression Score (CESD)") +
ggtitle("Simple Regression: Depression Predicted by Purpose in Life") +
cleanup # note: adding theme_bw() after this would override the custom theme
Depression likely has multiple causes. Let’s add alcohol and drug use as predictors:
# Fit multiple regression model on the same clean data
model_multiple <- lm(CESD_total ~ PIL_total + AUDIT_TOTAL_NEW + DAST_TOTAL_NEW,
data = regression_data_clean)
# View results
summary(model_multiple)
##
## Call:
## lm(formula = CESD_total ~ PIL_total + AUDIT_TOTAL_NEW + DAST_TOTAL_NEW,
## data = regression_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.619 -5.172 -1.256 3.420 28.847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.16374 3.88905 14.184 <2e-16 ***
## PIL_total -0.38158 0.03384 -11.276 <2e-16 ***
## AUDIT_TOTAL_NEW -0.09211 0.09391 -0.981 0.328
## DAST_TOTAL_NEW 1.03415 0.39871 2.594 0.010 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.585 on 262 degrees of freedom
## Multiple R-squared: 0.3587, Adjusted R-squared: 0.3513
## F-statistic: 48.84 on 3 and 262 DF, p-value: < 2.2e-16
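Once a model is fit, `predict()` turns the coefficients into predictions for new people. A hedged sketch (the new respondent's values are made up; the model is refit on simulated data so the snippet runs standalone, but with the tutorial data you would pass `model_multiple` instead of `fit`):

```r
# Sketch: predicting depression for a hypothetical new respondent.
set.seed(123)
sim <- data.frame(
  PIL_total       = rnorm(100, mean = 110, sd = 14),
  AUDIT_TOTAL_NEW = rpois(100, lambda = 6),
  DAST_TOTAL_NEW  = rpois(100, lambda = 1)
)
sim$CESD_total <- 55 - 0.38 * sim$PIL_total +
  1.0 * sim$DAST_TOTAL_NEW + rnorm(100, sd = 7.5)
fit <- lm(CESD_total ~ PIL_total + AUDIT_TOTAL_NEW + DAST_TOTAL_NEW, data = sim)

# Hypothetical person: low purpose in life, moderate substance use
new_person <- data.frame(PIL_total = 90, AUDIT_TOTAL_NEW = 8, DAST_TOTAL_NEW = 2)
predict(fit, newdata = new_person, interval = "prediction")
```

`interval = "prediction"` returns the point estimate plus a 95% prediction interval for an individual case; use `interval = "confidence"` for the narrower interval around the mean prediction.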
We can compare the simple and multiple models:
# ANOVA comparison
anova(model_simple, model_multiple)
# Or use R² to compare
cat("Simple Model R²:", summary(model_simple)$r.squared, "\n")
## Simple Model R²: 0.3418214
cat("Multiple Model R²:", summary(model_multiple)$r.squared, "\n")
## Multiple Model R²: 0.3586693
cat("Improvement:",
summary(model_multiple)$r.squared - summary(model_simple)$r.squared, "\n")
## Improvement: 0.01684788
The regression coefficients above are unstandardized (also called raw coefficients). They depend on the scale of measurement. Often we want standardized coefficients (beta weights) which are on a common scale:
# Standardize the variables manually
regression_data_std <- regression_data_clean %>%
dplyr::select(CESD_total, PIL_total, AUDIT_TOTAL_NEW, DAST_TOTAL_NEW) %>%
mutate(across(everything(), scale))
# Fit model with standardized variables
model_multiple_std <- lm(CESD_total ~ PIL_total + AUDIT_TOTAL_NEW + DAST_TOTAL_NEW,
data = regression_data_std)
# Extract standardized coefficients
coef(model_multiple_std)
## (Intercept) PIL_total AUDIT_TOTAL_NEW DAST_TOTAL_NEW
## 7.461120e-17 -5.666880e-01 -5.686394e-02 1.512301e-01
Plain English (Assumptions): * Linearity: The relationship between predictors and the outcome is a straight line. * Homoscedasticity: At each level of the predictor, the spread (variance) of the “misses” (residuals) should be roughly the same. If the points fan out like a funnel, that’s bad (heteroscedasticity). * Independence: The errors for one person shouldn’t be related to the errors for another person. * Normality: The errors should be normally distributed (bell curve).
par(mfrow = c(1, 3))
plot(regression_data_clean$PIL_total, regression_data_clean$CESD_total,
main = "PIL vs CESD", xlab = "PIL", ylab = "CESD")
abline(lm(CESD_total ~ PIL_total, regression_data_clean), col = "red")
plot(regression_data_clean$AUDIT_TOTAL_NEW, regression_data_clean$CESD_total,
main = "AUDIT vs CESD", xlab = "AUDIT", ylab = "CESD")
abline(lm(CESD_total ~ AUDIT_TOTAL_NEW, regression_data_clean), col = "red")
plot(regression_data_clean$DAST_TOTAL_NEW, regression_data_clean$CESD_total,
main = "DAST vs CESD", xlab = "DAST", ylab = "CESD")
abline(lm(CESD_total ~ DAST_TOTAL_NEW, regression_data_clean), col = "red")
par(mfrow = c(1, 1))
Residuals (errors) should be normally distributed:
# Extract residuals
residuals_model <- residuals(model_multiple)
# Histogram of residuals
hist(residuals_model, breaks = 20, main = "Distribution of Residuals",
xlab = "Residuals", ylab = "Frequency", col = "steelblue")
# Q-Q plot (quantile-quantile plot)
qqnorm(residuals_model, main = "Q-Q Plot of Residuals")
qqline(residuals_model, col = "red")
# Shapiro-Wilk test for normality
shapiro.test(residuals_model)
##
## Shapiro-Wilk normality test
##
## data: residuals_model
## W = 0.93583, p-value = 2.383e-09
The spread of residuals should be consistent across fitted values:
# Residuals vs Fitted Values plot
plot(model_multiple, which = 1,
main = "Residuals vs Fitted Values\n(should show random scatter)")
# Scale-Location plot (which = 3; which = 2 would give the Q-Q plot)
plot(model_multiple, which = 3,
main = "Scale-Location Plot\n(check if points form a horizontal band)")
Observations should be independent (especially important in time-series or clustered data). We can check this with the Durbin-Watson test:
# Durbin-Watson test (if applicable)
if(!require(lmtest)) install.packages("lmtest")
library(lmtest)
dwtest(model_multiple)
##
## Durbin-Watson test
##
## data: model_multiple
## DW = 1.5076, p-value = 2.291e-05
## alternative hypothesis: true autocorrelation is greater than 0
# Values close to 2 suggest independence
# < 2 suggests positive autocorrelation
# > 2 suggests negative autocorrelation
Outliers are extreme observations that deviate from the pattern:
Plain English: An outlier is a case that differs substantially from the main trend of the data. Imagine a regression predicting weight from height; a person who is very short but extremely heavy would be an outlier. These “weird” cases can bias the model, pulling the regression line towards them and making it less accurate for everyone else.
# Standardized residuals (residuals divided by SD)
std_residuals <- rstandard(model_multiple)
# Cases with |std residual| > 2.5 are potentially problematic
outlier_cases <- which(abs(std_residuals) > 2.5)
cat("Potential outliers (std residuals > 2.5):\n")
## Potential outliers (std residuals > 2.5):
print(outlier_cases)
## 1 34 49 66 164 167 195 197 244
## 1 34 49 66 164 167 195 197 244
# Plot standardized residuals
plot(std_residuals,
main = "Standardized Residuals",
ylab = "Standardized Residuals",
xlab = "Observation")
abline(h = c(-2.5, 2.5), col = "red", lty = 2)
Leverage measures how extreme a case is on the predictor variables:
# Leverage (hat values)
leverage <- hatvalues(model_multiple)
# Average leverage = (p+1)/n where p = number of predictors
p <- 3 # number of predictors
n <- nrow(regression_data_clean)
avg_leverage <- (p + 1) / n
cat("Average leverage:", avg_leverage, "\n")
## Average leverage: 0.01503759
cat("Cases with high leverage (>", 2*avg_leverage, "):\n")
## Cases with high leverage (> 0.03007519 ):
high_leverage <- which(leverage > 2*avg_leverage)
print(high_leverage)
## 2 34 39 46 50 63 65 68 70 78 84 89 176 201 203 215 231 244 245
## 2 34 39 46 50 63 65 68 70 78 84 89 176 201 203 215 231 244 245
# Plot leverage
plot(leverage, main = "Leverage Values", ylab = "Leverage", xlab = "Observation")
abline(h = 2*avg_leverage, col = "red", lty = 2)
Cook’s Distance measures influential cases (high leverage AND high residuals):
# Cook's D
cooks_d <- cooks.distance(model_multiple)
# Threshold is often 4/(n-p-1)
threshold <- 4 / (n - p - 1)
cat("Threshold for Cook's D:", threshold, "\n")
## Threshold for Cook's D: 0.01526718
cat("Influential cases:\n")
## Influential cases:
influential <- which(cooks_d > threshold)
print(influential)
## 2 34 36 49 65 66 68 78 164 167 195 197 199 244 245
## 2 34 36 49 65 66 68 78 164 167 195 197 199 244 245
# Plot Cook's D
plot(cooks_d, main = "Cook's Distance", ylab = "Cook's Distance", xlab = "Observation")
abline(h = threshold, col = "red", lty = 2)
# Comprehensive diagnostic plots
par(mfrow = c(2, 2))
plot(model_multiple)
par(mfrow = c(1, 1))
When predictors are highly correlated with each other, it creates multicollinearity, which inflates standard errors and reduces precision:
Plain English (Multicollinearity): Multicollinearity means that the predictor variables are too closely related to each other. Crudely put, it means the b-values (coefficients) are less trustworthy. As Andy Field says: “Don’t lend them money and don’t let them go for dinner with your boy- or girlfriend.” If variables are too similar, the model can’t tell which one is actually doing the predicting.
# Variance Inflation Factor (VIF)
# VIF > 10 indicates problematic multicollinearity
# VIF > 5 is concerning
vif(model_multiple)
## PIL_total AUDIT_TOTAL_NEW DAST_TOTAL_NEW
## 1.031853 1.373162 1.388818
# Correlation matrix of predictors
predictor_corr <- regression_data_clean %>%
dplyr::select(PIL_total, AUDIT_TOTAL_NEW, DAST_TOTAL_NEW) %>%
cor()
print(predictor_corr)
## PIL_total AUDIT_TOTAL_NEW DAST_TOTAL_NEW
## PIL_total 1.0000000 -0.1310385 -0.1680770
## AUDIT_TOTAL_NEW -0.1310385 1.0000000 0.5194087
## DAST_TOTAL_NEW -0.1680770 0.5194087 1.0000000
# Visualize correlations
corrplot(predictor_corr, method = "circle",
addCoef.col = "black", diag = FALSE,
title = "Correlations Among Predictors")
Hierarchical regression allows us to test whether adding predictors significantly improves model fit:
# Model 1: PIL only
model_h1 <- lm(CESD_total ~ PIL_total, data = regression_data_clean)
# Model 2: Add substance use
model_h2 <- lm(CESD_total ~ PIL_total + AUDIT_TOTAL_NEW + DAST_TOTAL_NEW,
data = regression_data_clean)
# Compare models
cat("Model 1 R²:", summary(model_h1)$r.squared, "\n")
## Model 1 R²: 0.3418214
cat("Model 2 R²:", summary(model_h2)$r.squared, "\n")
## Model 2 R²: 0.3586693
# F-test for improvement
anova(model_h1, model_h2)
# The p-value tells us if Model 2 is significantly better than Model 1
R² tells us the proportion of variance in Y explained by the model:
r2 <- summary(model_multiple)$r.squared
cat("R-squared:", r2, "\n")
## R-squared: 0.3586693
cat("This means", r2*100, "% of variance in depression is explained by the model\n")
## This means 35.86693 % of variance in depression is explained by the model
# Adjusted R² penalizes for number of predictors
adj_r2 <- summary(model_multiple)$adj.r.squared
cat("Adjusted R-squared:", adj_r2, "\n")
## Adjusted R-squared: 0.3513258
These correlations show the unique contribution of each predictor, removing the shared variance with other predictors:
# Semi-partial correlations
spcor_results <- spcor.test(regression_data_clean$CESD_total,
regression_data_clean$PIL_total,
as.matrix(regression_data_clean[, c("AUDIT_TOTAL_NEW", "DAST_TOTAL_NEW")]))
cat("Semi-partial correlation for PIL_total:", spcor_results$estimate, "\n")
## Semi-partial correlation for PIL_total: -0.5578727
cat("p-value:", spcor_results$p.value, "\n")
## p-value: 5.378433e-23
# Semi-partial correlation squared = unique variance explained
cat("Unique variance explained by PIL_total:", spcor_results$estimate^2 * 100, "%\n")
## Unique variance explained by PIL_total: 31.1222 %
# Extract model summary
model_summary <- summary(model_multiple)
# Create APA table
regression_table <- data.frame(
Predictor = c("Intercept", "PIL_total", "AUDIT_TOTAL_NEW", "DAST_TOTAL_NEW"),
b = coef(model_multiple),
SE = model_summary$coefficients[, 2],
t = model_summary$coefficients[, 3],
p = model_summary$coefficients[, 4]
)
# Format for APA
kable(regression_table,
digits = 3,
caption = "Table 1. Multiple Regression Predicting Depression from Purpose in Life, Alcohol Use, and Drug Use",
format = "html") %>%
kable_styling("striped")
| Predictor | b | SE | t | p |
|---|---|---|---|---|
| Intercept | 55.164 | 3.889 | 14.184 | 0.000 |
| PIL_total | -0.382 | 0.034 | -11.276 | 0.000 |
| AUDIT_TOTAL_NEW | -0.092 | 0.094 | -0.981 | 0.328 |
| DAST_TOTAL_NEW | 1.034 | 0.399 | 2.594 | 0.010 |
cat("\nNote. R² =", round(model_summary$r.squared, 3),
", adjusted R² =", round(model_summary$adj.r.squared, 3), "\n")
##
## Note. R² = 0.359 , adjusted R² = 0.351
# Compare models
models_comparison <- data.frame(
Model = c("1. PIL only", "2. PIL + Substance Use"),
R2 = c(summary(model_h1)$r.squared, summary(model_h2)$r.squared),
Adj_R2 = c(summary(model_h1)$adj.r.squared, summary(model_h2)$adj.r.squared),
F = c(summary(model_h1)$fstatistic[1], summary(model_h2)$fstatistic[1])
)
kable(models_comparison,
digits = 3,
caption = "Table 2. Comparison of Regression Models") %>%
kable_styling("striped")
| Model | R2 | Adj_R2 | F |
|---|---|---|---|
| 1. PIL only | 0.342 | 0.339 | 137.107 |
| 2. PIL + Substance Use | 0.359 | 0.351 | 48.842 |
If residuals are non-normally distributed: - Consider data transformations (log, square root) - Use robust regression methods - Report bootstrapped confidence intervals
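The bootstrap remedy can be sketched with base R alone. A percentile bootstrap of the slope on simulated heavy-tailed data (the `boot` package offers a more polished interface; values here are illustrative):

```r
# Sketch: percentile-bootstrap 95% CI for a regression slope (base R only).
set.seed(42)
x <- rnorm(200)
y <- 2 - 0.4 * x + rt(200, df = 3)   # t-distributed errors violate normality

boot_slopes <- replicate(2000, {
  idx <- sample(seq_along(x), replace = TRUE)  # resample cases with replacement
  coef(lm(y[idx] ~ x[idx]))[2]                 # refit and keep the slope
})

quantile(boot_slopes, c(0.025, 0.975))         # percentile 95% CI
```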
If variance of residuals is unequal: - Use weighted least squares regression - Report heteroscedasticity-consistent standard errors - Consider transforming the outcome variable
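The weighted-least-squares remedy is also one line of base R once weights are chosen. A sketch on simulated heteroscedastic data, assuming (for illustration) that the error SD is proportional to x so inverse-variance weights are known; for heteroscedasticity-consistent standard errors, `sandwich::vcovHC()` with `lmtest::coeftest()` is the usual route:

```r
# Sketch: weighted least squares when residual spread grows with the predictor.
set.seed(1)
x <- runif(200, 1, 10)
y <- 3 + 2 * x + rnorm(200, sd = x)   # error SD proportional to x (funnel shape)

ols <- lm(y ~ x)                      # ordinary least squares, for comparison
wls <- lm(y ~ x, weights = 1 / x^2)   # weight each case by its inverse variance

cbind(OLS = coef(ols), WLS = coef(wls))
```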
✓ Regression extends correlation by allowing prediction and testing directional relationships
✓ Simple regression involves one predictor; multiple regression involves several
✓ Interpretation depends on scaling: Unstandardized coefficients (b) vs. standardized (β)
✓ R² measures model fit: The proportion of outcome variance explained
✓ Always check assumptions: Linearity, normality, homoscedasticity, independence
✓ Diagnose problems: Use plots, tests (Shapiro-Wilk, VIF), and distance measures (Cook’s D, leverage)
✓ Hierarchical regression tests whether added predictors improve model fit
✓ Effect sizes (R², semi-partial correlations) quantify the practical importance of relationships
[1] Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage Publications.
- plot(model) for quick diagnostic checks
- rlm() in the MASS package for robust regression
- caret package for evaluating model generalization

Tutorial created for ANLY500: Analytics I. Enter Your Name | Date: 2026-01-29