Ziyuan Huang
Last Updated: 2026-01-29
Regression is a way of predicting the value of one variable from other variables.
Field et al. (2012) define regression as fitting a model to data and using it to predict values of the dependent variable (DV) from one or more independent variables (IVs).
We extend beyond the data we collected to answer predictive questions.
\[Y_i = b_0 + b_1X_i + \varepsilon_i\]
Key Components:
Regression coefficients (\(b\)) are sometimes called:
- Gradient (slope) of the regression line
- Direction/strength of the relationship
- Unstandardized coefficients (original scale)
Simple Linear Regression (SLR): one X variable (IV)
Multiple Linear Regression (MLR): two or more X variables (IVs)
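As a quick sketch of the distinction, here are a simple and a multiple regression fit with lm(), using simulated data (all variable names here are hypothetical, not from the lecture dataset):

```r
# SLR vs. MLR with lm(), on simulated data
set.seed(42)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 2 + 0.5 * x1 - 0.3 * x2 + rnorm(100)

slr <- lm(y ~ x1)        # simple linear regression: one IV
mlr <- lm(y ~ x1 + x2)   # multiple linear regression: two IVs

coef(slr)   # b0 (intercept) and b1 (slope)
coef(mlr)   # b0, b1, b2
```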
MLR types include:
Is my overall model (i.e., the regression equation) useful at predicting the outcome variable?
How useful are each of the individual predictors for my model?
Our overall model uses an F-test, which tests whether the regression model significantly improves prediction compared to the null model (using just the mean).
Hypotheses for the overall test: H0: the model does not improve prediction over the mean-only model; H1: the model significantly improves prediction.
The F-test is always one-tailed because the test statistic is built from squared terms and can only be positive: \[F = \frac{MSM}{MSR}\] (systematic variance / unsystematic variance)
MSM = Mean Squares Model (improvement from regression line)
MSR = Mean Squares Residual (unexplained error)
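As a sketch on simulated data, the F statistic can be reconstructed by hand from MSM and MSR and checked against what summary() reports:

```r
# Reconstructing F = MSM / MSR by hand (simulated data)
set.seed(1)
x <- rnorm(50)
y <- 1 + 0.8 * x + rnorm(50)
fit <- lm(y ~ x)

ssm <- sum((fitted(fit) - mean(y))^2)   # model sum of squares
ssr <- sum(resid(fit)^2)                # residual sum of squares
msm <- ssm / 1                          # k = 1 predictor
msr <- ssr / df.residual(fit)           # df = N - k - 1

f_stat <- msm / msr
f_stat
# the same value summary(fit) reports as the F-statistic
unname(summary(fit)$fstatistic["value"])
```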
The General Linear Model: \[\text{Outcome} = \text{(Model)} + \text{Error}\]
In regression, we predict each person’s score \(Y_i\) by:
H0 Model (No Relationship): Prediction using only the mean, resulting in a flat line (\(b_1 = 0\))
H1 Model (Relationship Exists): Prediction using the regression line, incorporating predictor slopes
Total Sum of Squares (SST):
- Sum of squared differences between the observed values and the mean of Y
- Represents total variability in the outcome with no predictors (H0 model)
- Formula: \(SST = \sum(Y_i - \bar{Y})^2\)

Residual Sum of Squares (SSR):
- Sum of squared differences between the observed values and the values predicted by the regression line
- Represents variability unexplained by the model (error remaining)
- Formula: \(SSR = \sum(Y_i - \hat{Y_i})^2\)

Model Sum of Squares (SSM):
- Sum of squared differences between the predicted values and the mean of Y
- Represents variability explained by the model (improvement from adding predictors)
- Formula: \(SSM = SST - SSR\) or \(SSM = \sum(\hat{Y_i} - \bar{Y})^2\)

Relationship: \(SST = SSM + SSR\) (total variation = explained + unexplained)
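A quick check of this identity on a fitted model (simulated data):

```r
# Verifying SST = SSM + SSR on a toy model
set.seed(7)
x <- rnorm(40)
y <- 3 - 0.6 * x + rnorm(40)
fit <- lm(y ~ x)

sst <- sum((y - mean(y))^2)             # total variability (H0 model)
ssr <- sum((y - fitted(fit))^2)         # unexplained error
ssm <- sum((fitted(fit) - mean(y))^2)   # improvement from the model

all.equal(sst, ssm + ssr)   # explained + unexplained = total
```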
We test the individual predictors with a t-test:
Therefore, we might use the following hypotheses:
Or, we could use a directional test, since the test statistic t can be negative:
Unlike correlation, these statistics are reported with the t statistic and its degrees of freedom, t(df), where \(df = N - k - 1\) (k = number of predictors).
Unstandardized Regression Coefficient (\(b\)):
- Regression coefficient in the original scale of the variables
- Interpretation: for every one-unit increase in X, Y increases by \(b\) units
- Advantage: more interpretable for your specific problem context
- Disadvantage: hard to compare across predictors with different scales
- Formula: predicted \(Y = b_0 + b_1X\)

Standardized Regression Coefficient (\(\beta\) or "beta"):
- Regression coefficient after converting variables to standard deviation units
- Interpretation: for every one-SD increase in X, Y increases by \(\beta\) SDs
- Advantage: comparable across predictors with different scales; indicates relative importance
- Disadvantage: less interpretable for real-world problem context

When to Use Each:
- Use \(b\) for answering applied questions ("How many more sales per $1000 of advertising?")
- Use \(\beta\) for comparing predictor importance across variables with different scales
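One way to see the relationship between the two, sketched on simulated data: fitting the model on z-scored variables yields the \(\beta\)s, and each \(\beta_j\) equals \(b_j \times SD_{X_j} / SD_Y\):

```r
# Unstandardized b vs. standardized beta (simulated data)
set.seed(3)
x1 <- rnorm(60)
x2 <- rnorm(60)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(60)

b_fit    <- lm(y ~ x1 + x2)                        # b: original units
beta_fit <- lm(scale(y) ~ scale(x1) + scale(x2))   # beta: SD units

coef(b_fit)
coef(beta_fit)[-1]   # the betas; the intercept is ~0 after standardizing

# beta_1 is just b_1 rescaled by the SDs
coef(b_fit)["x1"] * sd(x1) / sd(y)
```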
Mental Health and Drug Use:
## 'data.frame': 267 obs. of 4 variables:
## $ PIL_total : num 121 76 98 122 99 134 102 124 126 112 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## ..- attr(*, "display_width")= int 11
## $ CESD_total : num 28 37 20 15 7 7 27 10 9 8 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## ..- attr(*, "display_width")= int 12
## $ AUDIT_TOTAL_NEW: num 1 5 3 3 2 3 2 1 1 7 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## ..- attr(*, "display_width")= int 17
## $ DAST_TOTAL_NEW : num 0 0 0 1 0 0 1 0 0 1 ...
## ..- attr(*, "format.spss")= chr "F8.2"
## ..- attr(*, "display_width")= int 16
## PIL_total CESD_total AUDIT_TOTAL_NEW DAST_TOTAL_NEW
## Min. : 60.0 Min. : 0.0 Min. : 0.000 Min. :0.000
## 1st Qu.:103.0 1st Qu.: 7.0 1st Qu.: 2.000 1st Qu.:0.000
## Median :111.0 Median :11.0 Median : 5.000 Median :0.000
## Mean :110.7 Mean :13.2 Mean : 6.807 Mean :0.906
## 3rd Qu.:121.0 3rd Qu.:17.0 3rd Qu.:11.000 3rd Qu.:1.000
## Max. :138.0 Max. :47.0 Max. :31.000 Max. :9.000
## NA's :1
## [1] 267
## [1] 266
In this section, we will add a few new outlier checks:
Because we are using regression as our model, we may consider using multiple checks before excluding outliers.
Mahalanobis distance uses the mahalanobis() function we have used previously.

mahal <- mahalanobis(nomiss,
                     colMeans(nomiss),
                     cov(nomiss))
cutmahal <- qchisq(1-.001, ncol(nomiss))
badmahal <- as.numeric(mahal > cutmahal) ##note the direction of the >
table(badmahal)

## badmahal
##   0   1 
## 261   5
We create the model with the lm() function and our regression formula. Y ~ X + X: Y is approximated by X plus X.

Leverage:
- Definition: the influence of that data point on the slope
- Each score is the change in slope if you exclude that data point
How do we calculate how much change is bad?
k <- 3 ##number of IVs
leverage <- hatvalues(model1)
cutleverage <- (2*k+2) / nrow(nomiss)
badleverage <- as.numeric(leverage > cutleverage)
table(badleverage)

## badleverage
##   0   1 
## 247  19
Influence (Cook’s Distance) - a measure of how much of an effect that single case has on the whole model
Often described as leverage + discrepancy
How do we calculate how much change is bad?
cooks <- cooks.distance(model1)
cutcooks <- 4 / (nrow(nomiss) - k - 1)
badcooks <- as.numeric(cooks > cutcooks)
table(badcooks)

## badcooks
##   0   1 
## 251  15
## totalout
## 0 1 2 3
## 239 17 8 2
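The total-outlier table above comes from summing the three flags. Here is a self-contained sketch under the same cutoff rules, with mtcars standing in for the lecture data (the "fewer than two checks" exclusion rule is one common choice, consistent with the counts shown above):

```r
# Combining Mahalanobis, leverage, and Cook's flags into one count
df <- mtcars[, c("mpg", "hp", "wt", "qsec")]
model <- lm(mpg ~ hp + wt + qsec, data = df)
k <- 3   # number of IVs

mahal <- mahalanobis(df, colMeans(df), cov(df))
badmahal <- as.numeric(mahal > qchisq(1 - .001, ncol(df)))
badleverage <- as.numeric(hatvalues(model) > (2 * k + 2) / nrow(df))
badcooks <- as.numeric(cooks.distance(model) > 4 / (nrow(df) - k - 1))

totalout <- badmahal + badleverage + badcooks
table(totalout)

# keep cases flagged by fewer than two checks
noout <- df[totalout < 2, ]
```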
##
## Call:
## lm(formula = CESD_total ~ PIL_total + AUDIT_TOTAL_NEW + DAST_TOTAL_NEW,
## data = noout)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.904 -5.086 -1.161 3.405 29.342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.19317 4.20489 12.888 <2e-16 ***
## PIL_total -0.37272 0.03629 -10.271 <2e-16 ***
## AUDIT_TOTAL_NEW -0.07774 0.09548 -0.814 0.416
## DAST_TOTAL_NEW 0.72741 0.50953 1.428 0.155
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.357 on 252 degrees of freedom
## Multiple R-squared: 0.3157, Adjusted R-squared: 0.3076
## F-statistic: 38.76 on 3 and 252 DF, p-value: < 2.2e-16
##
## Correlation of Coefficients:
## (Intercept) PIL_total AUDIT_TOTAL_NEW
## PIL_total -0.99
## AUDIT_TOTAL_NEW -0.16 0.06
## DAST_TOTAL_NEW -0.17 0.15 -0.47
If your assumptions go wrong:
## $r2
## [1] "$R^2 = .32$, 90\\% CI $[0.23, 0.39]$, $F(3, 252) = 38.76$, $p < .001$"
## [1] "$b = -0.37$, 95\\% CI $[-0.44, -0.30]$, $t(252) = -10.27$, $p < .001$"
## [1] "$b = -0.08$, 95\\% CI $[-0.27, 0.11]$, $t(252) = -0.81$, $p = .416$"
## [1] "$b = 0.73$, 95\\% CI $[-0.28, 1.73]$, $t(252) = 1.43$, $p = .155$"
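The 95% CIs around each \(b\) can be obtained with confint(); a minimal sketch on a toy model (variable names hypothetical):

```r
# Confidence intervals for regression coefficients (simulated data)
set.seed(9)
x <- rnorm(30)
y <- 2 + x + rnorm(30)
fit <- lm(y ~ x)

confint(fit, level = 0.95)   # 2.5% and 97.5% bounds for b0 and b1
```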
Two concerns:
We can use the QuantPsyc package for \(\beta\) values.

##       PIL_total AUDIT_TOTAL_NEW  DAST_TOTAL_NEW 
##     -0.54695645     -0.04844095      0.08573100
Multiple Correlation (R):
- The correlation between observed Y and predicted Y values
- Ranges from 0 to 1; indicates overall model strength
- In simple regression: \(R = |r_{XY}|\) (absolute value of the correlation)

\(R^2\) (Coefficient of Determination):
- Proportion of variance in Y explained by the model
- Formula: \(R^2 = \frac{SSM}{SST} = 1 - \frac{SSR}{SST}\)
- Interpretation: "The model explains \(R^2 \times 100\) percent of the variance in Y"
- All overlap with Y; used for the overall model
- \(R^2 = \frac{A+B+C}{A+B+C+D}\) (explained / total)

Semipartial Correlation Squared (\(sr^2\)):
- Unique contribution of this IV to \(R^2\) (variance in Y explained only by this predictor, after accounting for the other predictors)
- The increase in \(R^2\) when this X is added to the model
- Formula: \(sr^2 = \frac{A}{A+B+C+D}\) (unique variance / total variance)
- Interpretation: "Adding this predictor increases \(R^2\) by \(sr^2 \times 100\) percentage points"
- Used in hierarchical regression to show the incremental value of each predictor

Partial Correlation Squared (\(pr^2\)):
- Proportion of the remaining variance in Y (after removing the other predictors' influence) that is explained by this X
- Formula: \(pr^2 = \frac{A}{A+D}\) (unique variance / variance not explained by others)
- Interpretation: "Among the variance in Y not explained by the other predictors, this predictor explains \(pr^2 \times 100\) percent"
- Always at least as large as \(sr^2\) because the denominator excludes shared variance: \(pr^2 \geq sr^2\)
- Often reported alongside \(sr^2\) for a complete picture of predictor importance
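Both can be recovered from each predictor's t value; a sketch on simulated data, using \(pr^2 = t^2/(t^2 + df)\) (the formula applied later in these notes) and \(sr^2 = t^2(1 - R^2)/df\):

```r
# pr^2 and sr^2 from t values (simulated data)
set.seed(5)
x1 <- rnorm(80)
x2 <- rnorm(80)
y  <- 1 + x1 + 0.4 * x2 + rnorm(80)
s  <- summary(lm(y ~ x1 + x2))

t_vals <- s$coefficients[-1, "t value"]   # drop the intercept
df_res <- s$df[2]                         # N - k - 1

pr2 <- t_vals^2 / (t_vals^2 + df_res)          # partial r squared
sr2 <- t_vals^2 * (1 - s$r.squared) / df_res   # semipartial r squared
pr2
sr2   # pr2 >= sr2 for every predictor
```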
We would add these to our other reports:
## PIL_total CESD_total AUDIT_TOTAL_NEW DAST_TOTAL_NEW
## PIL_total 1.000000000 0.295101597 0.005899378 0.005606820
## CESD_total 0.295101597 1.000000000 0.002623799 0.008022779
## AUDIT_TOTAL_NEW 0.005899378 0.002623799 1.000000000 0.218315640
## DAST_TOTAL_NEW 0.005606820 0.008022779 0.218315640 1.000000000
Method:
- Known predictors (based on past research) are entered first
- New predictors are entered in separate steps
- Tests the significance of each step addition and of the individual predictors

Answers the Following Questions:
- Is my overall model significant? (F-test for the final model)
- Is the addition of each step significant? (Comparison of \(R^2\) between models via \(\Delta F\))
- Are the individual predictors significant? (t-tests for each coefficient)

When to Use:
- Control for known/nuisance variables first before testing new predictors
- See the incremental value of adding new variables to an existing model
- Discuss groups of variables together as a conceptual set
- Based on a priori theory (NOT exploratory/stepwise selection)
Dummy Coding (Contrast Coding):
- Converts a categorical variable into multiple binary indicators
- Each dummy variable compares one category against a reference category
- Advantage: allows interpretation as comparisons (e.g., "Treatment vs. Control")
- The reference category gets value 0 on all dummies
- Each non-reference category gets value 1 on its corresponding dummy
- If there are k categories, create k-1 dummy variables
- Interpretation of \(b\): difference in Y between this category and the reference category

Other Coding Systems:
- Deviation (effect) coding, orthogonal coding, Helmert coding: https://stats.idre.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis/
R handles dummy coding automatically:
- When you use factor() to convert a variable to a categorical type, R creates k-1 dummy variables and uses the first level as the reference category
- Each dummy variable shows 1 if the observation is in that category, 0 otherwise
- Example coding table:
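A sketch of what that coding looks like, using a hypothetical three-level factor and model.matrix() to expose the indicator columns R creates:

```r
# R's automatic dummy coding: first level = reference
group <- factor(c("Control", "Control", "TreatmentA", "TreatmentB"))
levels(group)            # "Control" sorts first, so it is the reference

model.matrix(~ group)    # k - 1 = 2 dummy columns plus the intercept
```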
Change Reference Category in R:
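A minimal sketch with relevel() (the level names are hypothetical):

```r
# Changing the reference category with relevel()
group  <- factor(c("Control", "TreatmentA", "TreatmentB"))
group2 <- relevel(group, ref = "TreatmentA")

levels(group2)   # "TreatmentA" is now first, so it becomes the reference
```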
Research Question: Do different depression treatments reduce depression ratings after controlling for family history?
Variables:
- IVs:
  - Family history of depression (continuous predictor)
  - Treatment for depression (categorical: No Treatment, Placebo, Paxil, Effexor, Cheerup)
- DV: Depression rating after treatment (continuous outcome)
## 'data.frame': 50 obs. of 3 variables:
## $ treat : num 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "Treatment"
## ..- attr(*, "format.spss")= chr "F8.0"
## ..- attr(*, "display_width")= int 18
## ..- attr(*, "labels")= Named num [1:5] 0 1 2 3 4
## .. ..- attr(*, "names")= chr [1:5] "No Treatment" "Placebo" "Seroxat (Paxil)" "Effexor" ...
## $ familyhistory: num 6.79 6.88 19.65 10.8 32.27 ...
## ..- attr(*, "label")= chr "Family History"
## ..- attr(*, "format.spss")= chr "F8.0"
## $ after : num 16 18 13 15 18 16 18 19 9 16 ...
## ..- attr(*, "label")= chr "After Treatment"
## ..- attr(*, "format.spss")= chr "F8.0"
## $label
## [1] "Treatment"
##
## $format.spss
## [1] "F8.0"
##
## $display_width
## [1] 18
##
## $labels
## No Treatment Placebo Seroxat (Paxil) Effexor Cheerup
## 0 1 2 3 4
Data Screening: Should be done on the LAST model (skipped here for brevity)
Model 1: Base Model with Control Variable
- Enter family history alone to establish a baseline
- Tests if family history alone predicts depression rating
- Overall fit: \(F(1,48) = 8.50\), \(p = .005\), \(R^2 = .15\)
- Interpretation: Family history accounts for 15% of the variance in post-treatment depression
- Family history predictor: \(b = 0.15\), \(t(48) = 2.92\), \(p = .005\), \(pr^2 = .15\)
- Interpretation: For each unit increase in family history, depression rating increases by 0.15 units
##
## Call:
## lm(formula = after ~ familyhistory, data = hdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5120 -1.9028 -0.2193 2.0544 6.7958
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.00363 0.84477 13.026 <2e-16 ***
## familyhistory 0.15313 0.05254 2.915 0.0054 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.133 on 48 degrees of freedom
## Multiple R-squared: 0.1504, Adjusted R-squared: 0.1327
## F-statistic: 8.495 on 1 and 48 DF, p-value: 0.005396
Model 2: Full Model with Treatment Added
- Add the treatment category to Model 1 (keep family history to maintain control)
- Tests if treatment incrementally predicts depression rating after controlling for family history
- The overall model is significant, but focus on the change between models (not just overall significance)
  - Why? If Model 1 was already significant, the overall significance might just reflect Model 1's contribution
- We need \(\Delta R^2\) and \(\Delta F\) to show treatment adds predictive value beyond family history
##
## Call:
## lm(formula = after ~ familyhistory + treat, data = hdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7908 -1.6690 0.0508 1.6674 5.4108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.98816 1.09637 12.759 < 2e-16 ***
## familyhistory 0.13513 0.05088 2.656 0.010973 *
## treatPlacebo -4.09905 1.21381 -3.377 0.001542 **
## treatPaxil -2.03744 1.22146 -1.668 0.102411
## treatEffexor -2.59078 1.26984 -2.040 0.047356 *
## treatCheerup -4.96339 1.22489 -4.052 0.000203 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.714 on 44 degrees of freedom
## Multiple R-squared: 0.4154, Adjusted R-squared: 0.349
## F-statistic: 6.254 on 5 and 44 DF, p-value: 0.0001832
Model Comparison via ANOVA:
- Use anova(model1, model2) to test: does treatment add significant predictive value?
- Key statistics:
  - \(\Delta R^2\): increase in \(R^2\) from Model 1 to Model 2 (variance explained by treatment after family history)
  - \(\Delta F\): F-test comparing the models' fit
- Interpretation: the addition of the treatment set was significant, \(\Delta F(4, 44) = 4.99, p = .002, \Delta R^2 = .27\)
  - Treatment explains an additional 27% of the variance in depression ratings
  - This improvement is statistically significant (p = .002)
## Analysis of Variance Table
##
## Model 1: after ~ familyhistory
## Model 2: after ~ familyhistory + treat
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 48 471.12
## 2 44 324.13 4 146.99 4.9883 0.002102 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
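The same comparison sketched end to end, with mtcars standing in for the lecture data: \(\Delta R^2\) comes from the two summaries, \(\Delta F\) from anova():

```r
# Hierarchical comparison: delta R^2 and delta F (mtcars as stand-in)
m1 <- lm(mpg ~ wt, data = mtcars)             # step 1: control variable
m2 <- lm(mpg ~ wt + hp + qsec, data = mtcars) # step 2: added set

delta_r2 <- summary(m2)$r.squared - summary(m1)$r.squared
cmp <- anova(m1, m2)           # delta F test for the added predictors
delta_f <- cmp$F[2]
delta_p <- cmp$`Pr(>F)`[2]

c(delta_r2 = delta_r2, delta_f = delta_f, p = delta_p)
```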
Interpreting Dummy-Coded Coefficients:
- Each dummy coefficient (\(b\)) = difference between that category and the reference category
- Reference category (usually the first level): automatically set to 0, acts as the baseline
- Positive \(b\): that category has a higher outcome than the reference
- Negative \(b\): that category has a lower outcome than the reference
- \(b\) = difference in means, controlling for (holding constant) the other predictors
Visualizing Results with emmeans:
- Raw \(b\) values can be hard to interpret
- Estimated Marginal Means (EMMs): predicted mean Y for each group, given the other predictors' values
- Advantage: shows group means on the original scale (not as comparisons)
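What emmeans computes can be sketched in base R: predicted group means with the covariate held at its overall mean (here mtcars' cyl stands in for the treatment factor; emmeans::emmeans(model, "treat") reports the same idea directly):

```r
# Estimated marginal means by hand: predict at the covariate mean
dat <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ wt + cyl, data = dat)

grid <- data.frame(wt  = mean(dat$wt),
                   cyl = factor(levels(dat$cyl), levels = levels(dat$cyl)))
emm <- predict(fit, newdata = grid)   # one adjusted mean per group
setNames(round(emm, 2), levels(dat$cyl))
```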
## treat emmean SE df lower.CL upper.CL
## No Treatment 15.8 0.858 44 14.11 17.6
## Placebo 11.7 0.858 44 10.01 13.5
## Paxil 13.8 0.871 44 12.04 15.6
## Effexor 13.2 0.930 44 11.37 15.1
## Cheerup 10.9 0.877 44 9.11 12.6
##
## Confidence level used: 0.95
We cannot use our pcor code on our categorical variables. What can we do to calculate? We can work from the t values instead, using \(pr^2 = t^2 / (t^2 + df)\):

model_summary <- summary(model2)
t_values <- model_summary$coefficients[ , 3]
df_t <- model_summary$df[2]
t_values^2 / (t_values^2 + df_t)

##   (Intercept) familyhistory  treatPlacebo    treatPaxil  treatEffexor 
##    0.78721534    0.13817150    0.20583673    0.05947423    0.08642805 
##  treatCheerup 
##    0.27175918
We can use the pwr library to calculate the required sample size for a desired power and effect size. First we take the model \(R^2\) and convert it to Cohen's \(f^2 = \frac{R^2}{1 - R^2}\):

## [1] 0.4154487

## [1] 0.7107138
Function Arguments:
- u = degrees of freedom for the model (numerator df, the first value in the F-statistic)
- v = degrees of freedom for error (denominator df); leave blank (NULL) when solving for sample size
- f2 = Cohen's \(f^2\) (converted effect size)
- sig.level = alpha level (typically .05)
- power = desired statistical power (typically .80)

Final Sample Size Calculation:
- The output provides the v (error df) needed
- Actual N = \(v + k + 1\), where k = number of predictors
#f2 is cohen f squared
pwr.f2.test(u = model_summary$df[1],
            v = NULL, f2 = f2,
            sig.level = .05, power = .80)

## 
##      Multiple regression power calculation 
## 
##               u = 6
##               v = 19.20439
##              f2 = 0.7107138
##       sig.level = 0.05
##           power = 0.8
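Turning that output into a sample size, following the \(N = v + k + 1\) rule above (v rounded up; k = 5 predictors in model2, family history plus four treatment dummies):

```r
# From error df (v) to required N
v <- 19.20439            # error df returned by pwr.f2.test() above
k <- 5                   # predictors in the final model
N <- ceiling(v) + k + 1
N                        # about 26 participants for 80% power
```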
In this lecture, we’ve covered:
Foundations:
- Regression equation and interpretation of coefficients (\(b_0\), \(b_1\), \(\varepsilon\))
- Method of least squares for finding the best-fit line
- Sums of squares (SST, SSR, SSM) and their meaning

Model Evaluation:
- F-test for overall model significance
- \(R^2\) and \(R\) as effect sizes for model fit
- Comparison of H0 (mean-only) vs. H1 (regression) models

Individual Predictors:
- t-tests and hypothesis testing for coefficients
- Unstandardized (\(b\)) vs. standardized (\(\beta\)) coefficients
- Partial and semipartial correlations for relative importance

Advanced Topics:
- Regression assumptions and data screening
- Outlier detection (Mahalanobis, leverage, Cook's D)
- Hierarchical regression with model comparison
- Categorical predictors and dummy coding
- Power analysis for sample size planning
Field et al. (2012) reference: Chapter 7, Discovering Statistics Using R