This Week 07 tutorial continues our data screening journey from Week 06, focusing on testing statistical assumptions. After cleaning the data (accuracy, missing values, outliers), we must verify that our data meet the requirements of parametric statistical tests. This tutorial turns the lecture concepts from 07_datascreen_2.rmd into hands-on, code-first workflows.
Reference: This tutorial draws heavily from Field, Miles, and Field [1], Discovering Statistics Using R, particularly Chapters 5 and 7 on assumptions and regression diagnostics, with IEEE-style citations throughout.
Upgrade note: This hands-on tutorial extends the Week 07 lecture slides in 07_datascreen_2.rmd with code-first workflows, beginner-friendly explanations, and theoretical grounding from Field et al. [1].
Package convention: run install.packages("X") once per machine, then load with library(X) at the start of each session. By the end of this tutorial, you will be able to test each of the five parametric assumptions in R and decide how to respond when one is violated.
# Install if needed
# install.packages(c("tidyverse", "rio", "mice", "car", "moments", "corrplot", "knitr", "kableExtra"))
library(tidyverse)
library(ggplot2) # For data visualization
library(rio) # For data import
library(mice) # For multiple imputation
library(car) # For VIF and Durbin-Watson tests
library(moments) # For skewness and kurtosis
library(corrplot) # For correlation visualization
library(knitr)
library(kableExtra)
cat("\n=== REPRODUCIBLE SEED INFORMATION ===")
##
## === REPRODUCIBLE SEED INFORMATION ===
##
## Generator Name: Week07 Data Screening Part 2 - ANLY500
##
## MD5 Hash: cccb8a5cbee9f1dd300d496467230978
##
## Available Seeds: 5912985, 166518175, 149568504, -691293678, -634303044 ...
The fundamental problem: Parametric statistical tests (t-tests, ANOVA, regression) make specific assumptions about the data. When these assumptions are violated, our results become unreliable [1].
Analogy: Think of assumptions like the foundation of a house. You can build a beautiful house on a weak foundation, but it won’t stand for long. Similarly, you can run complex analyses on data that violate assumptions, but your conclusions will be shaky [1].
How to read this diagram:
1. Start with cleaned data (after Week 06 screening)
2. Create a fake regression to generate diagnostic information
3. Check each assumption systematically (1-5)
4. Decision point: are all assumptions met?
   - ✓ Yes → proceed with your planned analysis
   - ✗ No → fix violations and re-check
Key reminders:
- Visual inspection is as important as statistical tests
- Large samples (N ≥ 30) are more forgiving of normality violations
- Some violations are more serious than others (independence is critical!)
Field et al. [1] identify five critical assumptions for parametric statistics:
| Assumption | Plain English | What Happens If Violated | How To Check |
|---|---|---|---|
| Independence | One person’s score doesn’t influence another’s | Confidence intervals and p-values are wrong | Durbin-Watson test |
| Additivity | Predictors aren’t measuring the same thing | Can’t tell which predictor is important | Correlation matrix, VIF |
| Linearity | Relationships form straight lines, not curves | Model predictions are biased | Q-Q plots, residual plots |
| Normality | Sampling distribution is bell-shaped | p-values are unreliable (especially in small samples) | Histograms, Q-Q plots, Shapiro-Wilk |
| Homogeneity/Homoscedasticity | Variance is equal across groups/conditions | Type I error rate increases | Levene’s test, residual plots |
Quick reference guide:
- Row 1 (Green): what GOOD looks like for each assumption
- Row 2 (Red): what BAD looks like - violations to watch for
- Row 3: Homogeneity/Homoscedasticity examples (both good and bad)
Use this as a visual checklist when examining your own data!
What is a fake regression? We create a regression model not to test a hypothesis, but as a diagnostic tool to check assumptions [1]. This works because the residuals from any regression carry the diagnostic information we need.
Why “fake”? We’re not interested in the regression coefficients or p-values. We only care about the diagnostic information it provides [1].
Key insight: Residuals tell us how wrong our predictions are. If assumptions are met, residuals should be random and well-behaved [1].
Field et al. [1] describe several types of residuals. Understanding the differences helps you choose the right diagnostic:
Key differences [1]:
| Type | Formula | Interpretation | Threshold | Best For |
|---|---|---|---|---|
| Raw Residual | e = y - ŷ | Basic error in original units | No standard threshold | Understanding raw errors |
| Standardized Residual | e / SD(e) | Error in standard deviations | |z| > 2 or 3 | Quick screening |
| Studentized Residual | e / SD(e without case i) | Error accounting for case’s leverage | |t| > 2 or 3 | Outlier detection |
| Leverage | Diagonal of hat matrix | Potential to influence predictions | > 2(k+1)/n or 3(k+1)/n | Identifying unusual X |
| Cook’s Distance | Combines residual & leverage | Actual influence on model | > 1 (concern), > 4/n (investigate) | Overall influence assessment |
In this tutorial, we use studentized residuals because they are the most appropriate for assumption checking [1].
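The residual types above map directly onto base-R helper functions. A minimal sketch, using R's built-in mtcars data purely for illustration (not the tutorial dataset):

```r
# Sketch: computing each residual type from the table above.
# mtcars and the mpg ~ wt + hp model are illustrative choices only.
fit <- lm(mpg ~ wt + hp, data = mtcars)

raw_res  <- resid(fit)          # e = y - y-hat, in original units
std_res  <- rstandard(fit)      # residual divided by its estimated SD
stud_res <- rstudent(fit)       # leave-one-out (studentized) version
lev      <- hatvalues(fit)      # leverage: diagonal of the hat matrix
cooks    <- cooks.distance(fit) # combines residual size and leverage

# Flag cases using the thresholds from the table
k <- length(coef(fit)) - 1      # number of predictors
n <- nobs(fit)
flagged <- which(abs(stud_res) > 2 | lev > 2 * (k + 1) / n | cooks > 4 / n)
```

Any case that trips more than one of these flags deserves a closer look before you decide whether to keep it.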
| Component | Real Regression | Fake Regression |
|---|---|---|
| Y (Outcome) | Your actual outcome variable | Random chi-square variable |
| X (Predictors) | Your actual predictors | All variables in your dataset |
| Purpose | Test hypotheses about relationships | Generate diagnostics for assumption checking |
| What We Examine | Coefficients, p-values, R² | Residuals, Q-Q plots, residual plots |
We’ll use the same dataset from Week 06, picking up where we left off after removing outliers.
# Import the dataset
# Note: Ensure "data/data_screening.csv" exists in your working directory
if(file.exists("data/data_screening.csv")) {
master <- import("data/data_screening.csv")
} else {
# Simulated fallback for tutorial purposes if the file is missing
# (note: this fallback omits Age, Health, and RS11-RS14, so later output will differ)
set.seed(seeds[2])
n <- 100
master <- data.frame(
Sex = sample(c(1, 2), n, replace = TRUE),
SES = sample(c(1, 2, 3), n, replace = TRUE),
Grade = sample(9:12, n, replace = TRUE),
Absences = sample(0:15, n, replace = TRUE),
RS1 = sample(1:7, n, replace = TRUE),
RS2 = sample(1:7, n, replace = TRUE),
RS3 = sample(1:7, n, replace = TRUE),
RS4 = sample(1:7, n, replace = TRUE),
RS5 = sample(1:7, n, replace = TRUE),
RS6 = sample(1:7, n, replace = TRUE),
RS7 = sample(1:7, n, replace = TRUE),
RS8 = sample(1:7, n, replace = TRUE),
RS9 = sample(1:7, n, replace = TRUE),
RS10 = sample(1:7, n, replace = TRUE)
)
# Introduce some missingness
master[sample(1:n, 5), "RS1"] <- NA
master[sample(1:n, 3), "RS5"] <- NA
}
str(master)
## 'data.frame': 137 obs. of 20 variables:
## $ Sex : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : int 17 16 13 15 16 15 12 14 13 18 ...
## $ SES : int 2 3 2 3 2 3 3 3 1 3 ...
## $ Grade : int 11 7 5 6 11 7 4 6 4 9 ...
## $ Absences: int 2 2 2 2 6 2 1 2 2 5 ...
## $ RS1 : int 6 7 5 5 7 7 1 2 6 7 ...
## $ RS2 : int 4 1 NA 5 4 7 6 3 6 6 ...
## $ RS3 : int 2 1 5 5 4 7 NA 2 7 5 ...
## $ RS4 : int 2 5 7 7 7 7 1 3 6 NA ...
## $ RS5 : int 4 7 4 5 4 7 4 2 1 6 ...
## $ RS6 : int 7 7 6 6 4 7 7 3 4 6 ...
## $ RS7 : int 7 7 7 5 7 7 7 3 6 10 ...
## $ RS8 : int 4 7 6 7 7 7 NA 3 6 7 ...
## $ RS9 : int 5 4 6 6 4 7 NA 2 2 7 ...
## $ RS10 : int 7 7 7 7 7 7 4 2 5 7 ...
## $ RS11 : int 4 1 NA 6 7 7 3 3 6 5 ...
## $ RS12 : int 7 7 5 6 4 7 7 3 6 6 ...
## $ RS13 : int 4 4 7 6 7 7 7 2 6 6 ...
## $ RS14 : int 7 7 6 6 4 7 7 3 2 6 ...
## $ Health : int 6 6 2 2 6 4 3 6 1 5 ...
Let’s quickly apply the Week 06 screening steps to prepare our data.
# Step 1: Fix categorical variables
notypos <- master
notypos$Sex <- factor(notypos$Sex,
levels = c(1,2),
labels = c("Women", "Men"))
notypos$SES <- factor(notypos$SES,
levels = c(1,2, 3),
labels = c("Low", "Medium", "High"))
# Step 2: Fix out-of-range values (if any)
notypos$Grade[ notypos$Grade > 15 ] <- NA
rs_cols <- grep("^RS", names(notypos))
notypos[ , rs_cols][ notypos[ , rs_cols] > 7 ] <- NA
# Step 3: Handle missing data
percentmiss <- function(x){ sum(is.na(x))/length(x) * 100 }
missing <- apply(notypos, 1, percentmiss)
replace_rows <- subset(notypos, missing <= 5)
# Step 4: Multiple imputation
temp_no_miss <- mice(replace_rows, printFlag = FALSE, seed = seeds[3])
nomiss <- complete(temp_no_miss, 1)
all_columns <- nomiss
# Step 5: Remove multivariate outliers
cont_vars_cols <- c("Grade", "Absences", paste0("RS", 1:10))
cont_vars <- all_columns[, cont_vars_cols]
mahal <- mahalanobis(cont_vars,
colMeans(cont_vars, na.rm=TRUE),
cov(cont_vars, use ="pairwise.complete.obs"))
cutoff <- qchisq(1-.001, ncol(cont_vars))
noout <- subset(all_columns, mahal < cutoff)
cat("Final sample size after screening:", nrow(noout), "\n")
## Final sample size after screening: 123
Interpretation: We now have a clean dataset with no typos, no missing values, and no extreme outliers. We’re ready to check assumptions!
Before we check assumptions, let’s understand what makes a case “influential” [1]:
Key concepts from Field et al. [1]:
Key takeaways [1]:
- Outliers have large residuals but may not influence the model much
- High leverage points have unusual X values and the potential to influence
- Influential cases combine both: unusual X + large residual = changes the model substantially
- Cook's Distance > 1 indicates a case that substantially influences the regression [1]
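A small simulated sketch makes the distinction concrete: one added case with an unusual X value and a large residual dominates Cook's Distance (the data, seed, and variable names here are illustrative, not from the tutorial dataset):

```r
# Sketch: an influential case = unusual X value + large residual.
# (Simulated data for illustration only.)
set.seed(3)
x <- rnorm(30)
y <- x + rnorm(30, sd = 0.5)

# Add one point far out in X that also breaks the trend
x <- c(x, 6)
y <- c(y, -4)

fit <- lm(y ~ x)
d <- cooks.distance(fit)
which.max(d)   # the added case (observation 31)
max(d)         # typically far above the D > 1 concern threshold
```

Try removing observation 31 and refitting: the slope changes substantially, which is exactly what "influential" means.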
# Create a comprehensive data overview
par(mfrow = c(2, 3))
# 1. Sample size by groups
sex_counts <- table(noout$Sex)
barplot(sex_counts, col = c("lightblue", "lightpink"),
main = "Sample Size by Sex",
ylab = "Count", ylim = c(0, max(sex_counts) * 1.2))
text(x = 1:length(sex_counts), y = sex_counts + max(sex_counts) * 0.05,
labels = sex_counts, font = 2)
ses_counts <- table(noout$SES)
barplot(ses_counts, col = c("lightcoral", "lightyellow", "lightgreen"),
main = "Sample Size by SES",
ylab = "Count", ylim = c(0, max(ses_counts) * 1.2))
text(x = 1:length(ses_counts), y = ses_counts + max(ses_counts) * 0.05,
labels = ses_counts, font = 2)
# 2. Distribution of continuous variables
hist(noout$Grade, breaks = 10, col = "steelblue", border = "white",
main = "Distribution: Grade",
xlab = "Grade Level", ylab = "Frequency")
hist(noout$Absences, breaks = 10, col = "orange", border = "white",
main = "Distribution: Absences",
xlab = "Number of Absences", ylab = "Frequency")
# 3. Example resilience scores
hist(noout$RS1, breaks = 7, col = "purple", border = "white",
main = "Distribution: Resilience Score 1",
xlab = "RS1 Score (1-7)", ylab = "Frequency")
# 4. Summary statistics text
plot.new()
text(0.5, 0.9, "Dataset Summary", font = 2, cex = 1.5)
text(0.1, 0.75, sprintf("Total N: %d", nrow(noout)), adj = 0, cex = 1.2)
text(0.1, 0.65, sprintf("Variables: %d", ncol(noout)), adj = 0, cex = 1.2)
text(0.1, 0.55, sprintf("Continuous: %d", length(grep("^RS|Grade|Absences", names(noout)))), adj = 0, cex = 1.2)
text(0.1, 0.45, sprintf("Categorical: 2 (Sex, SES)"), adj = 0, cex = 1.2)
text(0.1, 0.30, "Ready for assumption testing!", adj = 0, cex = 1.3, col = "darkgreen", font = 2)

Definition [1]: The independence assumption states that the errors (residuals) in your model should not be related to each other.
Plain English Example: Imagine you’re measuring test scores from different students. Independence means that one student’s score shouldn’t influence another student’s score. If students copied from each other, their scores would be related (not independent), which would violate this assumption [1].
When is independence violated? Common cases include repeated measurements from the same person, participants who interact with one another (e.g., students in the same classroom), and observations collected in a time-ordered sequence.
The Durbin-Watson test checks whether adjacent residuals (errors) are correlated [1].
Test statistic interpretation:
- DW ≈ 2: residuals are uncorrelated (GOOD!)
- DW < 1 or DW > 3: serious concern about independence [1]
- DW < 2: positive autocorrelation (adjacent errors are similar)
- DW > 2: negative autocorrelation (adjacent errors alternate)
Interpretation:
- Top row: time series plots showing how residuals change over time
- Bottom row: ACF (Autocorrelation Function) plots - bars outside the blue lines indicate significant correlation
- Independent residuals: random pattern, no correlation between adjacent points
- Positive autocorrelation: smooth waves - if one error is positive, the next tends to be positive too
- Negative autocorrelation: zigzag pattern - errors alternate between positive and negative
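A quick simulation shows how autocorrelated errors pull the Durbin-Watson statistic away from 2 (the seed, sample size, and AR coefficient below are arbitrary illustrative choices):

```r
# Sketch: Durbin-Watson under independent vs. AR(1) errors.
library(car)
set.seed(123)
n <- 200
x <- rnorm(n)

y_indep <- 0.5 * x + rnorm(n)                                   # independent errors
y_ar1   <- 0.5 * x + as.numeric(arima.sim(list(ar = 0.8), n))   # positively autocorrelated

durbinWatsonTest(lm(y_indep ~ x))$dw   # close to 2: no autocorrelation
durbinWatsonTest(lm(y_ar1 ~ x))$dw     # well below 2: positive autocorrelation
```

With an AR(1) coefficient of 0.8, DW lands near 2(1 - 0.8) = 0.4, matching the "serious concern" zone in the table below.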
# Create a random outcome variable (chi-square distributed)
# Why chi-square? Because errors should have lots of small values, few large ones
random <- rchisq(nrow(noout), 7) # df=7 works well for most cases
# Create the fake regression
# Y ~ . means "predict random from ALL other variables"
fake <- lm(random ~ ., data = noout)
# Extract studentized residuals (rstudent accounts for each case's leverage)
standardized <- rstudent(fake)
# Standardize the fitted values (mean = 0, sd = 1)
fitvalues <- scale(fake$fitted.values)
cat("Fake regression created successfully!\n")
## Fake regression created successfully!
## Number of observations: 123
## Number of predictors: 21
Code explanation:
- rchisq(n, df): generates random numbers from a chi-square distribution
- lm(random ~ .): creates a linear model predicting random from all other variables
- rstudent(): calculates studentized residuals (standardized, accounting for leverage)
- scale(): standardizes the fitted values to mean = 0, sd = 1
# Run the Durbin-Watson test on the fake regression (car package)
dw_result <- durbinWatsonTest(fake)
dw_result
## lag Autocorrelation D-W Statistic p-value
## 1 -0.1031872 2.187277 0.48
## Alternative hypothesis: rho != 0
# Interpretation helper
dw_stat <- dw_result$dw
if(dw_stat >= 1 && dw_stat <= 3) {
cat("\n✓ GOOD: Durbin-Watson statistic is", round(dw_stat, 3),
"\n This is within acceptable range (1-3).")
cat("\n Independence assumption is likely met.\n")
} else {
cat("\n✗ WARNING: Durbin-Watson statistic is", round(dw_stat, 3),
"\n This is outside acceptable range (1-3).")
cat("\n Independence assumption may be violated.\n")
}
##
## ✓ GOOD: Durbin-Watson statistic is 2.187
## This is within acceptable range (1-3).
## Independence assumption is likely met.
Interpretation guide:
| DW Value | Interpretation | Action |
|---|---|---|
| < 1 | Strong positive autocorrelation - CONCERN | Consider time-series models or mixed models |
| 1 to 1.5 | Mild positive autocorrelation - Monitor | Check data collection order; may be acceptable |
| 1.5 to 2.5 | No autocorrelation - GOOD | Proceed with standard analyses |
| 2.5 to 3 | Mild negative autocorrelation - Monitor | Check data collection order; may be acceptable |
| > 3 | Strong negative autocorrelation - CONCERN | Consider time-series models or mixed models |
Definition [1]: If you have several predictors, their combined effect should be best described by adding their effects together.
Plain English Example: Think of additivity like ingredients in a recipe. If you’re predicting cake quality from flour and sugar, each ingredient should contribute its own unique effect. If flour and sugar were essentially measuring the same thing (like two different brands of the same ingredient), you’d be counting the same effect twice [2].
The problem: Multicollinearity
When predictors are too highly correlated (r > .80 or .90), we have multicollinearity [2]. This causes:
Key insight: When predictors are highly correlated (right plot), they’re essentially measuring the same thing. Including both doesn’t add much new information and makes it hard to determine which one is actually important [2].
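The double-counting problem is easy to demonstrate: when one predictor is nearly a copy of another, both carry inflated VIF values (simulated data; the names x1/x2/x3 are illustrative):

```r
# Sketch: two near-duplicate predictors inflate each other's VIF.
library(car)
set.seed(42)
n  <- 150
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # x2 is almost a copy of x1 (r close to 1)
x3 <- rnorm(n)                 # an unrelated predictor
y  <- x1 + x3 + rnorm(n)

vif(lm(y ~ x1 + x2 + x3))
# x1 and x2 show very large VIFs; x3 stays near 1
```

Dropping either x1 or x2 (or averaging them into a composite) brings all VIF values back toward 1 without losing real information.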
# Select only continuous variables for correlation check
cont_vars_for_cor <- noout %>%
select(Grade, Absences, starts_with("RS"))
# Calculate correlation matrix
cor_matrix <- cor(cont_vars_for_cor)
# Display correlation matrix
cor_matrix %>%
kable(caption = "Correlation Matrix of Continuous Variables", digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE, font_size = 11) %>%
scroll_box(width = "100%", height = "400px")

| | Grade | Absences | RS1 | RS2 | RS3 | RS4 | RS5 | RS6 | RS7 | RS8 | RS9 | RS10 | RS11 | RS12 | RS13 | RS14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Grade | 1.000 | -0.087 | 0.116 | 0.047 | 0.118 | -0.007 | 0.114 | 0.185 | 0.228 | 0.210 | 0.022 | 0.184 | 0.082 | 0.074 | 0.106 | 0.178 |
| Absences | -0.087 | 1.000 | 0.070 | 0.152 | 0.234 | 0.115 | 0.102 | 0.020 | 0.181 | 0.074 | -0.071 | 0.146 | 0.214 | 0.028 | 0.161 | -0.036 |
| RS1 | 0.116 | 0.070 | 1.000 | 0.289 | 0.386 | 0.309 | 0.287 | 0.276 | 0.496 | 0.290 | 0.324 | 0.410 | 0.310 | 0.317 | 0.296 | 0.211 |
| RS2 | 0.047 | 0.152 | 0.289 | 1.000 | 0.586 | 0.405 | 0.270 | 0.538 | 0.587 | 0.444 | 0.523 | 0.652 | 0.633 | 0.322 | 0.637 | 0.337 |
| RS3 | 0.118 | 0.234 | 0.386 | 0.586 | 1.000 | 0.533 | 0.430 | 0.419 | 0.465 | 0.439 | 0.330 | 0.429 | 0.536 | 0.329 | 0.612 | 0.392 |
| RS4 | -0.007 | 0.115 | 0.309 | 0.405 | 0.533 | 1.000 | 0.329 | 0.244 | 0.279 | 0.191 | 0.332 | 0.443 | 0.390 | 0.109 | 0.393 | 0.213 |
| RS5 | 0.114 | 0.102 | 0.287 | 0.270 | 0.430 | 0.329 | 1.000 | 0.372 | 0.335 | 0.361 | 0.395 | 0.391 | 0.452 | 0.478 | 0.401 | 0.543 |
| RS6 | 0.185 | 0.020 | 0.276 | 0.538 | 0.419 | 0.244 | 0.372 | 1.000 | 0.603 | 0.460 | 0.581 | 0.499 | 0.584 | 0.458 | 0.613 | 0.532 |
| RS7 | 0.228 | 0.181 | 0.496 | 0.587 | 0.465 | 0.279 | 0.335 | 0.603 | 1.000 | 0.512 | 0.482 | 0.613 | 0.541 | 0.436 | 0.545 | 0.438 |
| RS8 | 0.210 | 0.074 | 0.290 | 0.444 | 0.439 | 0.191 | 0.361 | 0.460 | 0.512 | 1.000 | 0.479 | 0.492 | 0.472 | 0.512 | 0.571 | 0.412 |
| RS9 | 0.022 | -0.071 | 0.324 | 0.523 | 0.330 | 0.332 | 0.395 | 0.581 | 0.482 | 0.479 | 1.000 | 0.647 | 0.626 | 0.513 | 0.556 | 0.549 |
| RS10 | 0.184 | 0.146 | 0.410 | 0.652 | 0.429 | 0.443 | 0.391 | 0.499 | 0.613 | 0.492 | 0.647 | 1.000 | 0.636 | 0.452 | 0.576 | 0.492 |
| RS11 | 0.082 | 0.214 | 0.310 | 0.633 | 0.536 | 0.390 | 0.452 | 0.584 | 0.541 | 0.472 | 0.626 | 0.636 | 1.000 | 0.497 | 0.740 | 0.380 |
| RS12 | 0.074 | 0.028 | 0.317 | 0.322 | 0.329 | 0.109 | 0.478 | 0.458 | 0.436 | 0.512 | 0.513 | 0.452 | 0.497 | 1.000 | 0.567 | 0.529 |
| RS13 | 0.106 | 0.161 | 0.296 | 0.637 | 0.612 | 0.393 | 0.401 | 0.613 | 0.545 | 0.571 | 0.556 | 0.576 | 0.740 | 0.567 | 1.000 | 0.539 |
| RS14 | 0.178 | -0.036 | 0.211 | 0.337 | 0.392 | 0.213 | 0.543 | 0.532 | 0.438 | 0.412 | 0.549 | 0.492 | 0.380 | 0.529 | 0.539 | 1.000 |
# Create a correlation plot
corrplot(cor_matrix,
method = "color",
type = "upper",
order = "hclust",
addCoef.col = "black",
number.cex = 0.7,
tl.col = "black",
tl.srt = 45,
title = "Correlation Matrix: Checking for Multicollinearity",
mar = c(0,0,2,0))
# Add interpretation guide
legend("bottomright",
legend = c("r > 0.90: CONCERN", "r > 0.80: Monitor", "r < 0.80: OK"),
fill = c("darkred", "orange", "lightblue"),
cex = 0.8)

Interpretation:
- Dark blue/red: strong correlations
- Light colors: weak correlations
- Look for: any correlations > 0.80 or 0.90 [2]
# Find high correlations (excluding diagonal)
high_cor <- which(abs(cor_matrix) > 0.80 & abs(cor_matrix) < 1, arr.ind = TRUE)
if(nrow(high_cor) > 0) {
cat("⚠ WARNING: Found", nrow(high_cor)/2, "pairs with correlations > 0.80:\n\n")
for(i in 1:nrow(high_cor)) {
if(high_cor[i,1] < high_cor[i,2]) { # Avoid duplicates
var1 <- rownames(cor_matrix)[high_cor[i,1]]
var2 <- colnames(cor_matrix)[high_cor[i,2]]
cor_val <- cor_matrix[high_cor[i,1], high_cor[i,2]]
cat(sprintf(" %s <-> %s: r = %.3f\n", var1, var2, cor_val))
}
}
cat("\nConsider combining these variables or selecting one from each pair.\n")
} else {
cat("✓ GOOD: No correlations exceed 0.80.\n")
cat(" Multicollinearity is not a concern based on correlations.\n")
}
## ✓ GOOD: No correlations exceed 0.80.
## Multicollinearity is not a concern based on correlations.
What is VIF? The Variance Inflation Factor measures how much the variance of a regression coefficient is inflated due to multicollinearity [2].
Interpretation:
- VIF < 10: generally acceptable [2]
- VIF > 10: serious multicollinearity problem [2]
- Average VIF substantially greater than 1: possible bias in the model [2]
# Calculate VIF for each predictor
# Note: VIF requires a real regression, so we use our fake regression
# VIF can only be calculated for numeric predictors, so we need to handle factors
vif_values <- vif(fake)
# Handle the case where VIF returns a matrix (for factors with multiple levels)
if(is.matrix(vif_values)) {
# Extract GVIF^(1/(2*Df)) for factors, regular VIF for numeric
vif_clean <- vif_values[, "GVIF^(1/(2*Df))"]
} else {
vif_clean <- vif_values
}
# Create a nice table
vif_df <- data.frame(
Variable = names(vif_clean),
VIF = as.numeric(vif_clean),
Status = ifelse(as.numeric(vif_clean) > 10, "⚠ CONCERN",
ifelse(as.numeric(vif_clean) > 5, "Monitor", "✓ OK"))
)
vif_df %>%
arrange(desc(VIF)) %>%
kable(caption = "Variance Inflation Factors (VIF)", digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(which(vif_df$VIF > 10), bold = TRUE, color = "white", background = "darkred") %>%
row_spec(which(vif_df$VIF > 5 & vif_df$VIF <= 10), bold = TRUE, background = "orange")

| Variable | VIF | Status |
|---|---|---|
| RS13 | 2.052 | ✓ OK |
| RS11 | 1.934 | ✓ OK |
| RS10 | 1.822 | ✓ OK |
| RS2 | 1.769 | ✓ OK |
| RS9 | 1.733 | ✓ OK |
| RS3 | 1.732 | ✓ OK |
| RS7 | 1.729 | ✓ OK |
| RS6 | 1.604 | ✓ OK |
| RS14 | 1.571 | ✓ OK |
| RS12 | 1.466 | ✓ OK |
| RS4 | 1.446 | ✓ OK |
| RS5 | 1.422 | ✓ OK |
| Age | 1.396 | ✓ OK |
| RS8 | 1.393 | ✓ OK |
| Grade | 1.306 | ✓ OK |
| RS1 | 1.263 | ✓ OK |
| Absences | 1.234 | ✓ OK |
| SES | 1.217 | ✓ OK |
| Sex | 1.131 | ✓ OK |
| Health | 1.083 | ✓ OK |
# Overall assessment
avg_vif <- mean(as.numeric(vif_clean), na.rm = TRUE)
cat("\nAverage VIF:", round(avg_vif, 3), "\n")
##
## Average VIF: 1.515
if(max(as.numeric(vif_clean), na.rm = TRUE) < 10 && avg_vif < 2) {
cat("✓ GOOD: All VIF values < 10 and average VIF < 2.\n")
cat(" Multicollinearity is not a concern.\n")
} else if(max(as.numeric(vif_clean), na.rm = TRUE) >= 10) {
cat("⚠ WARNING: Some VIF values ≥ 10.\n")
cat(" Consider removing or combining highly correlated predictors.\n")
} else {
cat("⚠ CAUTION: Average VIF > 2.\n")
cat(" Multicollinearity may be biasing the model.\n")
}
## ✓ GOOD: All VIF values < 10 and average VIF < 2.
## Multicollinearity is not a concern.
# Create a bar plot of VIF values
ggplot(vif_df, aes(x = reorder(Variable, VIF), y = VIF, fill = Status)) +
geom_bar(stat = "identity", color = "black") +
geom_hline(yintercept = 10, linetype = "dashed", color = "red", linewidth = 1) +
geom_hline(yintercept = 5, linetype = "dashed", color = "orange", linewidth = 1) +
annotate("text", x = 1, y = 10.5, label = "VIF = 10 (Concern)", hjust = 0, color = "red") +
annotate("text", x = 1, y = 5.5, label = "VIF = 5 (Monitor)", hjust = 0, color = "orange") +
coord_flip() +
labs(title = "Variance Inflation Factors (VIF)",
subtitle = "Lower is better; VIF > 10 indicates multicollinearity",
x = "Variable",
y = "VIF Value") +
scale_fill_manual(values = c("✓ OK" = "lightgreen",
"Monitor" = "orange",
"⚠ CONCERN" = "darkred")) +
theme_minimal() +
theme(legend.position = "bottom")

Definition [3]: The linearity assumption states that the relationship between variables is linear (a straight line) and not curved.
Plain English Example: Imagine plotting height vs. weight. Linearity means the relationship forms a straight line (or close to it), not a curve. If taller people get heavier at a steady rate, that’s linear. If the relationship curves (like exponential growth), that violates linearity [3].
Why it matters: Most parametric statistics (ANOVA, regression, t-tests) assume linear relationships. If relationships are curved, our models will make biased predictions [3].
Key insight:
- Top row: original data with fitted lines. The linear model (red line) fits well only for linear data.
- Bottom row: residual plots reveal problems. Random scatter = good; patterns = bad [3].
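A numeric version of the same lesson: fit a straight line to data whose true relationship is curved, and the line explains almost nothing (simulated data for illustration; the quadratic form is an arbitrary choice):

```r
# Sketch: a curved relationship that a straight line misses.
set.seed(8)
x <- runif(200, -3, 3)
y <- x^2 + rnorm(200)                  # the true relationship is quadratic

summary(lm(y ~ x))$r.squared           # near 0: the line misses the curve
summary(lm(y ~ x + I(x^2)))$r.squared  # much larger: the quadratic term captures it
```

If a residual plot shows a curve, adding a polynomial term (or transforming a variable) is one common remedy.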
What is a Q-Q plot? QQ stands for “quantile-quantile.” This plot compares your data to what we’d expect from a perfect normal distribution [5]. If the dots follow the diagonal line closely, your data are normally distributed [5].
# Create Q-Q plot
qqnorm(standardized, main = "Q-Q Plot: Assessing Normality of Residuals")
qqline(standardized, col = "red", lwd = 2)
# Add interpretation zones
abline(h = c(-2, 2), lty = 2, col = "blue")
text(-2, 2.5, "Most data should\nfall between -2 and +2", col = "blue", cex = 0.9)

How to interpret: points should hug the red diagonal line, and roughly 95% of standardized residuals should fall between -2 and +2. Curves or S-shapes at the tails signal departures from normality.
# Calculate how many points fall outside ±2
outside_range <- sum(abs(standardized) > 2, na.rm = TRUE)
percent_outside <- (outside_range / length(standardized)) * 100
cat("Standardized residuals outside ±2:", outside_range,
sprintf("(%.1f%%)\n", percent_outside))
## Standardized residuals outside ±2: 4 (3.3%)
if(percent_outside <= 5) {
cat("✓ GOOD: ≤5% of residuals fall outside ±2.\n")
cat(" This is expected for normally distributed data.\n")
} else {
cat("⚠ WARNING: >5% of residuals fall outside ±2.\n")
cat(" Normality assumption may be violated.\n")
}
## ✓ GOOD: ≤5% of residuals fall outside ±2.
## This is expected for normally distributed data.
# Create a data frame for plotting
qq_data <- data.frame(
Theoretical = qqnorm(standardized, plot.it = FALSE)$x,
Sample = qqnorm(standardized, plot.it = FALSE)$y,
Outlier = abs(standardized) > 2
)
# Create enhanced Q-Q plot
ggplot(qq_data, aes(x = Theoretical, y = Sample)) +
geom_point(aes(color = Outlier), size = 2, alpha = 0.6) +
geom_abline(intercept = 0, slope = 1, color = "red", linewidth = 1) +
geom_hline(yintercept = c(-2, 2), linetype = "dashed", color = "blue") +
geom_vline(xintercept = c(-2, 2), linetype = "dashed", color = "blue") +
scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red"),
labels = c("Within ±2", "Outside ±2")) +
labs(title = "Q-Q Plot: Normality of Standardized Residuals",
subtitle = "Points should fall on the red diagonal line",
x = "Theoretical Quantiles",
y = "Sample Quantiles (Standardized Residuals)",
color = "Status") +
theme_minimal() +
theme(legend.position = "bottom")

What are residual plots? They plot the predicted values (X-axis) against the residuals/errors (Y-axis). If the model is good, residuals should be randomly scattered around zero [3].
# Create residual plot
plot(fitvalues, standardized,
main = "Residual Plot: Checking Linearity",
xlab = "Standardized Fitted Values",
ylab = "Standardized Residuals",
pch = 19, col = rgb(0, 0, 0, 0.5))
abline(h = 0, col = "red", lwd = 2)
abline(v = 0, col = "red", lwd = 2)
abline(h = c(-2, 2), lty = 2, col = "blue")
# Add interpretation text
text(min(fitvalues, na.rm=TRUE), 3,
"Look for: Random scatter\nNo curves or patterns",
adj = 0, col = "darkgreen", cex = 0.9)

What to look for: a random cloud of points centered on zero, with no curves, funnels, or other systematic patterns. A curved band suggests nonlinearity; a funnel shape suggests heteroscedasticity [3].
Common misconception: “My data need to be normally distributed.”
The truth [6]: The assumption is that the sampling distribution is normally distributed, not necessarily your raw data [6].
Plain English: We don’t need every individual score to be perfectly normal. What matters is that if we took many samples and calculated their means, those means would form a normal distribution [6].
The Central Limit Theorem [6]: In large samples (N > 30), the sampling distribution tends to be normal anyway, regardless of how the original data look [6].
Practical implication: With large samples, normality is less of a concern. Focus on outliers instead [6].
The magic of the Central Limit Theorem:
- Left: original population is highly skewed (NOT normal)
- Middle: with small samples (n=5), means are still somewhat skewed
- Right: with n ≥ 30, sample means become normally distributed!
Conclusion: Even if your raw data are skewed, your statistical tests can still be valid if your sample size is large enough [6].
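You can verify the Central Limit Theorem yourself with a short simulation (population size, sample sizes, and seed below are arbitrary illustrative choices):

```r
# Sketch: sample means of a skewed population approach normality as n grows.
library(moments)
set.seed(99)
population <- rchisq(1e5, df = 2)        # strongly right-skewed (skewness = 2)

means_n5  <- replicate(2000, mean(sample(population, 5)))
means_n50 <- replicate(2000, mean(sample(population, 50)))

skewness(population)   # large positive skew
skewness(means_n5)     # reduced, but still noticeably skewed
skewness(means_n50)    # close to 0: approximately normal
```

Theory predicts the skewness of the mean shrinks by a factor of 1/sqrt(n), which is exactly the pattern the three numbers show.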
cat("Sample size:", nrow(noout), "\n")
## Sample size: 123
if(nrow(noout) >= 30) {
cat("✓ GOOD: Sample size ≥ 30.\n")
cat(" Central Limit Theorem applies - sampling distribution should be normal.\n")
cat(" Normality of raw data is less critical.\n")
} else {
cat("⚠ CAUTION: Sample size < 30.\n")
cat(" Normality of raw data is more important in small samples.\n")
}
## ✓ GOOD: Sample size ≥ 30.
## Central Limit Theorem applies - sampling distribution should be normal.
## Normality of raw data is less critical.
# Create histogram of standardized residuals
hist(standardized,
breaks = 15,
main = "Histogram of Standardized Residuals",
xlab = "Standardized Residuals",
ylab = "Frequency",
col = "lightblue",
border = "black")
# Overlay normal curve
x_seq <- seq(min(standardized, na.rm=TRUE),
max(standardized, na.rm=TRUE),
length.out = 100)
y_norm <- dnorm(x_seq, mean = mean(standardized, na.rm=TRUE),
sd = sd(standardized, na.rm=TRUE))
y_norm <- y_norm * length(standardized) * diff(range(standardized, na.rm=TRUE)) / 15
lines(x_seq, y_norm, col = "red", lwd = 2)
legend("topright",
legend = c("Observed", "Normal Distribution"),
fill = c("lightblue", NA),
border = c("black", NA),
lty = c(NA, 1),
col = c(NA, "red"),
lwd = c(NA, 2))

Interpretation:
- Bell-shaped: GOOD! Residuals are normally distributed [5]
- Skewed (lopsided): normality may be violated
- Flat or too pointy: kurtosis issues
Skewness: measures asymmetry of the distribution [5]
- Skewness ≈ 0: symmetric (normal)
- Skewness > 0: positive skew (long right tail)
- Skewness < 0: negative skew (long left tail)
- Rule of thumb: |Skewness| < 2 is acceptable [5]

Kurtosis: measures "peakedness" of the distribution [5]
- Excess Kurtosis ≈ 0: normal peakedness
- Excess Kurtosis > 0: heavy tails (more outliers than normal)
- Excess Kurtosis < 0: light tails (fewer outliers than normal)
- Rule of thumb: |Excess Kurtosis| < 7 is acceptable [5]
# Calculate skewness and kurtosis for each continuous variable
skew_kurt_df <- data.frame(
Variable = names(cont_vars_for_cor),
Skewness = sapply(cont_vars_for_cor, skewness, na.rm = TRUE),
Excess_Kurtosis = sapply(cont_vars_for_cor, function(x) kurtosis(x, na.rm = TRUE) - 3),
Skew_Status = ifelse(abs(sapply(cont_vars_for_cor, skewness, na.rm = TRUE)) < 2,
"✓ OK", "⚠ CONCERN"),
Kurt_Status = ifelse(abs(sapply(cont_vars_for_cor, function(x) kurtosis(x, na.rm = TRUE) - 3)) < 7,
"✓ OK", "⚠ CONCERN")
)
skew_kurt_df %>%
kable(caption = "Skewness and Kurtosis for Continuous Variables", digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(which(skew_kurt_df$Skew_Status == "⚠ CONCERN" |
skew_kurt_df$Kurt_Status == "⚠ CONCERN"),
bold = TRUE, background = "lightyellow")

| Variable | Skewness | Excess_Kurtosis | Skew_Status | Kurt_Status |
|---|---|---|---|---|
| Grade | 0.122 | -0.726 | ✓ OK | ✓ OK |
| Absences | 0.516 | -1.032 | ✓ OK | ✓ OK |
| RS1 | -0.605 | -0.507 | ✓ OK | ✓ OK |
| RS2 | -0.465 | -0.695 | ✓ OK | ✓ OK |
| RS3 | 0.075 | -0.962 | ✓ OK | ✓ OK |
| RS4 | -0.392 | -0.996 | ✓ OK | ✓ OK |
| RS5 | -0.123 | -1.130 | ✓ OK | ✓ OK |
| RS6 | -0.373 | -0.850 | ✓ OK | ✓ OK |
| RS7 | -0.619 | -0.535 | ✓ OK | ✓ OK |
| RS8 | -0.186 | -1.231 | ✓ OK | ✓ OK |
| RS9 | -0.302 | -0.767 | ✓ OK | ✓ OK |
| RS10 | -0.638 | -0.658 | ✓ OK | ✓ OK |
| RS11 | -0.298 | -1.044 | ✓ OK | ✓ OK |
| RS12 | -1.057 | 0.269 | ✓ OK | ✓ OK |
| RS13 | -0.542 | -0.522 | ✓ OK | ✓ OK |
| RS14 | -0.455 | -0.761 | ✓ OK | ✓ OK |
# Create a visual comparison
par(mfrow = c(2, 2))
# Normal distribution
set.seed(seeds[5])
normal_data <- rnorm(1000, 0, 1)
hist(normal_data, main = "Normal (Skew=0, Kurt=0)",
xlab = "Value", col = "lightgreen", border = "black")
# Positive skew
pos_skew <- rexp(1000, 1)
hist(pos_skew, main = "Positive Skew (Skew>0)",
xlab = "Value", col = "lightcoral", border = "black")
# Negative skew
neg_skew <- -rexp(1000, 1)
hist(neg_skew, main = "Negative Skew (Skew<0)",
xlab = "Value", col = "lightcoral", border = "black")
# Heavy tails (high kurtosis)
heavy_tails <- rt(1000, df = 3)
hist(heavy_tails, main = "Heavy Tails (Kurt>0)",
xlab = "Value", col = "lightyellow", border = "black")

What is it? A formal statistical test for normality [5].
Interpretation:
- p > .05: data are consistent with a normal distribution (GOOD)
- p < .05: data significantly differ from normal (CONCERN)
Warning [5]: In large samples, the Shapiro-Wilk test can be significant even for minor deviations from normality. Always interpret alongside visual checks (histograms, Q-Q plots).
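A quick simulation makes the warning concrete: with thousands of observations, even mild contamination of an otherwise normal sample tends to produce a significant result (the mixture below is an arbitrary illustrative choice; shapiro.test accepts at most 5000 values):

```r
# Sketch: shapiro.test() sensitivity at large N.
set.seed(11)
mostly_normal <- c(rnorm(4800), rt(200, df = 3))  # 4% heavy-tailed contamination

shapiro.test(mostly_normal)$p.value
# Often significant (p < .05), even though a histogram of these data
# would look essentially normal -- hence the advice to check plots too.
```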
# Shapiro-Wilk test on standardized residuals
shapiro_result <- shapiro.test(standardized)
print(shapiro_result)
##
## Shapiro-Wilk normality test
##
## data: standardized
## W = 0.95979, p-value = 0.001027
# Interpretation
if(shapiro_result$p.value > 0.05) {
cat("\n✓ GOOD: Shapiro-Wilk p-value =", round(shapiro_result$p.value, 4), "\n")
cat(" Residuals are consistent with a normal distribution.\n")
} else {
cat("\n⚠ WARNING: Shapiro-Wilk p-value =", round(shapiro_result$p.value, 4), "\n")
cat(" Residuals significantly differ from normal distribution.\n")
if(nrow(noout) >= 30) {
cat(" However, with N ≥ 30, Central Limit Theorem applies.\n")
cat(" Check visual plots (Q-Q, histogram) for practical significance.\n")
}
}
##
## ⚠ WARNING: Shapiro-Wilk p-value = 0.001
## Residuals significantly differ from normal distribution.
## However, with N ≥ 30, Central Limit Theorem applies.
## Check visual plots (Q-Q, histogram) for practical significance.
Definition: The assumption that variances are roughly equal across groups [7].
Plain English Example: Imagine comparing test scores from three classes. Homogeneity of variance means that the spread of scores (how much they vary) should be similar in all three classes. If one class has scores all between 80-90 (small variance) and another has scores between 40-100 (large variance), this assumption is violated [7].
Definition: The variance of one variable is roughly the same across all values of another variable [8].
Plain English Example: Imagine predicting income from years of education. Homoscedasticity means the variability in income should be roughly the same whether someone has 12 years or 20 years of education. If income becomes much more variable at higher education levels (some people make a lot, some don’t), that’s heteroscedasticity (violation of this assumption) [8].
Key insight: - Top row: Original data with green/red boxes showing variance at different X values - Bottom row: Residual plots make patterns easier to see - Homoscedastic: Variance boxes are similar height; residuals evenly scattered - Heteroscedastic: Variance boxes differ in height; residuals show funnel or bow tie patterns [8]
| Aspect | Homogeneity | Homoscedasticity |
|---|---|---|
| Context | Categorical groups (ANOVA, t-test) | Continuous predictor (Regression) |
| What it checks | Equal variance across groups | Equal variance across predictor range |
| Visualization | Boxplots by group | Residual plot (fitted vs. residuals) |
| Test | Levene’s test | Visual inspection of residual plot |
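The contrast in this table is easy to simulate. The sketch below builds one homoscedastic and one heteroscedastic outcome, then detects the funnel with a simplified Breusch-Pagan-style auxiliary regression (squared residuals regressed on the predictor); all variable names here are hypothetical:

```r
set.seed(99)
x <- runif(300, 1, 10)

# Homoscedastic: constant error SD across all x
y_homo   <- 2 + 0.5 * x + rnorm(300, sd = 1)

# Heteroscedastic: error SD grows with x (the classic funnel)
y_hetero <- 2 + 0.5 * x + rnorm(300, sd = 0.3 * x)

# Simplified Breusch-Pagan idea: regress squared residuals on x;
# a significant slope means the error variance depends on x
bp_slope_p <- function(fit, x) {
  summary(lm(residuals(fit)^2 ~ x))$coefficients["x", "Pr(>|t|)"]
}

bp_slope_p(lm(y_homo ~ x), x)     # typically not significant
bp_slope_p(lm(y_hetero ~ x), x)   # very small: variance depends on x
```

The `car` package's `ncvTest()` offers a formal version of the same check if it is already loaded.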
# Create enhanced residual plot
plot_data <- data.frame(
Fitted = as.numeric(fitvalues),
Residuals = standardized,
Outlier = abs(standardized) > 2
)
ggplot(plot_data, aes(x = Fitted, y = Residuals)) +
geom_point(aes(color = Outlier), size = 2, alpha = 0.6) +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_vline(xintercept = 0, color = "red", linewidth = 1) +
geom_hline(yintercept = c(-2, 2), linetype = "dashed", color = "blue") +
geom_smooth(method = "loess", se = TRUE, color = "darkgreen", linewidth = 1) +
scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red"),
labels = c("Within ±2", "Outside ±2")) +
labs(title = "Residual Plot: Checking Homoscedasticity",
subtitle = "Points should be randomly scattered with equal spread across X-axis",
x = "Standardized Fitted Values",
y = "Standardized Residuals",
color = "Status") +
annotate("text", x = min(plot_data$Fitted, na.rm=TRUE), y = 3,
label = "✓ GOOD: Random scatter\n✗ BAD: Funnel, curve, or clusters",
hjust = 0, vjust = 1, color = "darkgreen", size = 3.5) +
theme_minimal() +
theme(legend.position = "bottom")

What to look for:
What is Levene’s test? A formal test for homogeneity of variance across groups [7].
Interpretation: - p > .05: Variances are not significantly different (GOOD) - p < .05: Variances are significantly different (CONCERN)
Note: Levene’s test requires categorical grouping variables. We’ll demonstrate with Sex and SES.
# Levene's test for Sex groups
# We'll test if RS1 variance differs by Sex
levene_sex <- leveneTest(noout$RS1, noout$Sex, center = median)
print(levene_sex)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.0257 0.8729
## 121
##
## Interpretation for Sex:
if(levene_sex$`Pr(>F)`[1] > 0.05) {
cat("✓ GOOD: Levene's test p-value =", round(levene_sex$`Pr(>F)`[1], 4), "\n")
cat(" Variances are not significantly different across Sex groups.\n")
} else {
cat("⚠ WARNING: Levene's test p-value =", round(levene_sex$`Pr(>F)`[1], 4), "\n")
cat(" Variances are significantly different across Sex groups.\n")
cat(" Consider using Welch's correction or non-parametric tests.\n")
}
## ✓ GOOD: Levene's test p-value = 0.8729
## Variances are not significantly different across Sex groups.
# Levene's test for SES groups
levene_ses <- leveneTest(noout$RS1, noout$SES, center = median)
print(levene_ses)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 1.6957 0.1878
## 120
##
## Interpretation for SES:
if(levene_ses$`Pr(>F)`[1] > 0.05) {
cat("✓ GOOD: Levene's test p-value =", round(levene_ses$`Pr(>F)`[1], 4), "\n")
cat(" Variances are not significantly different across SES groups.\n")
} else {
cat("⚠ WARNING: Levene's test p-value =", round(levene_ses$`Pr(>F)`[1], 4), "\n")
cat(" Variances are significantly different across SES groups.\n")
cat(" Consider using Welch's correction or non-parametric tests.\n")
}
## ✓ GOOD: Levene's test p-value = 0.1878
## Variances are not significantly different across SES groups.
# Create boxplots to visualize variance differences
par(mfrow = c(1, 2))
# RS1 by Sex
boxplot(RS1 ~ Sex, data = noout,
main = "RS1 by Sex",
xlab = "Sex",
ylab = "RS1 Score",
col = c("lightblue", "lightpink"))
# RS1 by SES
boxplot(RS1 ~ SES, data = noout,
main = "RS1 by SES",
xlab = "Socioeconomic Status",
ylab = "RS1 Score",
col = c("lightcoral", "lightyellow", "lightgreen"))

Interpretation: If the boxes have similar heights (interquartile ranges), variances are roughly equal. If one box is much taller/shorter than others, homogeneity may be violated [7].
| Assumption | Test_Method | Good_Result | If_Violated |
|---|---|---|---|
| Independence | Durbin-Watson test | DW: 1.5-2.5, p > .05 | Use mixed models or time-series methods |
| Additivity (Multicollinearity) | Correlation matrix, VIF | r < .80, VIF < 10 | Remove/combine correlated predictors |
| Linearity | Q-Q plot, Residual plot | Points on Q-Q line, Random scatter | Transform variables or use non-linear models |
| Normality | Histogram, Q-Q plot, Shapiro-Wilk, Skew/Kurt | Bell-shaped, p > .05, \|Skew\| < 2, \|Kurt\| < 7 | Transform data or use robust/non-parametric tests |
| Homogeneity | Levene’s test, Boxplots | p > .05, Similar box heights | Use Welch’s correction or non-parametric tests |
| Homoscedasticity | Residual plot (fitted vs. residuals) | Random scatter, Equal spread | Transform outcome or use robust standard errors |
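The numeric checks in this table can be bundled into one convenience wrapper. The function below is a hypothetical sketch, not lecture code: it computes the Durbin-Watson statistic by hand (so it needs no extra packages), runs Shapiro-Wilk on the standardized residuals, and counts residuals beyond ±2. The Levene check is omitted because it requires a grouping factor.

```r
# Hypothetical helper: pass/fail flags for the numeric checks on any lm fit
check_lm_assumptions <- function(fit) {
  e <- rstandard(fit)
  dw <- sum(diff(e)^2) / sum(e^2)      # Durbin-Watson statistic by hand
  sw_p <- shapiro.test(e)$p.value      # normality of residuals
  pct_out <- mean(abs(e) > 2) * 100    # % of residuals beyond ±2
  c(Independence_ok = dw >= 1.5 && dw <= 2.5,
    Normality_ok    = sw_p > .05,
    Residuals_ok    = pct_out <= 5)
}

check_lm_assumptions(lm(mpg ~ wt + hp, data = mtcars))
```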
Let’s create a single comprehensive view of all our assumption checks:
# Create a comprehensive diagnostic dashboard
par(mfrow = c(3, 4), mar = c(4, 4, 3, 2))
# Row 1: Independence & Multicollinearity
# 1. Residuals over time (Independence)
plot(1:length(standardized), standardized, type = "l", col = "steelblue", lwd = 1.5,
main = "1. Independence Check",
xlab = "Observation Order", ylab = "Standardized Residuals",
ylim = c(-3, 3))
abline(h = 0, col = "red", lwd = 2)
abline(h = c(-2, 2), lty = 2, col = "gray")
# Add DW result
dw_stat <- durbinWatsonTest(fake)$dw
dw_status <- ifelse(dw_stat >= 1.5 & dw_stat <= 2.5, "✓ PASS", "⚠ CHECK")
text(length(standardized)/2, 2.5,
sprintf("DW = %.2f\n%s", dw_stat, dw_status),
col = ifelse(dw_status == "✓ PASS", "darkgreen", "orange"),
font = 2, cex = 1.1)
# 2. ACF plot
acf(standardized, main = "2. Autocorrelation Function",
col = "steelblue", lwd = 2)
# 3. Correlation heatmap (simplified)
cor_subset <- cor(cont_vars_for_cor[, 1:min(6, ncol(cont_vars_for_cor))])
image(1:nrow(cor_subset), 1:ncol(cor_subset), cor_subset,
col = colorRampPalette(c("blue", "white", "red"))(100),
main = "3. Correlation Matrix",
xlab = "", ylab = "", axes = FALSE)
axis(1, at = 1:nrow(cor_subset), labels = rownames(cor_subset), las = 2, cex.axis = 0.8)
axis(2, at = 1:ncol(cor_subset), labels = colnames(cor_subset), las = 2, cex.axis = 0.8)
max_cor <- max(abs(cor_subset[upper.tri(cor_subset)]))
cor_status <- ifelse(max_cor < 0.80, "✓ PASS", "⚠ CHECK")
text(nrow(cor_subset)/2, ncol(cor_subset) + 1,
sprintf("Max r = %.2f\n%s", max_cor, cor_status),
col = ifelse(cor_status == "✓ PASS", "darkgreen", "orange"),
font = 2, cex = 1.1)
# 4. VIF bar plot (top 6)
vif_top <- head(vif_df[order(-vif_df$VIF), ], 6)
barplot(vif_top$VIF, names.arg = vif_top$Variable,
col = ifelse(vif_top$VIF > 10, "red",
ifelse(vif_top$VIF > 5, "orange", "lightgreen")),
main = "4. VIF Values (Top 6)",
ylab = "VIF", las = 2, cex.names = 0.7)
abline(h = 10, col = "red", lwd = 2, lty = 2)
abline(h = 5, col = "orange", lwd = 2, lty = 2)
max_vif <- max(vif_df$VIF)
vif_status <- ifelse(max_vif < 10, "✓ PASS", "⚠ CHECK")
text(3, max(vif_top$VIF) * 0.9,
sprintf("Max VIF = %.1f\n%s", max_vif, vif_status),
col = ifelse(vif_status == "✓ PASS", "darkgreen", "orange"),
font = 2, cex = 1.1)
# Row 2: Linearity & Normality
# 5. Q-Q plot
qqnorm(standardized, main = "5. Q-Q Plot (Normality)",
pch = 19, col = rgb(0, 0, 1, 0.5))
qqline(standardized, col = "red", lwd = 2)
outside_2 <- sum(abs(standardized) > 2, na.rm = TRUE) / length(standardized) * 100
qq_status <- ifelse(outside_2 <= 5, "✓ PASS", "⚠ CHECK")
text(-2, max(standardized, na.rm = TRUE) * 0.9,
sprintf("%.1f%% outside ±2\n%s", outside_2, qq_status),
col = ifelse(qq_status == "✓ PASS", "darkgreen", "orange"),
font = 2, cex = 1.1, adj = 0)
# 6. Histogram of residuals
hist(standardized, breaks = 15, col = "lightblue", border = "white",
main = "6. Histogram of Residuals",
xlab = "Standardized Residuals", ylab = "Frequency")
curve(dnorm(x, mean(standardized, na.rm=TRUE), sd(standardized, na.rm=TRUE)) *
length(standardized) * diff(range(standardized, na.rm=TRUE)) / 15,
add = TRUE, col = "red", lwd = 2)
shapiro_p <- shapiro.test(standardized)$p.value
shapiro_status <- ifelse(shapiro_p > 0.05, "✓ PASS", "⚠ CHECK")
text(mean(standardized, na.rm=TRUE),
max(hist(standardized, breaks = 15, plot = FALSE)$counts) * 0.9,
sprintf("Shapiro p = %.3f\n%s", shapiro_p, shapiro_status),
col = ifelse(shapiro_status == "✓ PASS", "darkgreen", "orange"),
font = 2, cex = 1.1)
# 7. Residual plot (Linearity)
plot(fitvalues, standardized, pch = 19, col = rgb(0, 0, 0, 0.5),
main = "7. Residual Plot (Linearity)",
xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red", lwd = 2)
abline(h = c(-2, 2), lty = 2, col = "gray")
lines(lowess(fitvalues, standardized), col = "blue", lwd = 2)
text(min(fitvalues, na.rm=TRUE), 2.5,
"Look for:\nRandom scatter",
col = "darkgreen", font = 2, adj = 0, cex = 0.9)
# 8. Scale-Location plot
sqrt_abs_resid <- sqrt(abs(standardized))
plot(fitvalues, sqrt_abs_resid, pch = 19, col = rgb(0, 0, 0, 0.5),
main = "8. Scale-Location\n(Homoscedasticity)",
xlab = "Fitted Values", ylab = "√|Residuals|")
lines(lowess(fitvalues, sqrt_abs_resid), col = "red", lwd = 2)
text(min(fitvalues, na.rm=TRUE), max(sqrt_abs_resid, na.rm=TRUE) * 0.9,
"Flat line = good",
col = "darkgreen", font = 2, adj = 0, cex = 0.9)
# Row 3: Homogeneity & Summary
# 9. Boxplot by Sex
boxplot(RS1 ~ Sex, data = noout,
col = c("lightblue", "lightpink"),
main = "9. RS1 by Sex",
ylab = "RS1 Score")
levene_sex_p <- leveneTest(noout$RS1, noout$Sex, center = median)$`Pr(>F)`[1]
levene_status <- ifelse(levene_sex_p > 0.05, "✓ PASS", "⚠ CHECK")
text(1.5, max(noout$RS1, na.rm=TRUE) * 0.95,
sprintf("Levene p = %.3f\n%s", levene_sex_p, levene_status),
col = ifelse(levene_status == "✓ PASS", "darkgreen", "orange"),
font = 2, cex = 1.1)
# 10. Boxplot by SES
boxplot(RS1 ~ SES, data = noout,
col = c("lightcoral", "lightyellow", "lightgreen"),
main = "10. RS1 by SES",
ylab = "RS1 Score")
levene_ses_p <- leveneTest(noout$RS1, noout$SES, center = median)$`Pr(>F)`[1]
levene_status2 <- ifelse(levene_ses_p > 0.05, "✓ PASS", "⚠ CHECK")
text(2, max(noout$RS1, na.rm=TRUE) * 0.95,
sprintf("Levene p = %.3f\n%s", levene_ses_p, levene_status2),
col = ifelse(levene_status2 == "✓ PASS", "darkgreen", "orange"),
font = 2, cex = 1.1)
# 11. Cook's Distance
cooks_d <- cooks.distance(fake)
plot(1:length(cooks_d), cooks_d, type = "h", col = "steelblue", lwd = 2,
main = "11. Cook's Distance",
xlab = "Observation", ylab = "Cook's D")
abline(h = 4/length(cooks_d), col = "red", lwd = 2, lty = 2)
influential <- sum(cooks_d > 4/length(cooks_d))
text(length(cooks_d)/2, max(cooks_d) * 0.9,
sprintf("%d influential\npoints", influential),
col = ifelse(influential == 0, "darkgreen", "orange"),
font = 2, cex = 1.1)
# 12. Summary Report
plot.new()
text(0.5, 0.95, "Assumption Check Summary", font = 2, cex = 1.5)
checks <- c(
sprintf("1. Independence: %s", ifelse(dw_stat >= 1.5 & dw_stat <= 2.5, "✓", "⚠")),
sprintf("2. Multicollinearity: %s", ifelse(max_vif < 10, "✓", "⚠")),
sprintf("3. Linearity: Visual check"),
sprintf("4. Normality: %s", ifelse(shapiro_p > 0.05, "✓", "⚠")),
sprintf("5. Homogeneity: %s", ifelse(levene_sex_p > 0.05 & levene_ses_p > 0.05, "✓", "⚠"))
)
y_pos <- seq(0.75, 0.35, length.out = 5)
for(i in 1:5) {
color <- ifelse(grepl("✓", checks[i]), "darkgreen", "orange")
text(0.1, y_pos[i], checks[i], adj = 0, cex = 1.2, col = color, font = 2)
}
text(0.5, 0.15, sprintf("Sample Size: N = %d", nrow(noout)),
cex = 1.2, font = 2)
overall_status <- sum(grepl("✓", checks))
text(0.5, 0.05,
sprintf("Overall: %d/5 checks passed", overall_status),
cex = 1.3, font = 2,
col = ifelse(overall_status >= 4, "darkgreen", "orange"))

How to read this dashboard:
Status indicators: - ✓ PASS (green) = Assumption met - ⚠ CHECK (orange) = Potential violation, investigate further
When assumptions are violated, transformations can often help [1]. Here are the most common transformations and when to use them:
When to use each transformation [1]:
| Problem | Transformation | R_Code |
|---|---|---|
| Positive skew (mild) | Square root: sqrt(x) | sqrt(variable) |
| Positive skew (moderate) | Log: log(x) or log(x + 1) | log(variable + 1) |
| Positive skew (severe) | Reciprocal: 1/x or 1/(x + 1) | 1 / (variable + 1) |
| Negative skew | Reflect then transform: max(x) - x, then apply above | max(variable) - variable |
| Heteroscedasticity | Log outcome variable | log(outcome + 1) |
| Non-linearity | Add polynomial terms or use non-linear model | lm(y ~ x + I(x^2)) |
Important notes [1]: - Always add a small constant (like 1) before log/reciprocal if data contain zeros - Transformations change the interpretation of your results - Report both original and transformed statistics - If transformations don’t help, consider robust or non-parametric methods
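These notes can be verified on simulated positively skewed data. The demo below computes skewness with the same population-moment formula `moments::skewness()` uses, so it runs without extra packages:

```r
# Skewness via the population-moment formula (as in moments::skewness)
skw <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

set.seed(11)
x <- rexp(1000)      # strong positive skew (theoretical value = 2)

skw(x)               # well above 0
skw(sqrt(x))         # square root: milder skew
skw(log(x + 1))      # log: the +1 guards against log(0) when data contain zeros

# The square root clearly reduces the skew of this sample
abs(skw(sqrt(x))) < abs(skw(x))   # TRUE
```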
| Violated_Assumption | Parametric_Fix | Non_Parametric_Alternative |
|---|---|---|
| Independence | Mixed models, time-series models | Account for clustering structure |
| Multicollinearity | Remove/combine predictors, PCA | Ridge/Lasso regression |
| Linearity | Add polynomial terms, splines | GAM (Generalized Additive Models) |
| Normality (small N) | Transform data, bootstrap | Mann-Whitney, Kruskal-Wallis |
| Normality (large N) | Proceed (CLT applies) | Standard tests are robust |
| Homogeneity | Welch’s correction | Mann-Whitney, Kruskal-Wallis |
| Homoscedasticity | Weighted least squares, robust SE | Rank-based tests |
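The homogeneity fixes in this table are one argument away in base R. A demo on two simulated groups with deliberately different spreads (hypothetical data):

```r
set.seed(21)
g1 <- rnorm(40, mean = 50, sd = 2)    # tight group
g2 <- rnorm(40, mean = 52, sd = 12)   # much more spread out

# Welch's t-test is R's DEFAULT: t.test() does not assume equal variances
welch <- t.test(g1, g2)
welch$method                          # "Welch Two Sample t-test"

# Student's t-test pools the variances -- only with var.equal = TRUE
t.test(g1, g2, var.equal = TRUE)

# Welch ANOVA analogue for 3+ groups: oneway.test() (var.equal = FALSE default)
scores <- c(g1, g2)
grp <- factor(rep(c("A", "B"), each = 40))
oneway.test(scores ~ grp)
```

Note that the Welch correction costs degrees of freedom: `welch$parameter` is well below the pooled df of 78, which is the price paid for not assuming homogeneity.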
Based on Field et al. [1]:
Using the mtcars dataset:
# Your code here
data(mtcars)
# Create fake regression
random_y <- rchisq(nrow(mtcars), 7)
fake_model <- lm(random_y ~ ., data = mtcars)
# Extract diagnostics
std_resid <- rstudent(fake_model)
fitted_vals <- scale(fake_model$fitted.values)
# 1. Independence: Durbin-Watson
durbinWatsonTest(fake_model)
# 2. Multicollinearity: VIF
vif(fake_model)
# 3. Linearity: Q-Q plot
qqnorm(std_resid)
qqline(std_resid)
# 4. Normality: Shapiro-Wilk
shapiro.test(std_resid)
# 5. Homoscedasticity: Residual plot
plot(fitted_vals, std_resid)
abline(h = 0, col = "red")

Given the following residual plots, identify which assumption is violated:
Questions: - Plot A: Which assumption is violated? - Plot B: Which assumption is violated? - Plot C: Are any assumptions violated? - Plot D: What might this pattern indicate?
Answers: - Plot A: Homoscedasticity (funnel pattern - variance increases with fitted values) - Plot B: Linearity (curved pattern - non-linear relationship) - Plot C: None - random scatter indicates assumptions are met - Plot D: Missing predictor or two distinct groups not accounted for in the model
Given the following VIF values, which variables should you be concerned about?
| Variable | VIF |
|---|---|
| Age | 2.3 |
| Income | 15.7 |
| Education | 3.1 |
| Experience | 14.2 |
| Salary | 18.9 |
Question: Which variables have problematic multicollinearity? What would you do?
Answer: Income (VIF = 15.7), Experience (VIF = 14.2), and Salary (VIF = 18.9) all exceed the threshold of 10, indicating serious multicollinearity [2]. These variables likely measure similar constructs (e.g., Income and Salary are probably highly correlated). Solution: Remove one or two of these variables, or combine them into a single “economic status” composite variable.
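The composite-variable fix can be demonstrated with simulated stand-ins for the exercise variables (all names and numbers here are hypothetical). VIF is computed by hand as 1/(1 − R²) from regressing each predictor on the others, so no extra packages are needed:

```r
set.seed(33)
n <- 200
base_pay   <- rnorm(n, 60, 10)
income     <- base_pay + rnorm(n, sd = 2)   # near-duplicate of base_pay
salary     <- base_pay + rnorm(n, sd = 2)   # another near-duplicate
experience <- rnorm(n, 10, 3)               # unrelated predictor

# VIF by hand: regress each predictor on the rest, VIF = 1 / (1 - R^2)
vif_hand <- function(df) {
  sapply(names(df), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(df), v), v), data = df))$r.squared
    1 / (1 - r2)
  })
}

vif_hand(data.frame(income, salary, experience))   # income/salary inflated

# Fix: average the z-scored duplicates into one composite
composite <- data.frame(econ = rowMeans(scale(cbind(income, salary))),
                        experience = experience)
vif_hand(composite)                                # all VIFs near 1
```

The same `vif_hand()` values match `car::vif()` on the corresponding full model; averaging z-scores (rather than raw values) keeps the duplicates on a common scale before combining them.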
Additivity: The assumption that predictors contribute independently to the outcome; violated when multicollinearity is present [2].
Central Limit Theorem: States that in large samples (N > 30), the sampling distribution tends to be normal regardless of the population distribution [6].
Durbin-Watson test: Tests for autocorrelation in residuals; values near 2 indicate independence [1].
Fake regression: A regression model created solely for diagnostic purposes, not to test hypotheses [1].
Homogeneity of variance: Equal variances across categorical groups; tested with Levene’s test [7].
Homoscedasticity: Equal variance of residuals across the range of predicted values [8].
Independence: The assumption that observations (or residuals) are not correlated with each other [1].
Kurtosis: Measure of the “peakedness” of a distribution; excess kurtosis = 0 for normal distributions [5].
Levene’s test: Statistical test for homogeneity of variance across groups [7].
Linearity: The assumption that relationships between variables are linear (straight lines) rather than curved [3].
Multicollinearity: High correlation between predictor variables (r > .80 or .90); makes coefficients unstable [2].
Normality: The assumption that the sampling distribution (not necessarily raw data) is normally distributed [6].
Q-Q plot: Quantile-quantile plot comparing observed data to theoretical normal distribution [5].
Residual: The difference between observed and predicted values; also called “error” [4].
Shapiro-Wilk test: Statistical test for normality; significant p-values indicate non-normality [5].
Skewness: Measure of asymmetry in a distribution; 0 = symmetric, positive = right skew, negative = left skew [5].
Standardized residual: Residual converted to z-score (mean = 0, SD = 1) for easier interpretation [4].
Variance Inflation Factor (VIF): Measure of multicollinearity; VIF > 10 indicates serious problems [2].
[1] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London: SAGE Publications, 2012.
[2] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 274-276. [Multicollinearity and additivity]
[3] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 293-295. [Linearity assumption and residual plots]
[4] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 269-271. [Standardized residuals and error distributions]
[5] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 179-182. [Q-Q plots and assessing normality]
[6] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 168-169. [Central Limit Theorem and normality assumption]
[7] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 185-188. [Homogeneity of variance and Levene’s test]
[8] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 272-273, 293. [Homoscedasticity and residual plots]
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DiagrammeR_1.0.11 kableExtra_1.4.0 knitr_1.50 corrplot_0.95
## [5] moments_0.14.1 car_3.1-3 carData_3.0-5 mice_3.19.0
## [9] rio_1.2.4 lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0
## [13] dplyr_1.1.4 purrr_1.2.0 readr_2.1.6 tidyr_1.3.1
## [17] tibble_3.3.0 ggplot2_4.0.1 tidyverse_2.0.0 seedhash_0.1.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.2 farver_2.1.2 R.utils_2.13.0
## [5] S7_0.2.1 fastmap_1.2.0 digest_0.6.37 rpart_4.1.24
## [9] timechange_0.3.0 lifecycle_1.0.4 survival_3.8-3 magrittr_2.0.4
## [13] compiler_4.5.2 rlang_1.1.6 sass_0.4.10 tools_4.5.2
## [17] yaml_2.3.10 data.table_1.17.8 labeling_0.4.3 htmlwidgets_1.6.4
## [21] xml2_1.5.1 RColorBrewer_1.1-3 abind_1.4-8 withr_3.0.2
## [25] R.oo_1.27.1 nnet_7.3-20 grid_4.5.2 jomo_2.7-6
## [29] scales_1.4.0 iterators_1.0.14 MASS_7.3-65 cli_3.6.5
## [33] rmarkdown_2.30 reformulas_0.4.2 generics_0.1.4 rstudioapi_0.17.1
## [37] tzdb_0.5.0 visNetwork_2.1.4 minqa_1.2.8 cachem_1.1.0
## [41] splines_4.5.2 vctrs_0.6.5 boot_1.3-32 glmnet_4.1-10
## [45] Matrix_1.7-4 jsonlite_2.0.0 hms_1.1.4 mitml_0.4-5
## [49] Formula_1.2-5 systemfonts_1.3.1 foreach_1.5.2 jquerylib_0.1.4
## [53] glue_1.8.0 nloptr_2.2.1 pan_1.9 codetools_0.2-20
## [57] stringi_1.8.7 shape_1.4.6.1 gtable_0.3.6 lme4_1.1-38
## [61] pillar_1.11.1 htmltools_0.5.8.1 R6_2.6.1 textshaping_1.0.4
## [65] Rdpack_2.6.4 evaluate_1.0.5 lattice_0.22-7 R.methodsS3_1.8.2
## [69] rbibutils_2.4 backports_1.5.0 broom_1.0.11 bslib_0.9.0
## [73] Rcpp_1.1.0 svglite_2.2.2 nlme_3.1-168 mgcv_1.9-3
## [77] xfun_0.54 pkgconfig_2.0.3
End of Week 07: R for Data Analytics Tutorial
ANLY 500 - Analytics I
Harrisburg University