This Week 07 tutorial continues our data screening journey from Week 06, focusing on testing statistical assumptions. After cleaning the data (accuracy, missing values, outliers), we must verify that our data meet the requirements of parametric statistical tests. This tutorial turns the lecture concepts from 07_datascreen_2.rmd into hands-on, code-first workflows.
Reference: This tutorial draws heavily from Field, Miles, and Field [1], Discovering Statistics Using R, particularly Chapters 5 and 7 on assumptions and regression diagnostics, with IEEE-style citations throughout.
Upgrade note: This hands-on tutorial extends the Week 07 lecture slides in 07_datascreen_2.rmd with code-first workflows, beginner-friendly explanations, and theoretical grounding from Field et al. [1].
Package convention: run install.packages("X") once per machine, then load with library(X) at the start of each session. By the end of this tutorial, you will be able to test each of the five parametric assumptions in R and decide how to respond when one is violated.
# Install if needed
# install.packages(c("tidyverse", "rio", "mice", "car", "moments", "corrplot", "knitr", "kableExtra"))
library(tidyverse)
library(ggplot2) # For data visualization
library(rio) # For data import
library(mice) # For multiple imputation
library(car) # For VIF and Durbin-Watson tests
library(moments) # For skewness and kurtosis
library(corrplot) # For correlation visualization
library(knitr)
library(kableExtra)
cat("\n=== REPRODUCIBLE SEED INFORMATION ===")
##
## === REPRODUCIBLE SEED INFORMATION ===
##
## Generator Name: Week07 Data Screening Part 2 - ANLY500
##
## MD5 Hash: cccb8a5cbee9f1dd300d496467230978
##
## Available Seeds: 5912985, 166518175, 149568504, -691293678, -634303044 ...
The fundamental problem: Parametric statistical tests (t-tests, ANOVA, regression) make specific assumptions about the data. When these assumptions are violated, our results become unreliable [1].
Analogy: Think of assumptions like the foundation of a house. You can build a beautiful house on a weak foundation, but it won’t stand for long. Similarly, you can run complex analyses on data that violate assumptions, but your conclusions will be shaky [1].
How to read this diagram:
1. Start with cleaned data (after Week 06 screening)
2. Create a fake regression to generate diagnostic information
3. Check each assumption systematically (1-5)
4. Decision point: are all assumptions met?
   - ✓ Yes → proceed with your planned analysis
   - ✗ No → fix violations and re-check
Key reminders:
- Visual inspection is as important as statistical tests
- Large samples (N ≥ 30) are more forgiving of normality violations
- Some violations are more serious than others (independence is critical!)
Field et al. [1] identify five critical assumptions for parametric statistics:
| Assumption | Plain English | What Happens If Violated | How To Check |
|---|---|---|---|
| Independence | One person’s score doesn’t influence another’s | Confidence intervals and p-values are wrong | Durbin-Watson test |
| Additivity | Predictors aren’t measuring the same thing | Can’t tell which predictor is important | Correlation matrix, VIF |
| Linearity | Relationships form straight lines, not curves | Model predictions are biased | Q-Q plots, residual plots |
| Normality | Sampling distribution is bell-shaped | p-values are unreliable (especially in small samples) | Histograms, Q-Q plots, Shapiro-Wilk |
| Homogeneity/Homoscedasticity | Variance is equal across groups/conditions | Type I error rate increases | Levene’s test, residual plots |
Quick reference guide:
- Row 1 (Green): what GOOD looks like for each assumption
- Row 2 (Red): what BAD looks like - violations to watch for
- Row 3: Homogeneity/Homoscedasticity examples (both good and bad)
Use this as a visual checklist when examining your own data!
What is a fake regression? We create a regression model not to test a hypothesis, but as a diagnostic tool to check assumptions [1]. This works because the residuals from any regression carry the diagnostic information we need.
Why “fake”? We’re not interested in the regression coefficients or p-values. We only care about the diagnostic information it provides [1].
Key insight: Residuals tell us how wrong our predictions are. If assumptions are met, residuals should be random and well-behaved [1].
Field et al. [1] describe several types of residuals. Understanding the differences helps you choose the right diagnostic:
Key differences [1]:
| Type | Formula | Interpretation | Threshold | Best For |
|---|---|---|---|---|
| Raw Residual | e = y - ŷ | Basic error in original units | No standard threshold | Understanding raw errors |
| Standardized Residual | e / SD(e) | Error in standard deviations | |z| > 2 or 3 | Quick screening |
| Studentized Residual | e / SD(e without case i) | Error accounting for case’s leverage | |t| > 2 or 3 | Outlier detection |
| Leverage | Diagonal of hat matrix | Potential to influence predictions | > 2(k+1)/n or 3(k+1)/n | Identifying unusual X |
| Cook’s Distance | Combines residual & leverage | Actual influence on model | > 1 (concern), > 4/n (investigate) | Overall influence assessment |
In this tutorial, we use studentized residuals because they are the most appropriate for assumption checking [1].
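The residual types above map directly onto base-R helper functions. A minimal sketch, using R's built-in mtcars data purely for illustration (not the tutorial dataset):

```r
# Sketch: computing each residual type from the table above.
# mtcars and the mpg ~ wt + hp model are illustrative choices only.
fit <- lm(mpg ~ wt + hp, data = mtcars)

raw_res  <- resid(fit)          # e = y - y-hat, in original units
std_res  <- rstandard(fit)      # residual divided by its estimated SD
stud_res <- rstudent(fit)       # leave-one-out (studentized) version
lev      <- hatvalues(fit)      # leverage: diagonal of the hat matrix
cooks    <- cooks.distance(fit) # combines residual size and leverage

# Flag cases using the thresholds from the table
k <- length(coef(fit)) - 1      # number of predictors
n <- nobs(fit)
flagged <- which(abs(stud_res) > 2 | lev > 2 * (k + 1) / n | cooks > 4 / n)
```

Any case that trips more than one of these flags deserves a closer look before you decide whether to keep it.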
| Component | Real Regression | Fake Regression |
|---|---|---|
| Y (Outcome) | Your actual outcome variable | Random chi-square variable |
| X (Predictors) | Your actual predictors | All variables in your dataset |
| Purpose | Test hypotheses about relationships | Generate diagnostics for assumption checking |
| What We Examine | Coefficients, p-values, R² | Residuals, Q-Q plots, residual plots |
We’ll use the same dataset from Week 06, picking up where we left off after removing outliers.
# Import the dataset
# Note: Ensure "data/data_screening.csv" exists in your working directory
if(file.exists("data/data_screening.csv")) {
master <- import("data/data_screening.csv")
} else {
# Simulated fallback for tutorial purposes if the file is missing
# (note: this fallback omits Age, Health, and RS11-RS14, so later output will differ)
set.seed(seeds[2])
n <- 100
master <- data.frame(
Sex = sample(c(1, 2), n, replace = TRUE),
SES = sample(c(1, 2, 3), n, replace = TRUE),
Grade = sample(9:12, n, replace = TRUE),
Absences = sample(0:15, n, replace = TRUE),
RS1 = sample(1:7, n, replace = TRUE),
RS2 = sample(1:7, n, replace = TRUE),
RS3 = sample(1:7, n, replace = TRUE),
RS4 = sample(1:7, n, replace = TRUE),
RS5 = sample(1:7, n, replace = TRUE),
RS6 = sample(1:7, n, replace = TRUE),
RS7 = sample(1:7, n, replace = TRUE),
RS8 = sample(1:7, n, replace = TRUE),
RS9 = sample(1:7, n, replace = TRUE),
RS10 = sample(1:7, n, replace = TRUE)
)
# Introduce some missingness
master[sample(1:n, 5), "RS1"] <- NA
master[sample(1:n, 3), "RS5"] <- NA
}
str(master)
## 'data.frame': 137 obs. of 20 variables:
## $ Sex : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : int 17 16 13 15 16 15 12 14 13 18 ...
## $ SES : int 2 3 2 3 2 3 3 3 1 3 ...
## $ Grade : int 11 7 5 6 11 7 4 6 4 9 ...
## $ Absences: int 2 2 2 2 6 2 1 2 2 5 ...
## $ RS1 : int 6 7 5 5 7 7 1 2 6 7 ...
## $ RS2 : int 4 1 NA 5 4 7 6 3 6 6 ...
## $ RS3 : int 2 1 5 5 4 7 NA 2 7 5 ...
## $ RS4 : int 2 5 7 7 7 7 1 3 6 NA ...
## $ RS5 : int 4 7 4 5 4 7 4 2 1 6 ...
## $ RS6 : int 7 7 6 6 4 7 7 3 4 6 ...
## $ RS7 : int 7 7 7 5 7 7 7 3 6 10 ...
## $ RS8 : int 4 7 6 7 7 7 NA 3 6 7 ...
## $ RS9 : int 5 4 6 6 4 7 NA 2 2 7 ...
## $ RS10 : int 7 7 7 7 7 7 4 2 5 7 ...
## $ RS11 : int 4 1 NA 6 7 7 3 3 6 5 ...
## $ RS12 : int 7 7 5 6 4 7 7 3 6 6 ...
## $ RS13 : int 4 4 7 6 7 7 7 2 6 6 ...
## $ RS14 : int 7 7 6 6 4 7 7 3 2 6 ...
## $ Health : int 6 6 2 2 6 4 3 6 1 5 ...
Let’s quickly apply the Week 06 screening steps to prepare our data.
# Step 1: Fix categorical variables
notypos <- master
notypos$Sex <- factor(notypos$Sex,
levels = c(1,2),
labels = c("Women", "Men"))
notypos$SES <- factor(notypos$SES,
levels = c(1,2, 3),
labels = c("Low", "Medium", "High"))
# Step 2: Fix out-of-range values (if any)
notypos$Grade[ notypos$Grade > 15 ] <- NA
rs_cols <- grep("^RS", names(notypos))
notypos[ , rs_cols][ notypos[ , rs_cols] > 7 ] <- NA
# Step 3: Handle missing data
percentmiss <- function(x){ sum(is.na(x))/length(x) * 100 }
missing <- apply(notypos, 1, percentmiss)
replace_rows <- subset(notypos, missing <= 5)
# Step 4: Multiple imputation
temp_no_miss <- mice(replace_rows, printFlag = FALSE, seed = seeds[3])
nomiss <- complete(temp_no_miss, 1)
all_columns <- nomiss
# Step 5: Remove multivariate outliers
cont_vars_cols <- c("Grade", "Absences", paste0("RS", 1:10))
cont_vars <- all_columns[, cont_vars_cols]
mahal <- mahalanobis(cont_vars,
colMeans(cont_vars, na.rm=TRUE),
cov(cont_vars, use ="pairwise.complete.obs"))
cutoff <- qchisq(1-.001, ncol(cont_vars))
noout <- subset(all_columns, mahal < cutoff)
cat("Final sample size after screening:", nrow(noout), "\n")
## Final sample size after screening: 123
Interpretation: We now have a clean dataset with no typos, no missing values, and no extreme outliers. We’re ready to check assumptions!
Before we check assumptions, let’s understand what makes a case “influential” [1]:
Key concepts from Field et al. [1]:
Key takeaways [1]:
- Outliers have large residuals but may not influence the model much
- High leverage points have unusual X values and the potential to influence
- Influential cases combine both: unusual X + large residual = changes the model substantially
- Cook's Distance > 1 indicates a case that substantially influences the regression [1]
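A small simulated sketch makes the distinction concrete: one added case with an unusual X value and a large residual dominates Cook's Distance (the data, seed, and variable names here are illustrative, not from the tutorial dataset):

```r
# Sketch: an influential case = unusual X value + large residual.
# (Simulated data for illustration only.)
set.seed(3)
x <- rnorm(30)
y <- x + rnorm(30, sd = 0.5)

# Add one point far out in X that also breaks the trend
x <- c(x, 6)
y <- c(y, -4)

fit <- lm(y ~ x)
d <- cooks.distance(fit)
which.max(d)   # the added case (observation 31)
max(d)         # typically far above the D > 1 concern threshold
```

Try removing observation 31 and refitting: the slope changes substantially, which is exactly what "influential" means.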
# Create a comprehensive data overview
par(mfrow = c(2, 3))
# 1. Sample size by groups
sex_counts <- table(noout$Sex)
barplot(sex_counts, col = c("lightblue", "lightpink"),
main = "Sample Size by Sex",
ylab = "Count", ylim = c(0, max(sex_counts) * 1.2))
text(x = 1:length(sex_counts), y = sex_counts + max(sex_counts) * 0.05,
labels = sex_counts, font = 2)
ses_counts <- table(noout$SES)
barplot(ses_counts, col = c("lightcoral", "lightyellow", "lightgreen"),
main = "Sample Size by SES",
ylab = "Count", ylim = c(0, max(ses_counts) * 1.2))
text(x = 1:length(ses_counts), y = ses_counts + max(ses_counts) * 0.05,
labels = ses_counts, font = 2)
# 2. Distribution of continuous variables
hist(noout$Grade, breaks = 10, col = "steelblue", border = "white",
main = "Distribution: Grade",
xlab = "Grade Level", ylab = "Frequency")
hist(noout$Absences, breaks = 10, col = "orange", border = "white",
main = "Distribution: Absences",
xlab = "Number of Absences", ylab = "Frequency")
# 3. Example resilience scores
hist(noout$RS1, breaks = 7, col = "purple", border = "white",
main = "Distribution: Resilience Score 1",
xlab = "RS1 Score (1-7)", ylab = "Frequency")
# 4. Summary statistics text
plot.new()
text(0.5, 0.9, "Dataset Summary", font = 2, cex = 1.5)
text(0.1, 0.75, sprintf("Total N: %d", nrow(noout)), adj = 0, cex = 1.2)
text(0.1, 0.65, sprintf("Variables: %d", ncol(noout)), adj = 0, cex = 1.2)
text(0.1, 0.55, sprintf("Continuous: %d", length(grep("^RS|Grade|Absences", names(noout)))), adj = 0, cex = 1.2)
text(0.1, 0.45, sprintf("Categorical: 2 (Sex, SES)"), adj = 0, cex = 1.2)
text(0.1, 0.30, "Ready for assumption testing!", adj = 0, cex = 1.3, col = "darkgreen", font = 2)

Definition [1]: The independence assumption states that the errors (residuals) in your model should not be related to each other.
Plain English Example: Imagine you’re measuring test scores from different students. Independence means that one student’s score shouldn’t influence another student’s score. If students copied from each other, their scores would be related (not independent), which would violate this assumption [1].
When is independence violated? Common cases include repeated measurements from the same person, participants who interact with one another (e.g., students in the same classroom), and observations collected in a time-ordered sequence.
The Durbin-Watson test checks whether adjacent residuals (errors) are correlated [1].
Test statistic interpretation:
- DW ≈ 2: residuals are uncorrelated (GOOD!)
- DW < 1 or DW > 3: serious concern about independence [1]
- DW < 2: positive autocorrelation (adjacent errors are similar)
- DW > 2: negative autocorrelation (adjacent errors alternate)
Interpretation:
- Top row: time series plots showing how residuals change over time
- Bottom row: ACF (Autocorrelation Function) plots - bars outside the blue lines indicate significant correlation
- Independent residuals: random pattern, no correlation between adjacent points
- Positive autocorrelation: smooth waves - if one error is positive, the next tends to be positive too
- Negative autocorrelation: zigzag pattern - errors alternate between positive and negative
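A quick simulation shows how autocorrelated errors pull the Durbin-Watson statistic away from 2 (the seed, sample size, and AR coefficient below are arbitrary illustrative choices):

```r
# Sketch: Durbin-Watson under independent vs. AR(1) errors.
library(car)
set.seed(123)
n <- 200
x <- rnorm(n)

y_indep <- 0.5 * x + rnorm(n)                                   # independent errors
y_ar1   <- 0.5 * x + as.numeric(arima.sim(list(ar = 0.8), n))   # positively autocorrelated

durbinWatsonTest(lm(y_indep ~ x))$dw   # close to 2: no autocorrelation
durbinWatsonTest(lm(y_ar1 ~ x))$dw     # well below 2: positive autocorrelation
```

With an AR(1) coefficient of 0.8, DW lands near 2(1 - 0.8) = 0.4, matching the "serious concern" zone in the table below.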
# Create a random outcome variable (chi-square distributed)
# Why chi-square? Because errors should have lots of small values, few large ones
random <- rchisq(nrow(noout), 7) # df=7 works well for most cases
# Create the fake regression
# Y ~ . means "predict random from ALL other variables"
fake <- lm(random ~ ., data = noout)
# Extract studentized residuals (rstudent accounts for each case's leverage)
standardized <- rstudent(fake)
# Standardize the fitted values (mean = 0, sd = 1)
fitvalues <- scale(fake$fitted.values)
cat("Fake regression created successfully!\n")
## Fake regression created successfully!
## Number of observations: 123
## Number of predictors: 21
Code explanation:
- rchisq(n, df): generates random numbers from a chi-square distribution
- lm(random ~ .): creates a linear model predicting random from all other variables
- rstudent(): calculates studentized residuals (standardized, accounting for leverage)
- scale(): standardizes the fitted values to mean = 0, sd = 1
# Run the Durbin-Watson test on the fake regression (car package)
dw_result <- durbinWatsonTest(fake)
dw_result
## lag Autocorrelation D-W Statistic p-value
## 1 -0.1031872 2.187277 0.48
## Alternative hypothesis: rho != 0
# Interpretation helper
dw_stat <- dw_result$dw
if(dw_stat >= 1 && dw_stat <= 3) {
cat("\n✓ GOOD: Durbin-Watson statistic is", round(dw_stat, 3),
"\n This is within acceptable range (1-3).")
cat("\n Independence assumption is likely met.\n")
} else {
cat("\n✗ WARNING: Durbin-Watson statistic is", round(dw_stat, 3),
"\n This is outside acceptable range (1-3).")
cat("\n Independence assumption may be violated.\n")
}
##
## ✓ GOOD: Durbin-Watson statistic is 2.187
## This is within acceptable range (1-3).
## Independence assumption is likely met.
Interpretation guide:
| DW Value | Interpretation | Action |
|---|---|---|
| < 1 | Strong positive autocorrelation - CONCERN | Consider time-series models or mixed models |
| 1 to 1.5 | Mild positive autocorrelation - Monitor | Check data collection order; may be acceptable |
| 1.5 to 2.5 | No autocorrelation - GOOD | Proceed with standard analyses |
| 2.5 to 3 | Mild negative autocorrelation - Monitor | Check data collection order; may be acceptable |
| > 3 | Strong negative autocorrelation - CONCERN | Consider time-series models or mixed models |
Definition [1]: If you have several predictors, their combined effect should be best described by adding their effects together.
Plain English Example: Think of additivity like ingredients in a recipe. If you’re predicting cake quality from flour and sugar, each ingredient should contribute its own unique effect. If flour and sugar were essentially measuring the same thing (like two different brands of the same ingredient), you’d be counting the same effect twice [2].
The problem: Multicollinearity
When predictors are too highly correlated (r > .80 or .90), we have multicollinearity [2]. This causes:
Key insight: When predictors are highly correlated (right plot), they’re essentially measuring the same thing. Including both doesn’t add much new information and makes it hard to determine which one is actually important [2].
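The double-counting problem is easy to demonstrate: when one predictor is nearly a copy of another, both carry inflated VIF values (simulated data; the names x1/x2/x3 are illustrative):

```r
# Sketch: two near-duplicate predictors inflate each other's VIF.
library(car)
set.seed(42)
n  <- 150
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # x2 is almost a copy of x1 (r close to 1)
x3 <- rnorm(n)                 # an unrelated predictor
y  <- x1 + x3 + rnorm(n)

vif(lm(y ~ x1 + x2 + x3))
# x1 and x2 show very large VIFs; x3 stays near 1
```

Dropping either x1 or x2 (or averaging them into a composite) brings all VIF values back toward 1 without losing real information.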
# Select only continuous variables for correlation check
cont_vars_for_cor <- noout %>%
select(Grade, Absences, starts_with("RS"))
# Calculate correlation matrix
cor_matrix <- cor(cont_vars_for_cor)
# Display correlation matrix
cor_matrix %>%
kable(caption = "Correlation Matrix of Continuous Variables", digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE, font_size = 11) %>%
scroll_box(width = "100%", height = "400px")

| | Grade | Absences | RS1 | RS2 | RS3 | RS4 | RS5 | RS6 | RS7 | RS8 | RS9 | RS10 | RS11 | RS12 | RS13 | RS14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Grade | 1.000 | -0.087 | 0.116 | 0.047 | 0.118 | -0.007 | 0.114 | 0.185 | 0.228 | 0.210 | 0.022 | 0.184 | 0.082 | 0.074 | 0.106 | 0.178 |
| Absences | -0.087 | 1.000 | 0.070 | 0.152 | 0.234 | 0.115 | 0.102 | 0.020 | 0.181 | 0.074 | -0.071 | 0.146 | 0.214 | 0.028 | 0.161 | -0.036 |
| RS1 | 0.116 | 0.070 | 1.000 | 0.289 | 0.386 | 0.309 | 0.287 | 0.276 | 0.496 | 0.290 | 0.324 | 0.410 | 0.310 | 0.317 | 0.296 | 0.211 |
| RS2 | 0.047 | 0.152 | 0.289 | 1.000 | 0.586 | 0.405 | 0.270 | 0.538 | 0.587 | 0.444 | 0.523 | 0.652 | 0.633 | 0.322 | 0.637 | 0.337 |
| RS3 | 0.118 | 0.234 | 0.386 | 0.586 | 1.000 | 0.533 | 0.430 | 0.419 | 0.465 | 0.439 | 0.330 | 0.429 | 0.536 | 0.329 | 0.612 | 0.392 |
| RS4 | -0.007 | 0.115 | 0.309 | 0.405 | 0.533 | 1.000 | 0.329 | 0.244 | 0.279 | 0.191 | 0.332 | 0.443 | 0.390 | 0.109 | 0.393 | 0.213 |
| RS5 | 0.114 | 0.102 | 0.287 | 0.270 | 0.430 | 0.329 | 1.000 | 0.372 | 0.335 | 0.361 | 0.395 | 0.391 | 0.452 | 0.478 | 0.401 | 0.543 |
| RS6 | 0.185 | 0.020 | 0.276 | 0.538 | 0.419 | 0.244 | 0.372 | 1.000 | 0.603 | 0.460 | 0.581 | 0.499 | 0.584 | 0.458 | 0.613 | 0.532 |
| RS7 | 0.228 | 0.181 | 0.496 | 0.587 | 0.465 | 0.279 | 0.335 | 0.603 | 1.000 | 0.512 | 0.482 | 0.613 | 0.541 | 0.436 | 0.545 | 0.438 |
| RS8 | 0.210 | 0.074 | 0.290 | 0.444 | 0.439 | 0.191 | 0.361 | 0.460 | 0.512 | 1.000 | 0.479 | 0.492 | 0.472 | 0.512 | 0.571 | 0.412 |
| RS9 | 0.022 | -0.071 | 0.324 | 0.523 | 0.330 | 0.332 | 0.395 | 0.581 | 0.482 | 0.479 | 1.000 | 0.647 | 0.626 | 0.513 | 0.556 | 0.549 |
| RS10 | 0.184 | 0.146 | 0.410 | 0.652 | 0.429 | 0.443 | 0.391 | 0.499 | 0.613 | 0.492 | 0.647 | 1.000 | 0.636 | 0.452 | 0.576 | 0.492 |
| RS11 | 0.082 | 0.214 | 0.310 | 0.633 | 0.536 | 0.390 | 0.452 | 0.584 | 0.541 | 0.472 | 0.626 | 0.636 | 1.000 | 0.497 | 0.740 | 0.380 |
| RS12 | 0.074 | 0.028 | 0.317 | 0.322 | 0.329 | 0.109 | 0.478 | 0.458 | 0.436 | 0.512 | 0.513 | 0.452 | 0.497 | 1.000 | 0.567 | 0.529 |
| RS13 | 0.106 | 0.161 | 0.296 | 0.637 | 0.612 | 0.393 | 0.401 | 0.613 | 0.545 | 0.571 | 0.556 | 0.576 | 0.740 | 0.567 | 1.000 | 0.539 |
| RS14 | 0.178 | -0.036 | 0.211 | 0.337 | 0.392 | 0.213 | 0.543 | 0.532 | 0.438 | 0.412 | 0.549 | 0.492 | 0.380 | 0.529 | 0.539 | 1.000 |
# Create a correlation plot
corrplot(cor_matrix,
method = "color",
type = "upper",
order = "hclust",
addCoef.col = "black",
number.cex = 0.7,
tl.col = "black",
tl.srt = 45,
title = "Correlation Matrix: Checking for Multicollinearity",
mar = c(0,0,2,0))
# Add interpretation guide
legend("bottomright",
legend = c("r > 0.90: CONCERN", "r > 0.80: Monitor", "r < 0.80: OK"),
fill = c("darkred", "orange", "lightblue"),
cex = 0.8)

Interpretation:
- Dark blue/red: strong correlations
- Light colors: weak correlations
- Look for: any correlations > 0.80 or 0.90 [2]
# Find high correlations (excluding diagonal)
high_cor <- which(abs(cor_matrix) > 0.80 & abs(cor_matrix) < 1, arr.ind = TRUE)
if(nrow(high_cor) > 0) {
cat("⚠ WARNING: Found", nrow(high_cor)/2, "pairs with correlations > 0.80:\n\n")
for(i in 1:nrow(high_cor)) {
if(high_cor[i,1] < high_cor[i,2]) { # Avoid duplicates
var1 <- rownames(cor_matrix)[high_cor[i,1]]
var2 <- colnames(cor_matrix)[high_cor[i,2]]
cor_val <- cor_matrix[high_cor[i,1], high_cor[i,2]]
cat(sprintf(" %s <-> %s: r = %.3f\n", var1, var2, cor_val))
}
}
cat("\nConsider combining these variables or selecting one from each pair.\n")
} else {
cat("✓ GOOD: No correlations exceed 0.80.\n")
cat(" Multicollinearity is not a concern based on correlations.\n")
}
## ✓ GOOD: No correlations exceed 0.80.
## Multicollinearity is not a concern based on correlations.
What is VIF? The Variance Inflation Factor measures how much the variance of a regression coefficient is inflated due to multicollinearity [2].
Interpretation:
- VIF < 10: generally acceptable [2]
- VIF > 10: serious multicollinearity problem [2]
- Average VIF substantially greater than 1: possible bias in the model [2]
# Calculate VIF for each predictor
# Note: VIF requires a real regression, so we use our fake regression
# VIF can only be calculated for numeric predictors, so we need to handle factors
vif_values <- vif(fake)
# Handle the case where VIF returns a matrix (for factors with multiple levels)
if(is.matrix(vif_values)) {
# Extract GVIF^(1/(2*Df)) for factors, regular VIF for numeric
vif_clean <- vif_values[, "GVIF^(1/(2*Df))"]
} else {
vif_clean <- vif_values
}
# Create a nice table
vif_df <- data.frame(
Variable = names(vif_clean),
VIF = as.numeric(vif_clean),
Status = ifelse(as.numeric(vif_clean) > 10, "⚠ CONCERN",
ifelse(as.numeric(vif_clean) > 5, "Monitor", "✓ OK"))
)
vif_df %>%
arrange(desc(VIF)) %>%
kable(caption = "Variance Inflation Factors (VIF)", digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(which(vif_df$VIF > 10), bold = TRUE, color = "white", background = "darkred") %>%
row_spec(which(vif_df$VIF > 5 & vif_df$VIF <= 10), bold = TRUE, background = "orange")

| Variable | VIF | Status |
|---|---|---|
| RS13 | 2.052 | ✓ OK |
| RS11 | 1.934 | ✓ OK |
| RS10 | 1.822 | ✓ OK |
| RS2 | 1.769 | ✓ OK |
| RS9 | 1.733 | ✓ OK |
| RS3 | 1.732 | ✓ OK |
| RS7 | 1.729 | ✓ OK |
| RS6 | 1.604 | ✓ OK |
| RS14 | 1.571 | ✓ OK |
| RS12 | 1.466 | ✓ OK |
| RS4 | 1.446 | ✓ OK |
| RS5 | 1.422 | ✓ OK |
| Age | 1.396 | ✓ OK |
| RS8 | 1.393 | ✓ OK |
| Grade | 1.306 | ✓ OK |
| RS1 | 1.263 | ✓ OK |
| Absences | 1.234 | ✓ OK |
| SES | 1.217 | ✓ OK |
| Sex | 1.131 | ✓ OK |
| Health | 1.083 | ✓ OK |
# Overall assessment
avg_vif <- mean(as.numeric(vif_clean), na.rm = TRUE)
cat("\nAverage VIF:", round(avg_vif, 3), "\n")
##
## Average VIF: 1.515
if(max(as.numeric(vif_clean), na.rm = TRUE) < 10 && avg_vif < 2) {
cat("✓ GOOD: All VIF values < 10 and average VIF < 2.\n")
cat(" Multicollinearity is not a concern.\n")
} else if(max(as.numeric(vif_clean), na.rm = TRUE) >= 10) {
cat("⚠ WARNING: Some VIF values ≥ 10.\n")
cat(" Consider removing or combining highly correlated predictors.\n")
} else {
cat("⚠ CAUTION: Average VIF > 2.\n")
cat(" Multicollinearity may be biasing the model.\n")
}
## ✓ GOOD: All VIF values < 10 and average VIF < 2.
## Multicollinearity is not a concern.
# Create a bar plot of VIF values
ggplot(vif_df, aes(x = reorder(Variable, VIF), y = VIF, fill = Status)) +
geom_bar(stat = "identity", color = "black") +
geom_hline(yintercept = 10, linetype = "dashed", color = "red", linewidth = 1) +
geom_hline(yintercept = 5, linetype = "dashed", color = "orange", linewidth = 1) +
annotate("text", x = 1, y = 10.5, label = "VIF = 10 (Concern)", hjust = 0, color = "red") +
annotate("text", x = 1, y = 5.5, label = "VIF = 5 (Monitor)", hjust = 0, color = "orange") +
coord_flip() +
labs(title = "Variance Inflation Factors (VIF)",
subtitle = "Lower is better; VIF > 10 indicates multicollinearity",
x = "Variable",
y = "VIF Value") +
scale_fill_manual(values = c("✓ OK" = "lightgreen",
"Monitor" = "orange",
"⚠ CONCERN" = "darkred")) +
theme_minimal() +
theme(legend.position = "bottom")

Definition [3]: The linearity assumption states that the relationship between variables is linear (a straight line) and not curved.
Plain English Example: Imagine plotting height vs. weight. Linearity means the relationship forms a straight line (or close to it), not a curve. If taller people get heavier at a steady rate, that’s linear. If the relationship curves (like exponential growth), that violates linearity [3].
Why it matters: Most parametric statistics (ANOVA, regression, t-tests) assume linear relationships. If relationships are curved, our models will make biased predictions [3].
Key insight:
- Top row: original data with fitted lines. The linear model (red line) fits well only for linear data.
- Bottom row: residual plots reveal problems. Random scatter = good; patterns = bad [3].
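A numeric version of the same lesson: fit a straight line to data whose true relationship is curved, and the line explains almost nothing (simulated data for illustration; the quadratic form is an arbitrary choice):

```r
# Sketch: a curved relationship that a straight line misses.
set.seed(8)
x <- runif(200, -3, 3)
y <- x^2 + rnorm(200)                  # the true relationship is quadratic

summary(lm(y ~ x))$r.squared           # near 0: the line misses the curve
summary(lm(y ~ x + I(x^2)))$r.squared  # much larger: the quadratic term captures it
```

If a residual plot shows a curve, adding a polynomial term (or transforming a variable) is one common remedy.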
What is a Q-Q plot? QQ stands for “quantile-quantile.” This plot compares your data to what we’d expect from a perfect normal distribution [5]. If the dots follow the diagonal line closely, your data are normally distributed [5].
# Create Q-Q plot
qqnorm(standardized, main = "Q-Q Plot: Assessing Normality of Residuals")
qqline(standardized, col = "red", lwd = 2)
# Add interpretation zones
abline(h = c(-2, 2), lty = 2, col = "blue")
text(-2, 2.5, "Most data should\nfall between -2 and +2", col = "blue", cex = 0.9)

How to interpret: points should hug the red diagonal line, and roughly 95% of standardized residuals should fall between -2 and +2. Curves or S-shapes at the tails signal departures from normality.
# Calculate how many points fall outside ±2
outside_range <- sum(abs(standardized) > 2, na.rm = TRUE)
percent_outside <- (outside_range / length(standardized)) * 100
cat("Standardized residuals outside ±2:", outside_range,
sprintf("(%.1f%%)\n", percent_outside))
## Standardized residuals outside ±2: 4 (3.3%)
if(percent_outside <= 5) {
cat("✓ GOOD: ≤5% of residuals fall outside ±2.\n")
cat(" This is expected for normally distributed data.\n")
} else {
cat("⚠ WARNING: >5% of residuals fall outside ±2.\n")
cat(" Normality assumption may be violated.\n")
}
## ✓ GOOD: ≤5% of residuals fall outside ±2.
## This is expected for normally distributed data.
# Create a data frame for plotting
qq_data <- data.frame(
Theoretical = qqnorm(standardized, plot.it = FALSE)$x,
Sample = qqnorm(standardized, plot.it = FALSE)$y,
Outlier = abs(standardized) > 2
)
# Create enhanced Q-Q plot
ggplot(qq_data, aes(x = Theoretical, y = Sample)) +
geom_point(aes(color = Outlier), size = 2, alpha = 0.6) +
geom_abline(intercept = 0, slope = 1, color = "red", linewidth = 1) +
geom_hline(yintercept = c(-2, 2), linetype = "dashed", color = "blue") +
geom_vline(xintercept = c(-2, 2), linetype = "dashed", color = "blue") +
scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red"),
labels = c("Within ±2", "Outside ±2")) +
labs(title = "Q-Q Plot: Normality of Standardized Residuals",
subtitle = "Points should fall on the red diagonal line",
x = "Theoretical Quantiles",
y = "Sample Quantiles (Standardized Residuals)",
color = "Status") +
theme_minimal() +
theme(legend.position = "bottom")

What are residual plots? They plot the predicted values (X-axis) against the residuals/errors (Y-axis). If the model is good, residuals should be randomly scattered around zero [3].
# Create residual plot
plot(fitvalues, standardized,
main = "Residual Plot: Checking Linearity",
xlab = "Standardized Fitted Values",
ylab = "Standardized Residuals",
pch = 19, col = rgb(0, 0, 0, 0.5))
abline(h = 0, col = "red", lwd = 2)
abline(v = 0, col = "red", lwd = 2)
abline(h = c(-2, 2), lty = 2, col = "blue")
# Add interpretation text
text(min(fitvalues, na.rm=TRUE), 3,
"Look for: Random scatter\nNo curves or patterns",
adj = 0, col = "darkgreen", cex = 0.9)

What to look for: a random cloud of points centered on zero, with no curves, funnels, or other systematic patterns. A curved band suggests nonlinearity; a funnel shape suggests heteroscedasticity [3].
Common misconception: “My data need to be normally distributed.”
The truth [6]: The assumption is that the sampling distribution is normally distributed, not necessarily your raw data [6].
Plain English: We don’t need every individual score to be perfectly normal. What matters is that if we took many samples and calculated their means, those means would form a normal distribution [6].
The Central Limit Theorem [6]: In large samples (N > 30), the sampling distribution tends to be normal anyway, regardless of how the original data look [6].
Practical implication: With large samples, normality is less of a concern. Focus on outliers instead [6].
The magic of the Central Limit Theorem:
- Left: original population is highly skewed (NOT normal)
- Middle: with small samples (n=5), means are still somewhat skewed
- Right: with n ≥ 30, sample means become normally distributed!
Conclusion: Even if your raw data are skewed, your statistical tests can still be valid if your sample size is large enough [6].
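You can verify the Central Limit Theorem yourself with a short simulation (population size, sample sizes, and seed below are arbitrary illustrative choices):

```r
# Sketch: sample means of a skewed population approach normality as n grows.
library(moments)
set.seed(99)
population <- rchisq(1e5, df = 2)        # strongly right-skewed (skewness = 2)

means_n5  <- replicate(2000, mean(sample(population, 5)))
means_n50 <- replicate(2000, mean(sample(population, 50)))

skewness(population)   # large positive skew
skewness(means_n5)     # reduced, but still noticeably skewed
skewness(means_n50)    # close to 0: approximately normal
```

Theory predicts the skewness of the mean shrinks by a factor of 1/sqrt(n), which is exactly the pattern the three numbers show.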
cat("Sample size:", nrow(noout), "\n")
## Sample size: 123
if(nrow(noout) >= 30) {
cat("✓ GOOD: Sample size ≥ 30.\n")
cat(" Central Limit Theorem applies - sampling distribution should be normal.\n")
cat(" Normality of raw data is less critical.\n")
} else {
cat("⚠ CAUTION: Sample size < 30.\n")
cat(" Normality of raw data is more important in small samples.\n")
}
## ✓ GOOD: Sample size ≥ 30.
## Central Limit Theorem applies - sampling distribution should be normal.
## Normality of raw data is less critical.
# Create histogram of standardized residuals
hist(standardized,
breaks = 15,
main = "Histogram of Standardized Residuals",
xlab = "Standardized Residuals",
ylab = "Frequency",
col = "lightblue",
border = "black")
# Overlay normal curve
x_seq <- seq(min(standardized, na.rm=TRUE),
max(standardized, na.rm=TRUE),
length.out = 100)
y_norm <- dnorm(x_seq, mean = mean(standardized, na.rm=TRUE),
sd = sd(standardized, na.rm=TRUE))
y_norm <- y_norm * length(standardized) * diff(range(standardized, na.rm=TRUE)) / 15
lines(x_seq, y_norm, col = "red", lwd = 2)
legend("topright",
legend = c("Observed", "Normal Distribution"),
fill = c("lightblue", NA),
border = c("black", NA),
lty = c(NA, 1),
col = c(NA, "red"),
lwd = c(NA, 2))

Interpretation:
- Bell-shaped: GOOD! Residuals are normally distributed [5]
- Skewed (lopsided): normality may be violated
- Flat or too pointy: kurtosis issues
Skewness: measures asymmetry of the distribution [5]
- Skewness ≈ 0: symmetric (normal)
- Skewness > 0: positive skew (long right tail)
- Skewness < 0: negative skew (long left tail)
- Rule of thumb: |Skewness| < 2 is acceptable [5]

Kurtosis: measures "peakedness" of the distribution [5]
- Excess Kurtosis ≈ 0: normal peakedness
- Excess Kurtosis > 0: heavy tails (more outliers than normal)
- Excess Kurtosis < 0: light tails (fewer outliers than normal)
- Rule of thumb: |Excess Kurtosis| < 7 is acceptable [5]
# Calculate skewness and kurtosis for each continuous variable
skew_kurt_df <- data.frame(
Variable = names(cont_vars_for_cor),
Skewness = sapply(cont_vars_for_cor, skewness, na.rm = TRUE),
Excess_Kurtosis = sapply(cont_vars_for_cor, function(x) kurtosis(x, na.rm = TRUE) - 3),
Skew_Status = ifelse(abs(sapply(cont_vars_for_cor, skewness, na.rm = TRUE)) < 2,
"✓ OK", "⚠ CONCERN"),
Kurt_Status = ifelse(abs(sapply(cont_vars_for_cor, function(x) kurtosis(x, na.rm = TRUE) - 3)) < 7,
"✓ OK", "⚠ CONCERN")
)
skew_kurt_df %>%
kable(caption = "Skewness and Kurtosis for Continuous Variables", digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(which(skew_kurt_df$Skew_Status == "⚠ CONCERN" |
skew_kurt_df$Kurt_Status == "⚠ CONCERN"),
bold = TRUE, background = "lightyellow")

| Variable | Skewness | Excess_Kurtosis | Skew_Status | Kurt_Status |
|---|---|---|---|---|
| Grade | 0.122 | -0.726 | ✓ OK | ✓ OK |
| Absences | 0.516 | -1.032 | ✓ OK | ✓ OK |
| RS1 | -0.605 | -0.507 | ✓ OK | ✓ OK |
| RS2 | -0.465 | -0.695 | ✓ OK | ✓ OK |
| RS3 | 0.075 | -0.962 | ✓ OK | ✓ OK |
| RS4 | -0.392 | -0.996 | ✓ OK | ✓ OK |
| RS5 | -0.123 | -1.130 | ✓ OK | ✓ OK |
| RS6 | -0.373 | -0.850 | ✓ OK | ✓ OK |
| RS7 | -0.619 | -0.535 | ✓ OK | ✓ OK |
| RS8 | -0.186 | -1.231 | ✓ OK | ✓ OK |
| RS9 | -0.302 | -0.767 | ✓ OK | ✓ OK |
| RS10 | -0.638 | -0.658 | ✓ OK | ✓ OK |
| RS11 | -0.298 | -1.044 | ✓ OK | ✓ OK |
| RS12 | -1.057 | 0.269 | ✓ OK | ✓ OK |
| RS13 | -0.542 | -0.522 | ✓ OK | ✓ OK |
| RS14 | -0.455 | -0.761 | ✓ OK | ✓ OK |
# Create a visual comparison
par(mfrow = c(2, 2))
# Normal distribution
set.seed(seeds[5])
normal_data <- rnorm(1000, 0, 1)
hist(normal_data, main = "Normal (Skew=0, Kurt=0)",
xlab = "Value", col = "lightgreen", border = "black")
# Positive skew
pos_skew <- rexp(1000, 1)
hist(pos_skew, main = "Positive Skew (Skew>0)",
xlab = "Value", col = "lightcoral", border = "black")
# Negative skew
neg_skew <- -rexp(1000, 1)
hist(neg_skew, main = "Negative Skew (Skew<0)",
xlab = "Value", col = "lightcoral", border = "black")
# Heavy tails (high kurtosis)
heavy_tails <- rt(1000, df = 3)
hist(heavy_tails, main = "Heavy Tails (Kurt>0)",
xlab = "Value", col = "lightyellow", border = "black")

What is it? A formal statistical test for normality [5].
Interpretation:
- p > .05: data are consistent with a normal distribution (GOOD)
- p < .05: data significantly differ from normal (CONCERN)
Warning [5]: In large samples, the Shapiro-Wilk test can be significant even for minor deviations from normality. Always interpret alongside visual checks (histograms, Q-Q plots).
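A quick simulation makes the warning concrete: with thousands of observations, even mild contamination of an otherwise normal sample tends to produce a significant result (the mixture below is an arbitrary illustrative choice; shapiro.test accepts at most 5000 values):

```r
# Sketch: shapiro.test() sensitivity at large N.
set.seed(11)
mostly_normal <- c(rnorm(4800), rt(200, df = 3))  # 4% heavy-tailed contamination

shapiro.test(mostly_normal)$p.value
# Often significant (p < .05), even though a histogram of these data
# would look essentially normal -- hence the advice to check plots too.
```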
# Shapiro-Wilk test on standardized residuals
shapiro_result <- shapiro.test(standardized)
print(shapiro_result)
##
## Shapiro-Wilk normality test
##
## data: standardized
## W = 0.95979, p-value = 0.001027
# Interpretation
if(shapiro_result$p.value > 0.05) {
cat("\n✓ GOOD: Shapiro-Wilk p-value =", round(shapiro_result$p.value, 4), "\n")
cat(" Residuals are consistent with a normal distribution.\n")
} else {
cat("\n⚠ WARNING: Shapiro-Wilk p-value =", round(shapiro_result$p.value, 4), "\n")
cat(" Residuals significantly differ from normal distribution.\n")
if(nrow(noout) >= 30) {
cat(" However, with N ≥ 30, Central Limit Theorem applies.\n")
cat(" Check visual plots (Q-Q, histogram) for practical significance.\n")
}
}
##
## ⚠ WARNING: Shapiro-Wilk p-value = 0.001
## Residuals significantly differ from normal distribution.
## However, with N ≥ 30, Central Limit Theorem applies.
## Check visual plots (Q-Q, histogram) for practical significance.
Definition: The assumption that variances are roughly equal across groups [7].
Plain English Example: Imagine comparing test scores from three classes. Homogeneity of variance means that the spread of scores (how much they vary) should be similar in all three classes. If one class has scores all between 80-90 (small variance) and another has scores between 40-100 (large variance), this assumption is violated [7].
Definition: The variance of one variable is roughly the same across all values of another variable [8].
Plain English Example: Imagine predicting income from years of education. Homoscedasticity means the variability in income should be roughly the same whether someone has 12 years or 20 years of education. If income becomes much more variable at higher education levels (some people make a lot, some don’t), that’s heteroscedasticity (violation of this assumption) [8].
Key insight: - Top row: Original data with green/red boxes showing variance at different X values - Bottom row: Residual plots make patterns easier to see - Homoscedastic: Variance boxes are similar height; residuals evenly scattered - Heteroscedastic: Variance boxes differ in height; residuals show funnel or bow tie patterns [8]
| Aspect | Homogeneity | Homoscedasticity |
|---|---|---|
| Context | Categorical groups (ANOVA, t-test) | Continuous predictor (Regression) |
| What it checks | Equal variance across groups | Equal variance across predictor range |
| Visualization | Boxplots by group | Residual plot (fitted vs. residuals) |
| Test | Levene’s test | Visual inspection of residual plot |
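The contrast in this table is easy to simulate. The sketch below builds one homoscedastic and one heteroscedastic outcome, then detects the funnel with a simplified Breusch-Pagan-style auxiliary regression (squared residuals regressed on the predictor); all variable names here are hypothetical:

```r
set.seed(99)
x <- runif(300, 1, 10)

# Homoscedastic: constant error SD across all x
y_homo   <- 2 + 0.5 * x + rnorm(300, sd = 1)

# Heteroscedastic: error SD grows with x (the classic funnel)
y_hetero <- 2 + 0.5 * x + rnorm(300, sd = 0.3 * x)

# Simplified Breusch-Pagan idea: regress squared residuals on x;
# a significant slope means the error variance depends on x
bp_slope_p <- function(fit, x) {
  summary(lm(residuals(fit)^2 ~ x))$coefficients["x", "Pr(>|t|)"]
}

bp_slope_p(lm(y_homo ~ x), x)     # typically not significant
bp_slope_p(lm(y_hetero ~ x), x)   # very small: variance depends on x
```

The `car` package's `ncvTest()` offers a formal version of the same check if it is already loaded.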
# Create enhanced residual plot
plot_data <- data.frame(
Fitted = as.numeric(fitvalues),
Residuals = standardized,
Outlier = abs(standardized) > 2
)
ggplot(plot_data, aes(x = Fitted, y = Residuals)) +
geom_point(aes(color = Outlier), size = 2, alpha = 0.6) +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_vline(xintercept = 0, color = "red", linewidth = 1) +
geom_hline(yintercept = c(-2, 2), linetype = "dashed", color = "blue") +
geom_smooth(method = "loess", se = TRUE, color = "darkgreen", linewidth = 1) +
scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red"),
labels = c("Within ±2", "Outside ±2")) +
labs(title = "Residual Plot: Checking Homoscedasticity",
subtitle = "Points should be randomly scattered with equal spread across X-axis",
x = "Standardized Fitted Values",
y = "Standardized Residuals",
color = "Status") +
annotate("text", x = min(plot_data$Fitted, na.rm=TRUE), y = 3,
label = "✓ GOOD: Random scatter\n✗ BAD: Funnel, curve, or clusters",
hjust = 0, vjust = 1, color = "darkgreen", size = 3.5) +
theme_minimal() +
theme(legend.position = "bottom")

What to look for:
What is Levene’s test? A formal test for homogeneity of variance across groups [7].
Interpretation: - p > .05: Variances are not significantly different (GOOD) - p < .05: Variances are significantly different (CONCERN)
Note: Levene’s test requires categorical grouping variables. We’ll demonstrate with Sex and SES.
# Levene's test for Sex groups
# We'll test if RS1 variance differs by Sex
levene_sex <- leveneTest(noout$RS1, noout$Sex, center = median)
print(levene_sex)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.0257 0.8729
## 121
##
## Interpretation for Sex:
if(levene_sex$`Pr(>F)`[1] > 0.05) {
cat("✓ GOOD: Levene's test p-value =", round(levene_sex$`Pr(>F)`[1], 4), "\n")
cat(" Variances are not significantly different across Sex groups.\n")
} else {
cat("⚠ WARNING: Levene's test p-value =", round(levene_sex$`Pr(>F)`[1], 4), "\n")
cat(" Variances are significantly different across Sex groups.\n")
cat(" Consider using Welch's correction or non-parametric tests.\n")
}
## ✓ GOOD: Levene's test p-value = 0.8729
## Variances are not significantly different across Sex groups.
# Levene's test for SES groups
levene_ses <- leveneTest(noout$RS1, noout$SES, center = median)
print(levene_ses)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 1.6957 0.1878
## 120
##
## Interpretation for SES:
if(levene_ses$`Pr(>F)`[1] > 0.05) {
cat("✓ GOOD: Levene's test p-value =", round(levene_ses$`Pr(>F)`[1], 4), "\n")
cat(" Variances are not significantly different across SES groups.\n")
} else {
cat("⚠ WARNING: Levene's test p-value =", round(levene_ses$`Pr(>F)`[1], 4), "\n")
cat(" Variances are significantly different across SES groups.\n")
cat(" Consider using Welch's correction or non-parametric tests.\n")
}
## ✓ GOOD: Levene's test p-value = 0.1878
## Variances are not significantly different across SES groups.
# Create boxplots to visualize variance differences
par(mfrow = c(1, 2))
# RS1 by Sex
boxplot(RS1 ~ Sex, data = noout,
main = "RS1 by Sex",
xlab = "Sex",
ylab = "RS1 Score",
col = c("lightblue", "lightpink"))
# RS1 by SES
boxplot(RS1 ~ SES, data = noout,
main = "RS1 by SES",
xlab = "Socioeconomic Status",
ylab = "RS1 Score",
col = c("lightcoral", "lightyellow", "lightgreen"))

Interpretation: If the boxes have similar heights (interquartile ranges), variances are roughly equal. If one box is much taller/shorter than others, homogeneity may be violated [7].
| Assumption | Test_Method | Good_Result | If_Violated |
|---|---|---|---|
| Independence | Durbin-Watson test | DW: 1.5-2.5, p > .05 | Use mixed models or time-series methods |
| Additivity (Multicollinearity) | Correlation matrix, VIF | r < .80, VIF < 10 | Remove/combine correlated predictors |
| Linearity | Q-Q plot, Residual plot | Points on Q-Q line, Random scatter | Transform variables or use non-linear models |
| Normality | Histogram, Q-Q plot, Shapiro-Wilk, Skew/Kurt | Bell-shaped, p > .05, \|Skew\| < 2, \|Kurt\| < 7 | Transform data or use robust/non-parametric tests |
| Homogeneity | Levene’s test, Boxplots | p > .05, Similar box heights | Use Welch’s correction or non-parametric tests |
| Homoscedasticity | Residual plot (fitted vs. residuals) | Random scatter, Equal spread | Transform outcome or use robust standard errors |
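The numeric checks in this table can be bundled into one convenience wrapper. The function below is a hypothetical sketch, not lecture code: it computes the Durbin-Watson statistic by hand (so it needs no extra packages), runs Shapiro-Wilk on the standardized residuals, and counts residuals beyond ±2. The Levene check is omitted because it requires a grouping factor.

```r
# Hypothetical helper: pass/fail flags for the numeric checks on any lm fit
check_lm_assumptions <- function(fit) {
  e <- rstandard(fit)
  dw <- sum(diff(e)^2) / sum(e^2)      # Durbin-Watson statistic by hand
  sw_p <- shapiro.test(e)$p.value      # normality of residuals
  pct_out <- mean(abs(e) > 2) * 100    # % of residuals beyond ±2
  c(Independence_ok = dw >= 1.5 && dw <= 2.5,
    Normality_ok    = sw_p > .05,
    Residuals_ok    = pct_out <= 5)
}

check_lm_assumptions(lm(mpg ~ wt + hp, data = mtcars))
```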
Let’s create a single comprehensive view of all our assumption checks:
# Create a comprehensive diagnostic dashboard
par(mfrow = c(3, 4), mar = c(4, 4, 3, 2))
# Row 1: Independence & Multicollinearity
# 1. Residuals over time (Independence)
plot(1:length(standardized), standardized, type = "l", col = "steelblue", lwd = 1.5,
main = "1. Independence Check",
xlab = "Observation Order", ylab = "Standardized Residuals",
ylim = c(-3, 3))
abline(h = 0, col = "red", lwd = 2)
abline(h = c(-2, 2), lty = 2, col = "gray")
# Add DW result
dw_stat <- durbinWatsonTest(fake)$dw
dw_status <- ifelse(dw_stat >= 1.5 & dw_stat <= 2.5, "✓ PASS", "⚠ CHECK")
text(length(standardized)/2, 2.5,
sprintf("DW = %.2f\n%s", dw_stat, dw_status),
col = ifelse(dw_status == "✓ PASS", "darkgreen", "orange"),
font = 2, cex = 1.1)
# 2. ACF plot
acf(standardized, main = "2. Autocorrelation Function",
col = "steelblue", lwd = 2)
# 3. Correlation heatmap (simplified)
cor_subset <- cor(cont_vars_for_cor[, 1:min(6, ncol(cont_vars_for_cor))])
image(1:nrow(cor_subset), 1:ncol(cor_subset), cor_subset,
col = colorRampPalette(c("blue", "white", "red"))(100),
main = "3. Correlation Matrix",
xlab = "", ylab = "", axes = FALSE)
axis(1, at = 1:nrow(cor_subset), labels = rownames(cor_subset), las = 2, cex.axis = 0.8)
axis(2, at = 1:ncol(cor_subset), labels = colnames(cor_subset), las = 2, cex.axis = 0.8)
max_cor <- max(abs(cor_subset[upper.tri(cor_subset)]))
cor_status <- ifelse(max_cor < 0.80, "✓ PASS", "⚠ CHECK")
text(nrow(cor_subset)/2, ncol(cor_subset) + 1,
sprintf("Max r = %.2f\n%s", max_cor, cor_status),
col = ifelse(cor_status == "✓ PASS", "darkgreen", "orange"),
font = 2, cex = 1.1)
# 4. VIF bar plot (top 6)
vif_top <- head(vif_df[order(-vif_df$VIF), ], 6)
barplot(vif_top$VIF, names.arg = vif_top$Variable,
col = ifelse(vif_top$VIF > 10, "red",
ifelse(vif_top$VIF > 5, "orange", "lightgreen")),
main = "4. VIF Values (Top 6)",
ylab = "VIF", las = 2, cex.names = 0.7)
abline(h = 10, col = "red", lwd = 2, lty = 2)
abline(h = 5, col = "orange", lwd = 2, lty = 2)
max_vif <- max(vif_df$VIF)
vif_status <- ifelse(max_vif < 10, "✓ PASS", "⚠ CHECK")
text(3, max(vif_top$VIF) * 0.9,
sprintf("Max VIF = %.1f\n%s", max_vif, vif_status),
col = ifelse(vif_status == "✓ PASS", "darkgreen", "orange"),
font = 2, cex = 1.1)
# Row 2: Linearity & Normality
# 5. Q-Q plot
qqnorm(standardized, main = "5. Q-Q Plot (Normality)",
pch = 19, col = rgb(0, 0, 1, 0.5))
qqline(standardized, col = "red", lwd = 2)
outside_2 <- sum(abs(standardized) > 2, na.rm = TRUE) / length(standardized) * 100
qq_status <- ifelse(outside_2 <= 5, "✓ PASS", "⚠ CHECK")
text(-2, max(standardized, na.rm = TRUE) * 0.9,
sprintf("%.1f%% outside ±2\n%s", outside_2, qq_status),
col = ifelse(qq_status == "✓ PASS", "darkgreen", "orange"),
font = 2, cex = 1.1, adj = 0)
# 6. Histogram of residuals
hist(standardized, breaks = 15, col = "lightblue", border = "white",
main = "6. Histogram of Residuals",
xlab = "Standardized Residuals", ylab = "Frequency")
curve(dnorm(x, mean(standardized, na.rm=TRUE), sd(standardized, na.rm=TRUE)) *
length(standardized) * diff(range(standardized, na.rm=TRUE)) / 15,
add = TRUE, col = "red", lwd = 2)
shapiro_p <- shapiro.test(standardized)$p.value
shapiro_status <- ifelse(shapiro_p > 0.05, "✓ PASS", "⚠ CHECK")
text(mean(standardized, na.rm=TRUE),
max(hist(standardized, breaks = 15, plot = FALSE)$counts) * 0.9,
sprintf("Shapiro p = %.3f\n%s", shapiro_p, shapiro_status),
col = ifelse(shapiro_status == "✓ PASS", "darkgreen", "orange"),
font = 2, cex = 1.1)
# 7. Residual plot (Linearity)
plot(fitvalues, standardized, pch = 19, col = rgb(0, 0, 0, 0.5),
main = "7. Residual Plot (Linearity)",
xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red", lwd = 2)
abline(h = c(-2, 2), lty = 2, col = "gray")
lines(lowess(fitvalues, standardized), col = "blue", lwd = 2)
text(min(fitvalues, na.rm=TRUE), 2.5,
"Look for:\nRandom scatter",
col = "darkgreen", font = 2, adj = 0, cex = 0.9)
# 8. Scale-Location plot
sqrt_abs_resid <- sqrt(abs(standardized))
plot(fitvalues, sqrt_abs_resid, pch = 19, col = rgb(0, 0, 0, 0.5),
main = "8. Scale-Location\n(Homoscedasticity)",
xlab = "Fitted Values", ylab = "√|Residuals|")
lines(lowess(fitvalues, sqrt_abs_resid), col = "red", lwd = 2)
text(min(fitvalues, na.rm=TRUE), max(sqrt_abs_resid, na.rm=TRUE) * 0.9,
"Flat line = good",
col = "darkgreen", font = 2, adj = 0, cex = 0.9)
# Row 3: Homogeneity & Summary
# 9. Boxplot by Sex
boxplot(RS1 ~ Sex, data = noout,
col = c("lightblue", "lightpink"),
main = "9. RS1 by Sex",
ylab = "RS1 Score")
levene_sex_p <- leveneTest(noout$RS1, noout$Sex, center = median)$`Pr(>F)`[1]
levene_status <- ifelse(levene_sex_p > 0.05, "✓ PASS", "⚠ CHECK")
text(1.5, max(noout$RS1, na.rm=TRUE) * 0.95,
sprintf("Levene p = %.3f\n%s", levene_sex_p, levene_status),
col = ifelse(levene_status == "✓ PASS", "darkgreen", "orange"),
font = 2, cex = 1.1)
# 10. Boxplot by SES
boxplot(RS1 ~ SES, data = noout,
col = c("lightcoral", "lightyellow", "lightgreen"),
main = "10. RS1 by SES",
ylab = "RS1 Score")
levene_ses_p <- leveneTest(noout$RS1, noout$SES, center = median)$`Pr(>F)`[1]
levene_status2 <- ifelse(levene_ses_p > 0.05, "✓ PASS", "⚠ CHECK")
text(2, max(noout$RS1, na.rm=TRUE) * 0.95,
sprintf("Levene p = %.3f\n%s", levene_ses_p, levene_status2),
col = ifelse(levene_status2 == "✓ PASS", "darkgreen", "orange"),
font = 2, cex = 1.1)
# 11. Cook's Distance
cooks_d <- cooks.distance(fake)
plot(1:length(cooks_d), cooks_d, type = "h", col = "steelblue", lwd = 2,
main = "11. Cook's Distance",
xlab = "Observation", ylab = "Cook's D")
abline(h = 4/length(cooks_d), col = "red", lwd = 2, lty = 2)
influential <- sum(cooks_d > 4/length(cooks_d))
text(length(cooks_d)/2, max(cooks_d) * 0.9,
sprintf("%d influential\npoints", influential),
col = ifelse(influential == 0, "darkgreen", "orange"),
font = 2, cex = 1.1)
# 12. Summary Report
plot.new()
text(0.5, 0.95, "Assumption Check Summary", font = 2, cex = 1.5)
checks <- c(
sprintf("1. Independence: %s", ifelse(dw_stat >= 1.5 & dw_stat <= 2.5, "✓", "⚠")),
sprintf("2. Multicollinearity: %s", ifelse(max_vif < 10, "✓", "⚠")),
sprintf("3. Linearity: Visual check"),
sprintf("4. Normality: %s", ifelse(shapiro_p > 0.05, "✓", "⚠")),
sprintf("5. Homogeneity: %s", ifelse(levene_sex_p > 0.05 & levene_ses_p > 0.05, "✓", "⚠"))
)
y_pos <- seq(0.75, 0.35, length.out = 5)
for(i in 1:5) {
color <- ifelse(grepl("✓", checks[i]), "darkgreen", "orange")
text(0.1, y_pos[i], checks[i], adj = 0, cex = 1.2, col = color, font = 2)
}
text(0.5, 0.15, sprintf("Sample Size: N = %d", nrow(noout)),
cex = 1.2, font = 2)
overall_status <- sum(grepl("✓", checks))
text(0.5, 0.05,
sprintf("Overall: %d/5 checks passed", overall_status),
cex = 1.3, font = 2,
col = ifelse(overall_status >= 4, "darkgreen", "orange"))

How to read this dashboard:
Status indicators: - ✓ PASS (green) = Assumption met - ⚠ CHECK (orange) = Potential violation, investigate further
When assumptions are violated, transformations can often help [1]. Here are the most common transformations and when to use them:
When to use each transformation [1]:
| Problem | Transformation | R_Code |
|---|---|---|
| Positive skew (mild) | Square root: sqrt(x) | sqrt(variable) |
| Positive skew (moderate) | Log: log(x) or log(x + 1) | log(variable + 1) |
| Positive skew (severe) | Reciprocal: 1/x or 1/(x + 1) | 1 / (variable + 1) |
| Negative skew | Reflect then transform: max(x) - x, then apply above | max(variable) - variable |
| Heteroscedasticity | Log outcome variable | log(outcome + 1) |
| Non-linearity | Add polynomial terms or use non-linear model | lm(y ~ x + I(x^2)) |
Important notes [1]: - Always add a small constant (like 1) before log/reciprocal if data contain zeros - Transformations change the interpretation of your results - Report both original and transformed statistics - If transformations don’t help, consider robust or non-parametric methods
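These notes can be verified on simulated positively skewed data. The demo below computes skewness with the same population-moment formula `moments::skewness()` uses, so it runs without extra packages:

```r
# Skewness via the population-moment formula (as in moments::skewness)
skw <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

set.seed(11)
x <- rexp(1000)      # strong positive skew (theoretical value = 2)

skw(x)               # well above 0
skw(sqrt(x))         # square root: milder skew
skw(log(x + 1))      # log: the +1 guards against log(0) when data contain zeros

# The square root clearly reduces the skew of this sample
abs(skw(sqrt(x))) < abs(skw(x))   # TRUE
```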
| Violated_Assumption | Parametric_Fix | Non_Parametric_Alternative |
|---|---|---|
| Independence | Mixed models, time-series models | Account for clustering structure |
| Multicollinearity | Remove/combine predictors, PCA | Ridge/Lasso regression |
| Linearity | Add polynomial terms, splines | GAM (Generalized Additive Models) |
| Normality (small N) | Transform data, bootstrap | Mann-Whitney, Kruskal-Wallis |
| Normality (large N) | Proceed (CLT applies) | Standard tests are robust |
| Homogeneity | Welch’s correction | Mann-Whitney, Kruskal-Wallis |
| Homoscedasticity | Weighted least squares, robust SE | Rank-based tests |
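The homogeneity fixes in this table are one argument away in base R. A demo on two simulated groups with deliberately different spreads (hypothetical data):

```r
set.seed(21)
g1 <- rnorm(40, mean = 50, sd = 2)    # tight group
g2 <- rnorm(40, mean = 52, sd = 12)   # much more spread out

# Welch's t-test is R's DEFAULT: t.test() does not assume equal variances
welch <- t.test(g1, g2)
welch$method                          # "Welch Two Sample t-test"

# Student's t-test pools the variances -- only with var.equal = TRUE
t.test(g1, g2, var.equal = TRUE)

# Welch ANOVA analogue for 3+ groups: oneway.test() (var.equal = FALSE default)
scores <- c(g1, g2)
grp <- factor(rep(c("A", "B"), each = 40))
oneway.test(scores ~ grp)
```

Note that the Welch correction costs degrees of freedom: `welch$parameter` is well below the pooled df of 78, which is the price paid for not assuming homogeneity.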
Based on Field et al. [1]:
Using the mtcars dataset:
# Your code here
data(mtcars)
# Create fake regression
random_y <- rchisq(nrow(mtcars), 7)
fake_model <- lm(random_y ~ ., data = mtcars)
# Extract diagnostics
std_resid <- rstudent(fake_model)
fitted_vals <- scale(fake_model$fitted.values)
# 1. Independence: Durbin-Watson
durbinWatsonTest(fake_model)
# 2. Multicollinearity: VIF
vif(fake_model)
# 3. Linearity: Q-Q plot
qqnorm(std_resid)
qqline(std_resid)
# 4. Normality: Shapiro-Wilk
shapiro.test(std_resid)
# 5. Homoscedasticity: Residual plot
plot(fitted_vals, std_resid)
abline(h = 0, col = "red")

Given the following residual plots, identify which assumption is violated:
Questions: - Plot A: Which assumption is violated? - Plot B: Which assumption is violated? - Plot C: Are any assumptions violated? - Plot D: What might this pattern indicate?
Answers: - Plot A: Homoscedasticity (funnel pattern - variance increases with fitted values) - Plot B: Linearity (curved pattern - non-linear relationship) - Plot C: None - random scatter indicates assumptions are met - Plot D: Missing predictor or two distinct groups not accounted for in the model
Given the following VIF values, which variables should you be concerned about?
| Variable | VIF |
|---|---|
| Age | 2.3 |
| Income | 15.7 |
| Education | 3.1 |
| Experience | 14.2 |
| Salary | 18.9 |
Question: Which variables have problematic multicollinearity? What would you do?
Answer: Income (VIF = 15.7), Experience (VIF = 14.2), and Salary (VIF = 18.9) all exceed the threshold of 10, indicating serious multicollinearity [2]. These variables likely measure similar constructs (e.g., Income and Salary are probably highly correlated). Solution: Remove one or two of these variables, or combine them into a single “economic status” composite variable.
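The composite-variable fix can be demonstrated with simulated stand-ins for the exercise variables (all names and numbers here are hypothetical). VIF is computed by hand as 1/(1 − R²) from regressing each predictor on the others, so no extra packages are needed:

```r
set.seed(33)
n <- 200
base_pay   <- rnorm(n, 60, 10)
income     <- base_pay + rnorm(n, sd = 2)   # near-duplicate of base_pay
salary     <- base_pay + rnorm(n, sd = 2)   # another near-duplicate
experience <- rnorm(n, 10, 3)               # unrelated predictor

# VIF by hand: regress each predictor on the rest, VIF = 1 / (1 - R^2)
vif_hand <- function(df) {
  sapply(names(df), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(df), v), v), data = df))$r.squared
    1 / (1 - r2)
  })
}

vif_hand(data.frame(income, salary, experience))   # income/salary inflated

# Fix: average the z-scored duplicates into one composite
composite <- data.frame(econ = rowMeans(scale(cbind(income, salary))),
                        experience = experience)
vif_hand(composite)                                # all VIFs near 1
```

The same `vif_hand()` values match `car::vif()` on the corresponding full model; averaging z-scores (rather than raw values) keeps the duplicates on a common scale before combining them.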
Additivity: The assumption that predictors contribute independently to the outcome; violated when multicollinearity is present [2].
Central Limit Theorem: States that in large samples (N > 30), the sampling distribution tends to be normal regardless of the population distribution [6].
Durbin-Watson test: Tests for autocorrelation in residuals; values near 2 indicate independence [1].
Fake regression: A regression model created solely for diagnostic purposes, not to test hypotheses [1].
Homogeneity of variance: Equal variances across categorical groups; tested with Levene’s test [7].
Homoscedasticity: Equal variance of residuals across the range of predicted values [8].
Independence: The assumption that observations (or residuals) are not correlated with each other [1].
Kurtosis: Measure of the “peakedness” of a distribution; excess kurtosis = 0 for normal distributions [5].
Levene’s test: Statistical test for homogeneity of variance across groups [7].
Linearity: The assumption that relationships between variables are linear (straight lines) rather than curved [3].
Multicollinearity: High correlation between predictor variables (r > .80 or .90); makes coefficients unstable [2].
Normality: The assumption that the sampling distribution (not necessarily raw data) is normally distributed [6].
Q-Q plot: Quantile-quantile plot comparing observed data to theoretical normal distribution [5].
Residual: The difference between observed and predicted values; also called “error” [4].
Shapiro-Wilk test: Statistical test for normality; significant p-values indicate non-normality [5].
Skewness: Measure of asymmetry in a distribution; 0 = symmetric, positive = right skew, negative = left skew [5].
Standardized residual: Residual converted to z-score (mean = 0, SD = 1) for easier interpretation [4].
Variance Inflation Factor (VIF): Measure of multicollinearity; VIF > 10 indicates serious problems [2].
[1] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London: SAGE Publications, 2012.
[2] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 274-276. [Multicollinearity and additivity]
[3] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 293-295. [Linearity assumption and residual plots]
[4] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 269-271. [Standardized residuals and error distributions]
[5] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 179-182. [Q-Q plots and assessing normality]
[6] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 168-169. [Central Limit Theorem and normality assumption]
[7] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 185-188. [Homogeneity of variance and Levene’s test]
[8] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London, UK: SAGE Publications, 2012, pp. 272-273, 293. [Homoscedasticity and residual plots]
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DiagrammeR_1.0.11 kableExtra_1.4.0 knitr_1.50 corrplot_0.95
## [5] moments_0.14.1 car_3.1-3 carData_3.0-5 mice_3.19.0
## [9] rio_1.2.4 lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0
## [13] dplyr_1.1.4 purrr_1.2.0 readr_2.1.6 tidyr_1.3.1
## [17] tibble_3.3.0 ggplot2_4.0.1 tidyverse_2.0.0 seedhash_0.1.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.2 farver_2.1.2 R.utils_2.13.0
## [5] S7_0.2.1 fastmap_1.2.0 digest_0.6.37 rpart_4.1.24
## [9] timechange_0.3.0 lifecycle_1.0.4 survival_3.8-3 magrittr_2.0.4
## [13] compiler_4.5.2 rlang_1.1.6 sass_0.4.10 tools_4.5.2
## [17] yaml_2.3.10 data.table_1.17.8 labeling_0.4.3 htmlwidgets_1.6.4
## [21] xml2_1.5.1 RColorBrewer_1.1-3 abind_1.4-8 withr_3.0.2
## [25] R.oo_1.27.1 nnet_7.3-20 grid_4.5.2 jomo_2.7-6
## [29] scales_1.4.0 iterators_1.0.14 MASS_7.3-65 cli_3.6.5
## [33] rmarkdown_2.30 reformulas_0.4.2 generics_0.1.4 rstudioapi_0.17.1
## [37] tzdb_0.5.0 visNetwork_2.1.4 minqa_1.2.8 cachem_1.1.0
## [41] splines_4.5.2 vctrs_0.6.5 boot_1.3-32 glmnet_4.1-10
## [45] Matrix_1.7-4 jsonlite_2.0.0 hms_1.1.4 mitml_0.4-5
## [49] Formula_1.2-5 systemfonts_1.3.1 foreach_1.5.2 jquerylib_0.1.4
## [53] glue_1.8.0 nloptr_2.2.1 pan_1.9 codetools_0.2-20
## [57] stringi_1.8.7 shape_1.4.6.1 gtable_0.3.6 lme4_1.1-38
## [61] pillar_1.11.1 htmltools_0.5.8.1 R6_2.6.1 textshaping_1.0.4
## [65] Rdpack_2.6.4 evaluate_1.0.5 lattice_0.22-7 R.methodsS3_1.8.2
## [69] rbibutils_2.4 backports_1.5.0 broom_1.0.11 bslib_0.9.0
## [73] Rcpp_1.1.0 svglite_2.2.2 nlme_3.1-168 mgcv_1.9-3
## [77] xfun_0.54 pkgconfig_2.0.3
End of Week 07: R for Data Analytics Tutorial
ANLY 500 - Analytics I
Harrisburg University