This Week 04 tutorial builds on Weeks 02-03 and focuses on statistical inference and effect sizes in R, turning the lecture concepts from 04_introDA_3.rmd into hands-on, code-first workflows.
Reference: This tutorial draws heavily from Field, Miles, and Field [1], Discovering Statistics Using R, with IEEE-style citations throughout.
Upgrade note: This hands-on tutorial extends the Week 04 lecture slides in 04_introDA_3.rmd with code-first workflows and beginner-friendly explanations.
Install each package with install.packages("X") once, then load it with library(X) in each session.

By the end of this tutorial, you will be able to:
# Install if needed
# install.packages(c("tidyverse", "psych", "effectsize", "pwr", "MOTE", "kableExtra", "moments"))
library(tidyverse)
library(dplyr)
library(tidyr)
library(ggplot2)
library(psych)
library(knitr)
library(kableExtra)
library(moments)
# Optional packages for effect sizes and power
if (requireNamespace("effectsize", quietly = TRUE)) library(effectsize)
if (requireNamespace("pwr", quietly = TRUE)) library(pwr)
if (requireNamespace("MOTE", quietly = TRUE)) library(MOTE)
cat("\n=== REPRODUCIBLE SEED INFORMATION ===")
##
## === REPRODUCIBLE SEED INFORMATION ===
##
## Generator Name: Week04 Statistical Inference - ANLY500
##
## MD5 Hash: 7f1e7bba7e5c80765f0d073729b59e76
##
## Available Seeds: 469688551, -919646717, 521158921, -608947294, 124076639 ...
What is this equation? Field et al. [1, Ch. 2] explain that everything in statistics essentially boils down to one simple idea:
\[Outcome_i = Model + Error_i\]
In plain English: The data we observe can be predicted from:

1. A model we choose to fit (e.g., the mean, a regression line)
2. Some amount of error (the difference between what we predicted and what we actually observed)
Why does this matter? Every statistical technique—from simple averages to complex machine learning—follows this pattern. Understanding it helps you see the unity across all of statistics.
The simplest statistical model is the mean (average). It’s a single number that tries to represent an entire dataset [1, Ch. 2].
Example from Field et al. [1]: Consider 5 statistics lecturers and their number of friends: 1, 2, 3, 3, 4
# The lecturer friends example from Field et al. [1]
lecturer_friends <- c(1, 2, 3, 3, 4)
# Calculate the mean
mean_friends <- mean(lecturer_friends)
cat("Mean number of friends:", mean_friends, "\n")
## Mean number of friends: 2.6
Key insight [1, Ch. 2]: The mean of 2.6 friends is impossible in reality (you can’t have 0.6 of a friend!). This illustrates that the mean is a model—a hypothetical value created to summarize our data, not an actual observation.
# Create a data frame for visualization
lecturer_df <- data.frame(
lecturer = 1:5,
friends = lecturer_friends,
model = mean_friends
)
# Calculate the errors (deviations from the mean)
lecturer_df$error <- lecturer_df$friends - lecturer_df$model
# Display the data
lecturer_df %>%
kable(caption = "Lecturer Friends: Observed Data, Model (Mean), and Errors",
digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| lecturer | friends | model | error |
|---|---|---|---|
| 1 | 1 | 2.6 | -1.6 |
| 2 | 2 | 2.6 | -0.6 |
| 3 | 3 | 2.6 | 0.4 |
| 4 | 3 | 2.6 | 0.4 |
| 5 | 4 | 2.6 | 1.4 |
# Visualize
ggplot(lecturer_df, aes(x = lecturer, y = friends)) +
geom_point(size = 4, color = "steelblue") +
geom_hline(yintercept = mean_friends, color = "red", linewidth = 1.5, linetype = "dashed") +
geom_segment(aes(xend = lecturer, yend = model), color = "darkgreen", linewidth = 1) +
annotate("text", x = 5.3, y = mean_friends, label = paste("Mean =", mean_friends),
color = "red", hjust = 0) +
labs(title = "The Mean as a Statistical Model",
subtitle = "Red line = Model (mean); Green lines = Errors (deviations)",
x = "Lecturer", y = "Number of Friends") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Interpretation:

- The red dashed line is our model (the mean = 2.6)
- The green vertical lines show the errors—how far each observation is from the model
- Some errors are positive (above the mean), some are negative (below)
The problem: If we just add up all the errors (deviations), they cancel out to zero! [1, Ch. 2]
# Calculate deviations from the mean
deviations <- lecturer_friends - mean_friends
cat("Individual deviations:", deviations, "\n")
## Individual deviations: -1.6 -0.6 0.4 0.4 1.4
## Sum of deviations: -4.440892e-16
The solution: Square each deviation before adding them up. Squaring makes all values positive [1, Ch. 2].
What is SS? The Sum of Squared Errors (also called Sum of Squares) measures the total amount of error in our model [1, Ch. 2].
\[SS = \sum_{i=1}^{n}(x_i - \bar{x})^2\]
In plain English:

1. Calculate how far each value is from the mean (deviation)
2. Square each deviation (removes negatives)
3. Add them all up
# Step-by-step SS calculation
deviations <- lecturer_friends - mean_friends
squared_deviations <- deviations^2
SS <- sum(squared_deviations)
# Display the calculation
ss_df <- data.frame(
Value = lecturer_friends,
Mean = mean_friends,
Deviation = deviations,
Squared_Deviation = squared_deviations
)
ss_df %>%
kable(caption = "Step-by-Step Sum of Squares Calculation",
digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Value | Mean | Deviation | Squared_Deviation |
|---|---|---|---|
| 1 | 2.6 | -1.6 | 2.56 |
| 2 | 2.6 | -0.6 | 0.36 |
| 3 | 2.6 | 0.4 | 0.16 |
| 3 | 2.6 | 0.4 | 0.16 |
| 4 | 2.6 | 1.4 | 1.96 |
##
## Sum of Squared Errors (SS) = 5.2
Interpretation: The SS of 5.20 tells us there IS error in our model, even though the raw deviations summed to zero.
The problem with SS: It depends on how much data you have. More data points = larger SS, even if the model fits equally well [1, Ch. 2].
The solution: Calculate the average squared error by dividing SS by the degrees of freedom.
\[\text{Variance } (s^2) = \frac{SS}{N-1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{N-1}\]
Why N-1 instead of N? This is where degrees of freedom come in!
What are degrees of freedom (df)? Field et al. [1, Ch. 2] use a brilliant analogy:
Imagine you’re a rugby team manager with 15 empty position slots. When the first player arrives, you have 15 choices for their position. When the second arrives, you have 14 choices. Continue until 14 positions are filled. When the 15th player arrives, you have NO choice—only one position remains!
Degrees of freedom = 15 - 1 = 14
In statistics: When we use the sample mean to estimate the population mean, we “use up” one degree of freedom. Only N-1 values are free to vary while keeping that mean constant [1, Ch. 2].
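The rugby analogy can be made concrete with a tiny sketch (using the lecturer-friends numbers from above): once the mean is fixed, you may pick N - 1 values freely, but the last value is then forced.

```r
# Sketch: with the mean held constant, only N - 1 scores are free to vary
fixed_mean <- 2.6
n <- 5
free_values <- c(1, 2, 3, 3)                     # choose any 4 values freely
last_value <- fixed_mean * n - sum(free_values)  # the 5th has no freedom left
last_value                                       # 13 - 9 = 4
mean(c(free_values, last_value))                 # recovers 2.6 exactly
```

Change any of the four free values and the forced fifth value changes with them; that single lost "choice" is the one degree of freedom spent on estimating the mean.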
# Calculate variance
n <- length(lecturer_friends)
df <- n - 1
variance <- SS / df
cat("Sample size (N):", n, "\n")
## Sample size (N): 5
## Degrees of freedom (df = N-1): 4
## Variance (SS/df): 1.3
## R's var() function: 1.3
Interpretation: The variance of 1.3 is in “friends squared” units—which doesn’t make intuitive sense!
The problem with variance: It’s in squared units (friends², dollars², etc.)—hard to interpret [1, Ch. 2].
The solution: Take the square root to get the Standard Deviation (SD).
\[SD = \sqrt{\text{Variance}} = \sqrt{\frac{SS}{N-1}}\]
# Calculate standard deviation
std_dev <- sqrt(variance)
cat("Standard Deviation:", round(std_dev, 2), "friends\n")
## Standard Deviation: 1.14 friends
## R's sd() function: 1.14
Interpretation: On average, the number of friends deviates from the mean by about 1.14 friends. This is now in interpretable units!
According to Field et al. [1, Ch. 2]:
# Compare two distributions with same mean but different SDs
set.seed(seeds[2])
# Consistent lecturer (small SD)
consistent <- rnorm(100, mean = 3, sd = 0.5)
# Variable lecturer (large SD)
variable <- rnorm(100, mean = 3, sd = 1.5)
# Create comparison data
comparison_df <- data.frame(
Rating = c(consistent, variable),
Lecturer = rep(c("Consistent (SD = 0.5)", "Variable (SD = 1.5)"), each = 100)
)
# Visualize
ggplot(comparison_df, aes(x = Rating, fill = Lecturer)) +
geom_histogram(bins = 20, alpha = 0.7, position = "identity") +
geom_vline(xintercept = 3, color = "red", linewidth = 1.5, linetype = "dashed") +
facet_wrap(~Lecturer, ncol = 1) +
labs(title = "Same Mean, Different Standard Deviations",
subtitle = "The mean (red line) represents both datasets equally, but with different accuracy",
x = "Rating", y = "Frequency") +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold"))

Key insight [1, Ch. 2]: Both distributions have the same mean (3.0), but the consistent lecturer’s ratings cluster tightly around it (small SD), while the variable lecturer’s ratings are spread out (large SD). The mean is a better model for the consistent lecturer!
Let’s apply these concepts to real data using R’s apply
family of functions and modern tidyverse approaches.
# Using the built-in quakes dataset
data(quakes)
# Traditional apply approach
cat("=== Using apply() ===\n")
round(apply(quakes, 2, mean), 2)          # apply: column means
cat("\n=== Using lapply() ===\n")
lapply(quakes[c("mag", "depth")],         # lapply: returns a named list
       function(x) c(mean = mean(x), sd = sd(x)))
cat("\n=== Using sapply() ===\n")
round(sapply(quakes, mean), 2)            # sapply: simplifies to a vector
cat("\n=== Using tapply() - Mean magnitude by station count ===\n")
head(tapply(quakes$mag, quakes$stations, mean), 10)  # tapply: mean by group
## === Using apply() ===
## lat long depth mag stations
## -20.64 179.46 311.37 4.62 33.42
##
## === Using lapply() ===
## $mag
## mean sd
## 4.620400 0.402773
##
## $depth
## mean sd
## 311.3710 215.5355
##
## === Using sapply() ===
## lat long depth mag stations
## -20.64 179.46 311.37 4.62 33.42
##
## === Using tapply() - Mean magnitude by station count ===
## 10 11 12 13 14 15 16 17
## 4.230000 4.228571 4.196000 4.333333 4.276923 4.282353 4.274286 4.350000
## 18 19
## 4.436364 4.379310
The tidyverse provides more readable alternatives [1, Ch. 3]:
# Same calculations using tidyverse (more readable)
quakes_summary <- quakes %>%
summarise(across(everything(),
list(mean = mean, sd = sd, var = var),
.names = "{.col}_{.fn}")) %>%
pivot_longer(everything(),
names_to = c("variable", "statistic"),
names_sep = "_") %>%
pivot_wider(names_from = statistic, values_from = value) %>%
mutate(across(where(is.numeric), ~round(.x, 2)))
quakes_summary %>%
kable(caption = "Quakes Dataset: Descriptive Statistics") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| variable | mean | sd | var |
|---|---|---|---|
| lat | -20.64 | 5.03 | 25.29 |
| long | 179.46 | 6.07 | 36.84 |
| depth | 311.37 | 215.54 | 46455.55 |
| mag | 4.62 | 0.40 | 0.16 |
| stations | 33.42 | 21.90 | 479.63 |
This is one of the most commonly confused concepts in statistics! Field et al. [1, Ch. 2] clarify:
| Concept | Measures | Question It Answers |
|---|---|---|
| Standard Deviation (SD) | Spread of individual scores around the sample mean | “How much do individual data points vary?” |
| Standard Error (SE) | Spread of sample means around the population mean | “How much would the sample mean vary if we took many samples?” |
The problem: We usually can’t access the entire population. We only have a sample [1, Ch. 2].
The insight: If we took many samples from the same population, each sample would have a slightly different mean (this is called sampling variation).
The solution: The Standard Error tells us how much those sample means would typically vary.
\[SE = \frac{s}{\sqrt{N}}\]
Where:

- s = sample standard deviation
- N = sample size
# Calculate SE for quakes magnitude
mag_mean <- mean(quakes$mag)
mag_sd <- sd(quakes$mag)
mag_n <- length(quakes$mag)
mag_se <- mag_sd / sqrt(mag_n)
cat("Sample Mean:", round(mag_mean, 3), "\n")
## Sample Mean: 4.62
## Sample SD: 0.403
## Sample Size: 1000
## Standard Error: 0.0127
According to Field et al. [1, Ch. 2]:
Key insight: SE gets smaller as sample size increases! This is why larger samples give more precise estimates.
# Demonstrate how SE decreases with sample size
set.seed(seeds[3])
population <- rnorm(10000, mean = 50, sd = 10)
sample_sizes <- c(10, 30, 50, 100, 500, 1000)
se_by_n <- sapply(sample_sizes, function(n) {
sample_data <- sample(population, n)
sd(sample_data) / sqrt(n)
})
se_df <- data.frame(
Sample_Size = sample_sizes,
Standard_Error = round(se_by_n, 3)
)
se_df %>%
kable(caption = "Standard Error Decreases with Sample Size") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Sample_Size | Standard_Error |
|---|---|
| 10 | 4.287 |
| 30 | 2.075 |
| 50 | 1.240 |
| 100 | 0.963 |
| 500 | 0.441 |
| 1000 | 0.334 |
# Visualize
ggplot(se_df, aes(x = Sample_Size, y = Standard_Error)) +
geom_line(color = "steelblue", linewidth = 1.5) +
geom_point(size = 3, color = "darkblue") +
labs(title = "Standard Error Decreases with Sample Size",
subtitle = "Larger samples provide more precise estimates of the population mean",
x = "Sample Size (N)", y = "Standard Error") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))

What is a sampling distribution? It’s the distribution of sample means from many samples of the same population [1, Ch. 2].
The Central Limit Theorem (CLT) states [1, Ch. 2]:

1. As sample size increases (N > 30), the sampling distribution approaches a normal distribution
2. The mean of the sampling distribution equals the population mean (μ)
3. The SD of the sampling distribution equals σ/√N (the standard error)
# Demonstrate the Central Limit Theorem
set.seed(seeds[4])
# Population (skewed - NOT normal)
population <- rexp(100000, rate = 0.5) # Exponential distribution
pop_mean <- mean(population)
# Take many samples of different sizes and calculate means
simulate_sampling <- function(n, n_samples = 1000) {
sample_means <- replicate(n_samples, mean(sample(population, n)))
return(sample_means)
}
samples_10 <- simulate_sampling(10)
samples_30 <- simulate_sampling(30)
samples_100 <- simulate_sampling(100)
# Create visualization data
sampling_df <- data.frame(
Mean = c(samples_10, samples_30, samples_100),
Sample_Size = factor(rep(c("N = 10", "N = 30", "N = 100"), each = 1000),
levels = c("N = 10", "N = 30", "N = 100"))
)
# Plot
ggplot(sampling_df, aes(x = Mean)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white", alpha = 0.8) +
geom_vline(xintercept = pop_mean, color = "red", linewidth = 1.5, linetype = "dashed") +
facet_wrap(~Sample_Size, ncol = 1, scales = "free_y") +
labs(title = "Central Limit Theorem in Action",
subtitle = "Even from a skewed population, sampling distributions become normal with larger N",
x = "Sample Mean", y = "Frequency") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Key insight: The original population was heavily skewed (exponential), but the sampling distribution of means becomes more normal as sample size increases!
Definition [1, Ch. 2]: A confidence interval is a range of values within which we believe the population parameter falls, with a specified probability (usually 95%).
Common misconception: A 95% CI does NOT mean “there’s a 95% chance the true mean is in this interval.” Instead, it means: “If we repeated this study 100 times, about 95 of those CIs would contain the true population mean” [1, Ch. 2].
Field et al. [1, Ch. 2] use a memorable example:

- Sample mean sperm release = 17 million
- True population mean (unknown to researchers) = 15 million
- Wide interval (12-22 million): Contains true value ✓
- Narrow interval (16-18 million): Misses true value ✗
The trade-off: Narrower CIs are more precise but may miss the true value more often.
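The "95 out of 100 CIs contain the true mean" interpretation can be checked directly by simulation. This is a sketch, not part of the Field et al. example: the population parameters, seed, and sample size below are all illustrative choices.

```r
# Simulation sketch: repeat a "study" many times and count how often the
# 95% CI captures the true (known) population mean
set.seed(123)                      # illustrative seed
true_mean <- 15; true_sd <- 4; n <- 50
covers <- replicate(1000, {
  s  <- rnorm(n, mean = true_mean, sd = true_sd)
  se <- sd(s) / sqrt(n)
  ci <- mean(s) + c(-1, 1) * qt(0.975, df = n - 1) * se
  ci[1] <= true_mean && true_mean <= ci[2]   # did this CI capture mu?
})
mean(covers)  # close to 0.95
```

The proportion of capturing intervals hovers near 0.95, which is exactly the long-run guarantee a 95% CI makes; any single interval either contains the true mean or it doesn't.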
The formula for a 95% CI [1, Ch. 2]:
\[CI_{95\%} = \bar{X} \pm (z \times SE)\]
Where:

- \(\bar{X}\) = sample mean
- z = 1.96 for 95% CI (from normal distribution)
- SE = standard error
For small samples (N < 30), use the t-distribution instead of z.
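To see why the t-distribution matters for small samples, compare its 97.5th-percentile critical values with z = 1.96 (the sample sizes below are just illustrative points):

```r
# t critical values for a 95% CI shrink toward z = 1.96 as df grows
sample_sizes <- c(5, 10, 30, 100, 1000)
round(qt(0.975, df = sample_sizes - 1), 3)  # t-based multipliers
round(qnorm(0.975), 3)                      # z-based multiplier: 1.96
```

With N = 5 the multiplier is about 2.78, so small-sample CIs built with z = 1.96 would be too narrow; by N = 30 the t and z values are already close.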
# Calculate 95% CI for quakes magnitude
mag_mean <- mean(quakes$mag)
mag_se <- sd(quakes$mag) / sqrt(length(quakes$mag))
# For large samples, z = 1.96 for 95% CI
z_95 <- 1.96
ci_lower <- mag_mean - z_95 * mag_se
ci_upper <- mag_mean + z_95 * mag_se
cat("=== 95% Confidence Interval for Quake Magnitude ===\n")
## === 95% Confidence Interval for Quake Magnitude ===
## Sample Mean: 4.62
## Standard Error: 0.0127
## 95% CI: [ 4.595 , 4.645 ]
##
## Interpretation: We are 95% confident that the true population mean
## earthquake magnitude falls between 4.595 and 4.645
# Calculate CIs for all numeric variables
quakes_ci <- quakes %>%
summarise(across(everything(),
list(
mean = mean,
se = ~sd(.x)/sqrt(length(.x)),
lower = ~mean(.x) - 1.96 * sd(.x)/sqrt(length(.x)),
upper = ~mean(.x) + 1.96 * sd(.x)/sqrt(length(.x))
),
.names = "{.col}_{.fn}")) %>%
pivot_longer(everything(),
names_to = c("variable", "statistic"),
names_sep = "_(?=[^_]+$)") %>%
pivot_wider(names_from = statistic, values_from = value) %>%
mutate(across(where(is.numeric), ~round(.x, 2)))
quakes_ci %>%
kable(caption = "95% Confidence Intervals for Quakes Dataset Variables") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| variable | mean | se | lower | upper |
|---|---|---|---|---|
| lat | -20.64 | 0.16 | -20.95 | -20.33 |
| long | 179.46 | 0.19 | 179.09 | 179.84 |
| depth | 311.37 | 6.82 | 298.01 | 324.73 |
| mag | 4.62 | 0.01 | 4.60 | 4.65 |
| stations | 33.42 | 0.69 | 32.06 | 34.78 |
# Visualize CIs by region (using mtcars as example)
data(mtcars)
# Calculate CI for MPG by cylinder count
mpg_ci <- mtcars %>%
group_by(cyl) %>%
summarise(
n = n(),
mean = mean(mpg),
sd = sd(mpg),
se = sd / sqrt(n),
ci_lower = mean - 1.96 * se,
ci_upper = mean + 1.96 * se
)
mpg_ci %>%
kable(caption = "MPG by Cylinder Count with 95% CIs", digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| cyl | n | mean | sd | se | ci_lower | ci_upper |
|---|---|---|---|---|---|---|
| 4 | 11 | 26.66 | 4.51 | 1.36 | 24.00 | 29.33 |
| 6 | 7 | 19.74 | 1.45 | 0.55 | 18.67 | 20.82 |
| 8 | 14 | 15.10 | 2.56 | 0.68 | 13.76 | 16.44 |
# Visualize with error bars
ggplot(mpg_ci, aes(x = factor(cyl), y = mean)) +
geom_col(fill = "steelblue", alpha = 0.7) +
geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper),
width = 0.2, linewidth = 1) +
labs(title = "Mean MPG by Cylinder Count with 95% Confidence Intervals",
subtitle = "Non-overlapping CIs suggest significant differences",
x = "Number of Cylinders", y = "Miles Per Gallon (MPG)") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Interpretation: When CIs don’t overlap, it suggests the groups are significantly different [1, Ch. 2]. Here, 4-cylinder cars have significantly higher MPG than 6- or 8-cylinder cars.
Field et al. [1, Ch. 2] recount Fisher’s famous example:
A woman claimed she could tell whether milk or tea was added first to a cup. Fisher tested her:

- With 2 cups: 50% chance of guessing correctly (not convincing!)
- With 6 cups: Only 5% chance of guessing correctly (1 in 20)

Only when there’s a very small probability of success by chance should we believe it’s genuine.
This is the foundation of significance testing!
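Fisher's probabilities follow from simple counting: with equal numbers of "milk-first" and "tea-first" cups, there is exactly one correct way to pick which cups were milk-first, so the chance of a perfect performance by guessing is one over the number of possible picks.

```r
# Probability of labelling every cup correctly purely by chance
1 / choose(2, 1)  # 2 cups (1 milk-first, 1 tea-first): 0.5
1 / choose(6, 3)  # 6 cups (3 of each): 0.05, i.e., 1 in 20
```

The 6-cup design is exactly where the chance of success by guessing drops to the familiar 5% threshold.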
Null Hypothesis (H₀): No effect exists; any observed difference is due to chance [1, Ch. 2].
Alternative Hypothesis (H₁): An effect does exist; the observed difference is real [1, Ch. 2].
Decision rules:

- If p < 0.05: Reject H₀ → “statistically significant”
- If p ≥ 0.05: Fail to reject H₀ → “not statistically significant”
Important [1, Ch. 2]: We never “accept” the null or “prove” the alternative. We only reject or fail to reject.
Fisher [2] suggested we should be 95% confident before believing a result is genuine. But Field et al. [1, Ch. 2] note an important historical fact:
Fisher himself warned: “No scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”
The .05 cutoff became popular because Fisher published tables for these specific values—it was practical, not principled [1, Ch. 2]!
Question: Is the mean earthquake magnitude different from 4.5?
# One-sample t-test
# H₀: μ = 4.5 (mean magnitude equals 4.5)
# H₁: μ ≠ 4.5 (mean magnitude differs from 4.5)
test_value <- 4.5
one_t <- t.test(quakes$mag, mu = test_value)
cat("=== ONE-SAMPLE T-TEST ===\n")
## === ONE-SAMPLE T-TEST ===
## H₀: Mean magnitude = 4.5
## H₁: Mean magnitude ≠ 4.5
##
## One Sample t-test
##
## data: quakes$mag
## t = 9.4529, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 4.5
## 95 percent confidence interval:
## 4.595406 4.645394
## sample estimates:
## mean of x
## 4.6204
##
## === INTERPRETATION ===
## Sample mean: 4.62
## Test statistic (t): 9.453
## p-value: < 2.2e-16
## 95% CI: [ 4.595 , 4.645 ]
if(one_t$p.value < 0.05) {
cat("Decision: REJECT H₀ (p < 0.05)\n")
cat("Conclusion: The mean magnitude is significantly different from", test_value, "\n")
} else {
cat("Decision: FAIL TO REJECT H₀ (p ≥ 0.05)\n")
cat("Conclusion: Insufficient evidence that the mean differs from", test_value, "\n")
}
## Decision: REJECT H₀ (p < 0.05)
## Conclusion: The mean magnitude is significantly different from 4.5
Question: Do manual transmission cars have different fuel efficiency than automatic transmission cars?
# Prepare data
automatic <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]
cat("=== INDEPENDENT SAMPLES T-TEST ===\n")
## === INDEPENDENT SAMPLES T-TEST ===
## H₀: Mean MPG(manual) = Mean MPG(automatic)
## H₁: Mean MPG(manual) ≠ Mean MPG(automatic)
# Welch t-test (doesn't assume equal variances) - recommended by Field et al. [1]
two_t <- t.test(manual, automatic, var.equal = FALSE)
print(two_t)
##
## Welch Two Sample t-test
##
## data: manual and automatic
## t = 3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.209684 11.280194
## sample estimates:
## mean of x mean of y
## 24.39231 17.14737
##
## === DESCRIPTIVE STATISTICS ===
cat("Automatic: n =", length(automatic), ", Mean =", round(mean(automatic), 2),
", SD =", round(sd(automatic), 2), "\n")
## Automatic: n = 19 , Mean = 17.15 , SD = 3.83
cat("Manual: n =", length(manual), ", Mean =", round(mean(manual), 2),
", SD =", round(sd(manual), 2), "\n")
## Manual: n = 13 , Mean = 24.39 , SD = 6.17
## Difference: 7.24 MPG
if(two_t$p.value < 0.05) {
cat("Decision: REJECT H₀ (p < 0.05)\n")
cat("Conclusion: Manual and automatic transmissions have significantly different MPG.\n")
} else {
cat("Decision: FAIL TO REJECT H₀ (p ≥ 0.05)\n")
cat("Conclusion: No significant difference in MPG between transmission types.\n")
}
## Decision: REJECT H₀ (p < 0.05)
## Conclusion: Manual and automatic transmissions have significantly different MPG.
# Box plot comparison
mtcars_plot <- mtcars %>%
mutate(Transmission = factor(am, labels = c("Automatic", "Manual")))
ggplot(mtcars_plot, aes(x = Transmission, y = mpg, fill = Transmission)) +
geom_boxplot(alpha = 0.7) +
geom_jitter(width = 0.1, alpha = 0.5) +
stat_summary(fun = mean, geom = "point", shape = 18, size = 4, color = "red") +
labs(title = "Fuel Efficiency by Transmission Type",
subtitle = paste("Welch t-test: t =", round(two_t$statistic, 2),
", p =", format.pval(two_t$p.value, digits = 3)),
x = "Transmission Type", y = "Miles Per Gallon (MPG)") +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold"))

Field et al. [1, Ch. 2] emphasize three critical points:
“Just because a test is significant doesn’t mean the effect is important! Very small, unimportant effects can be ‘significant’ with large samples.” [1, Ch. 2]
“A non-significant result only means the effect wasn’t large enough to detect. It does NOT prove the null hypothesis is true.” [1, Ch. 2]
Cohen [3] famously stated: “The null hypothesis is never true.”
“If p < .05, we’re saying ‘this is unlikely IF the null were true.’ We cannot logically conclude ‘therefore the null is definitely false.’” [1, Ch. 2]
When we make a decision about H₀, there are four possible outcomes [1, Ch. 2]:
| Decision | H₀ is Actually TRUE | H₀ is Actually FALSE |
|---|---|---|
| Reject H₀ | Type I Error (α) - False Positive | Correct Decision (Power) |
| Fail to Reject H₀ | Correct Decision | Type II Error (β) - False Negative |
What is it? Rejecting H₀ when it’s actually true—concluding there’s an effect when there isn’t (false positive) [1, Ch. 2].
Probability: α (typically set at 0.05)
Real-world example: Declaring a drug effective when it actually does nothing.
At α = .05: If we replicate 100 studies with NO real effect, we’d falsely detect an “effect” about 5 times!
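The "about 5 false positives per 100 null studies" claim can be verified by simulating t-tests on data with no real effect. This is a sketch: the seed, group size, and number of simulated studies are illustrative choices, and exact counts will vary run to run.

```r
# Simulation sketch: many "studies" in which H0 is TRUE (no real effect)
set.seed(42)                       # illustrative seed
false_positive <- replicate(5000, {
  g1 <- rnorm(30)                  # both groups drawn from the SAME population
  g2 <- rnorm(30)
  t.test(g1, g2)$p.value < 0.05    # "significant" purely by chance?
})
mean(false_positive)  # close to alpha = 0.05
```

Roughly 5% of these null studies come out "significant", which is precisely the Type I error rate that α = .05 buys us.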
What is it? Failing to reject H₀ when it’s actually false—missing a real effect (false negative) [1, Ch. 2].
Probability: β (Cohen [4] recommends maximum β = 0.20)
Real-world example: Declaring a drug ineffective when it actually works.
At β = .20: We would miss 1 in 5 genuine effects.
What is power? The probability of correctly detecting a real effect [1, Ch. 2].
\[Power = 1 - \beta\]
Cohen’s [4] recommendation: Aim for power ≥ .80 (80% chance of detecting real effects).
Key principle: Knowing any three, you can calculate the fourth!
# Power analysis for independent samples t-test
if (requireNamespace("pwr", quietly = TRUE)) {
# Calculate required sample size for different effect sizes
# Assuming α = 0.05, power = 0.80
effect_sizes <- c(0.2, 0.5, 0.8) # Small, medium, large (Cohen's d)
power_results <- sapply(effect_sizes, function(d) {
result <- pwr::pwr.t.test(d = d, sig.level = 0.05, power = 0.80, type = "two.sample")
ceiling(result$n) # Round up to whole number
})
power_df <- data.frame(
Effect_Size = c("Small (d = 0.2)", "Medium (d = 0.5)", "Large (d = 0.8)"),
Cohens_d = effect_sizes,
Required_N_per_group = power_results
)
power_df %>%
kable(caption = "Required Sample Size per Group (α = 0.05, Power = 0.80)") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
cat("\nKey insight: Smaller effects require MUCH larger samples to detect!\n")
}
##
## Key insight: Smaller effects require MUCH larger samples to detect!
if (requireNamespace("pwr", quietly = TRUE)) {
# Calculate power for different sample sizes
sample_sizes <- seq(10, 200, by = 10)
power_by_n <- data.frame(
N = rep(sample_sizes, 3),
Effect_Size = rep(c("Small (d=0.2)", "Medium (d=0.5)", "Large (d=0.8)"), each = length(sample_sizes)),
d = rep(c(0.2, 0.5, 0.8), each = length(sample_sizes))
) %>%
rowwise() %>%
mutate(Power = pwr::pwr.t.test(n = N, d = d, sig.level = 0.05, type = "two.sample")$power) %>%
ungroup()
ggplot(power_by_n, aes(x = N, y = Power, color = Effect_Size)) +
geom_line(linewidth = 1.5) +
geom_hline(yintercept = 0.80, linetype = "dashed", color = "gray50") +
annotate("text", x = 180, y = 0.83, label = "Target Power = 0.80", color = "gray50") +
labs(title = "Statistical Power Increases with Sample Size",
subtitle = "Larger effects are easier to detect with smaller samples",
x = "Sample Size per Group", y = "Statistical Power",
color = "Effect Size") +
scale_y_continuous(limits = c(0, 1), labels = scales::percent) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "bottom")
}

Field et al. [1, Ch. 2] argue that effect sizes solve the problems with NHST:
“The use of effect sizes strikes a balance between using arbitrary cut-off points such as p < .05 and assessing whether an effect is meaningful within the research context.”
Key advantages:

1. Standardized = comparable across studies (meta-analysis)
2. Not dependent on sample size (unlike p-values)
3. Allows objective evaluation of practical importance
4. APA recommends reporting effect sizes in all publications [5]
What is Cohen’s d? It expresses the difference between two means in standard deviation units [4].
\[d = \frac{\bar{X}_1 - \bar{X}_2}{s_{pooled}}\]
Where \(s_{pooled}\) is the pooled standard deviation.
Cohen [4] provided rough guidelines:
| Effect Size | Cohen’s d | Pearson’s r | Variance Explained |
|---|---|---|---|
| Small | 0.2 | 0.1 | 1% |
| Medium | 0.5 | 0.3 | 9% |
| Large | 0.8 | 0.5 | 25% |
Important caveat [1, Ch. 2]: These are “canned” guidelines—not absolute rules. Effect size should be evaluated within the research context. A “small” effect on mortality might be hugely important!
# Calculate Cohen's d for the transmission comparison
if (requireNamespace("effectsize", quietly = TRUE)) {
# Using effectsize package (modern approach)
d_result <- effectsize::cohens_d(mpg ~ am, data = mtcars)
print(d_result)
cat("\n=== INTERPRETATION ===\n")
cat("Cohen's d =", round(abs(d_result$Cohens_d), 2), "\n")
d_abs <- abs(d_result$Cohens_d)
if (d_abs < 0.2) {
cat("Magnitude: Negligible effect\n")
} else if (d_abs < 0.5) {
cat("Magnitude: Small effect\n")
} else if (d_abs < 0.8) {
cat("Magnitude: Medium effect\n")
} else {
cat("Magnitude: Large effect\n")
}
cat("\nThe difference in MPG between manual and automatic transmissions\n")
cat("is approximately", round(d_abs, 2), "standard deviations.\n")
}
## Cohen's d | 95% CI
## --------------------------
## -1.48 | [-2.27, -0.67]
##
## - Estimated using pooled SD.
## === INTERPRETATION ===
## Cohen's d = 1.48
## Magnitude: Large effect
##
## The difference in MPG between manual and automatic transmissions
## is approximately 1.48 standard deviations.
# Manual calculation of Cohen's d
mean_auto <- mean(automatic)
mean_manual <- mean(manual)
n_auto <- length(automatic)
n_manual <- length(manual)
# Pooled standard deviation
var_auto <- var(automatic)
var_manual <- var(manual)
pooled_var <- ((n_auto - 1) * var_auto + (n_manual - 1) * var_manual) / (n_auto + n_manual - 2)
pooled_sd <- sqrt(pooled_var)
# Cohen's d
d_manual <- (mean_manual - mean_auto) / pooled_sd
cat("=== MANUAL COHEN'S D CALCULATION ===\n")
## === MANUAL COHEN'S D CALCULATION ===
## Mean (Automatic): 17.15
## Mean (Manual): 24.39
## Pooled SD: 4.9
## Cohen's d: 1.48
Field et al. [1, Ch. 2] prefer Pearson’s r because it’s bounded between 0 (no effect) and 1 (perfect effect).
Note: r is not linear—r = .6 is NOT twice as big as r = .3!
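One quick way to see the non-linearity is to compare variance explained (r²) across values of r:

```r
# r is not on a linear scale: doubling r quadruples variance explained
r_values <- c(0.1, 0.3, 0.5, 0.6)
data.frame(r = r_values,
           variance_explained = r_values^2)
# r = .6 explains 36% of variance; r = .3 explains only 9%, a 4x difference
```

This is why the "variance explained" column in Cohen's benchmark table above grows much faster than r itself.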
# Correlation between weight and MPG
r_result <- cor.test(mtcars$wt, mtcars$mpg)
cat("=== PEARSON'S r AS EFFECT SIZE ===\n")
## === PEARSON'S r AS EFFECT SIZE ===
## r = -0.868
## p-value: 1.294e-10
## 95% CI: [ -0.934 , -0.744 ]
## Variance explained (r²): 75.3 %
r_abs <- abs(r_result$estimate)  # define magnitude before classifying it
if (r_abs < 0.1) {
cat("Magnitude: Negligible relationship\n")
} else if (r_abs < 0.3) {
cat("Magnitude: Small relationship\n")
} else if (r_abs < 0.5) {
cat("Magnitude: Medium relationship\n")
} else {
cat("Magnitude: Large relationship\n")
}
## Magnitude: Large relationship
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(size = 3, alpha = 0.7, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(title = paste("Weight vs. MPG (r =", round(r_result$estimate, 2), ")"),
subtitle = paste("r² =", round(r_result$estimate^2, 2),
"— Weight explains", round(r_result$estimate^2 * 100, 1),
"% of variance in MPG"),
x = "Weight (1000 lbs)", y = "Miles Per Gallon") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Let’s tie everything together with a comprehensive example:
# Compare earthquake magnitudes at high vs. low depth
quakes_depth <- quakes %>%
mutate(depth_group = ifelse(depth > median(depth), "Deep", "Shallow"))
# Hypothesis test
depth_test <- t.test(mag ~ depth_group, data = quakes_depth)
# Effect size
if (requireNamespace("effectsize", quietly = TRUE)) {
depth_d <- effectsize::cohens_d(mag ~ depth_group, data = quakes_depth)
}
cat("=== COMPLETE ANALYSIS: Magnitude by Depth ===\n\n")
## === COMPLETE ANALYSIS: Magnitude by Depth ===
# Descriptive statistics
quakes_depth %>%
group_by(depth_group) %>%
summarise(
n = n(),
mean_mag = mean(mag),
sd_mag = sd(mag),
se_mag = sd(mag) / sqrt(n())
) %>%
mutate(across(where(is.numeric), ~round(.x, 3))) %>%
kable(caption = "Descriptive Statistics by Depth Group") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| depth_group | n | mean_mag | sd_mag | se_mag |
|---|---|---|---|---|
| Deep | 500 | 4.523 | 0.375 | 0.017 |
| Shallow | 500 | 4.718 | 0.407 | 0.018 |
##
## === HYPOTHESIS TEST ===
## t = -7.86
## df = 991.4
## p-value = 9.963e-15
cat("95% CI for difference: [", round(depth_test$conf.int[1], 3), ",",
    round(depth_test$conf.int[2], 3), "]\n\n")
## 95% CI for difference: [ -0.243 , -0.146 ]
## === EFFECT SIZE ===
if (exists("depth_d")) {
cat("Cohen's d =", round(depth_d$Cohens_d, 3), "\n")
cat("95% CI: [", round(depth_d$CI_low, 3), ",", round(depth_d$CI_high, 3), "]\n")
}
## Cohen's d = -0.497
## 95% CI: [ -0.623 , -0.371 ]
##
## === INTERPRETATION ===
if (depth_test$p.value < 0.05) {
cat("There is a statistically significant difference in earthquake magnitude\n")
cat("between deep and shallow earthquakes.\n")
} else {
cat("There is no statistically significant difference in earthquake magnitude\n")
cat("between deep and shallow earthquakes.\n")
}
## There is a statistically significant difference in earthquake magnitude
## between deep and shallow earthquakes.
if (exists("depth_d")) {
d_abs <- abs(depth_d$Cohens_d)
if (d_abs < 0.2) {
cat("However, the effect size is negligible (d < 0.2),\n")
cat("suggesting little practical significance.\n")
} else if (d_abs < 0.5) {
cat("The effect size is small (0.2 ≤ d < 0.5).\n")
} else if (d_abs < 0.8) {
cat("The effect size is medium (0.5 ≤ d < 0.8).\n")
} else {
cat("The effect size is large (d ≥ 0.8).\n")
}
}
## The effect size is small (0.2 ≤ d < 0.5).
When we conduct multiple hypothesis tests, the probability of making at least one Type I error increases dramatically [1, Ch. 2].
Example: If we conduct 20 tests at α = 0.05, the probability of at least one false positive is:
\[P(\text{at least one Type I error}) = 1 - (1 - 0.05)^{20} = 0.64\]
That’s a 64% chance of at least one false positive!
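This inflation is easy to verify numerically; a quick sketch (the `alpha` and `m` names are illustrative):

```r
# Familywise error rate across m independent tests at alpha = .05
alpha <- 0.05
m <- 20
fwer <- 1 - (1 - alpha)^m
round(fwer, 2)  # 0.64 -- matches the equation above

# Bonferroni-corrected alpha (alpha / m) holds the familywise rate near .05
fwer_bonf <- 1 - (1 - alpha / m)^m
round(fwer_bonf, 3)  # 0.049
```

This is exactly why the corrections demonstrated below exist: they trade some power for control of the familywise (or false discovery) rate.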
# Demonstrate multiple comparison corrections
p_values <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216)
corrections_df <- data.frame(
Original_p = p_values,
Bonferroni = p.adjust(p_values, method = "bonferroni"),
Holm = p.adjust(p_values, method = "holm"),
BH_FDR = p.adjust(p_values, method = "BH")
)
corrections_df <- corrections_df %>%
mutate(across(everything(), ~round(.x, 4)))
# Add significance indicators
corrections_df <- corrections_df %>%
mutate(
Original_sig = ifelse(Original_p < 0.05, "*", ""),
Bonferroni_sig = ifelse(Bonferroni < 0.05, "*", ""),
Holm_sig = ifelse(Holm < 0.05, "*", ""),
BH_sig = ifelse(BH_FDR < 0.05, "*", "")
)
corrections_df %>%
kable(caption = "Multiple Testing Corrections (* = significant at α = 0.05)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Original_p | Bonferroni | Holm | BH_FDR | Original_sig | Bonferroni_sig | Holm_sig | BH_sig |
|---|---|---|---|---|---|---|---|
| 0.001 | 0.01 | 0.010 | 0.0100 | * | * | * | * |
| 0.008 | 0.08 | 0.072 | 0.0400 | * |  |  | * |
| 0.039 | 0.39 | 0.312 | 0.0840 | * |  |  |  |
| 0.041 | 0.41 | 0.312 | 0.0840 | * |  |  |  |
| 0.042 | 0.42 | 0.312 | 0.0840 | * |  |  |  |
| 0.060 | 0.60 | 0.312 | 0.1000 |  |  |  |  |
| 0.074 | 0.74 | 0.312 | 0.1057 |  |  |  |  |
| 0.205 | 1.00 | 0.615 | 0.2160 |  |  |  |  |
| 0.212 | 1.00 | 0.615 | 0.2160 |  |  |  |  |
| 0.216 | 1.00 | 0.615 | 0.2160 |  |  |  |  |
##
## === CORRECTION METHODS ===
## Bonferroni: Most conservative - divide α by number of tests
## Holm: Step-down procedure - more powerful than Bonferroni
## Benjamini-Hochberg (FDR): Controls false discovery rate - most lenient
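The verbal descriptions above can be checked against `p.adjust()` by computing Bonferroni and Holm by hand; a minimal sketch using the same p-values:

```r
# Same p-values as in the table above
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216)
m <- length(p)

# Bonferroni: multiply every p by the number of tests (capped at 1)
all.equal(p.adjust(p, method = "bonferroni"), pmin(p * m, 1))  # TRUE

# Holm: multiply the i-th smallest p by (m - i + 1), then enforce monotonicity
o <- order(p)
holm_manual <- pmin(cummax((m - seq_len(m) + 1) * p[o]), 1)[order(o)]
all.equal(p.adjust(p, method = "holm"), holm_manual)  # TRUE
```

Seeing the arithmetic makes the ordering clear: Holm's step-down multipliers shrink as p-values grow, which is why it is never less powerful than Bonferroni.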
| Concept | Definition | Key_Formula |
|---|---|---|
| Model + Error | Outcome = Model + Error — the fundamental equation | Xᵢ = X̄ + εᵢ |
| Sum of Squares (SS) | Sum of squared deviations from the mean | Σ(Xᵢ - X̄)² |
| Variance | Average squared deviation (SS / df) | SS / (N-1) |
| Standard Deviation | Square root of variance — in original units | √Variance |
| Standard Error | SD of sample means — measures precision | s / √N |
| Confidence Interval | Range likely containing the population parameter | X̄ ± (z × SE) |
| p-value | Probability of data if H₀ is true | P(data \| H₀) |
| Effect Size | Standardized magnitude of effect | d = (M₁-M₂)/SD |
| Power | Probability of detecting a real effect | 1 - β |
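Each formula in the table can be verified on real data; a sketch using `mtcars$mpg` (variable names are illustrative, and the CI here uses the t distribution rather than the z approximation, the usual choice for sample data):

```r
x <- mtcars$mpg
n <- length(x)
ss <- sum((x - mean(x))^2)   # Sum of Squares: total squared deviation
variance <- ss / (n - 1)     # SS / df
sd_x <- sqrt(variance)       # back to the original units
se <- sd_x / sqrt(n)         # precision of the sample mean
ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se  # t-based 95% CI

all.equal(variance, var(x))  # TRUE -- matches R's built-in
all.equal(sd_x, sd(x))       # TRUE
round(ci, 2)
```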
Based on Field et al. [1]:

1. Everything is Model + Error — this is the foundation of statistics
2. Statistical significance ≠ practical importance — always report effect sizes
3. The null is never exactly true — Cohen [3] emphasized this repeatedly
4. p < .05 is arbitrary — Fisher [2] chose it for convenience, not principle
5. Sample size matters for power, not just for significance
6. Use confidence intervals — they tell you more than p-values alone
7. Effect sizes provide context — Cohen’s d and r help assess practical importance
When reporting results, include [1, Ch. 9]:
Example APA-style report:

> Manual transmission cars (M = 24.39, SD = 6.17) had significantly higher fuel efficiency than automatic transmission cars (M = 17.15, SD = 3.83), t(18.33) = 3.77, p = .001, 95% CI [3.21, 11.27], d = 1.41 (large effect).
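The numbers in that report can be reproduced from `mtcars`, where `am` codes 0 = automatic and 1 = manual; note that the sign of t depends on which group R puts first:

```r
# Welch two-sample t-test behind the APA-style report above
am_test <- t.test(mpg ~ am, data = mtcars)
round(am_test$estimate, 2)    # group means: automatic 17.15, manual 24.39
round(am_test$statistic, 2)   # t = -3.77 (automatic minus manual)
round(am_test$parameter, 2)   # Welch df = 18.33
round(am_test$p.value, 3)     # p = 0.001
```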
Using the following data: 5, 8, 12, 15, 20
Calculate a 95% confidence interval for the mean of `mtcars$hp`.
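One way to approach the `mtcars$hp` exercise, cross-checked against `t.test()` (a solution sketch, not the only valid approach):

```r
hp <- mtcars$hp
n <- length(hp)
se <- sd(hp) / sqrt(n)                                  # standard error of the mean
ci <- mean(hp) + c(-1, 1) * qt(0.975, df = n - 1) * se  # t-based 95% CI
round(ci, 2)

# t.test() computes the same interval
all.equal(ci, as.numeric(t.test(hp)$conf.int))  # TRUE
```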
Confidence Interval (CI): A range of values within which the population parameter is likely to fall [1, Ch. 2].
Cohen’s d: A standardized effect size measuring the difference between means in SD units [4].
Degrees of Freedom (df): The number of values free to vary in a calculation [1, Ch. 2].
Effect Size: A standardized measure of the magnitude of an effect, independent of sample size [1, Ch. 2].
Null Hypothesis (H₀): The hypothesis that no effect exists; any observed difference is due to chance [1, Ch. 2].
p-value: The probability of obtaining the observed data (or more extreme) if the null hypothesis is true [1, Ch. 2].
Power: The probability of correctly detecting a real effect (1 - β) [4].
Standard Deviation (SD): A measure of the spread of scores around the mean, in original units [1, Ch. 2].
Standard Error (SE): The standard deviation of sample means; measures precision of the sample mean estimate [1, Ch. 2].
Sum of Squares (SS): The sum of squared deviations from the mean; measures total error [1, Ch. 2].
Type I Error (α): Rejecting H₀ when it is true (false positive) [1, Ch. 2].
Type II Error (β): Failing to reject H₀ when it is false (false negative) [1, Ch. 2].
Variance: The average squared deviation from the mean (SS/df) [1, Ch. 2].
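Several glossary terms (Power, β, Type II error) can be made concrete with base R's `power.t.test()`; a minimal sketch:

```r
# Power of a two-sample t-test: n = 50 per group, medium effect (delta/sd = 0.5)
res <- power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05)
round(res$power, 2)   # probability of detecting the effect, i.e. 1 - beta

# Sample size per group needed for 80% power at the same effect size
ceiling(power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)$n)  # 64
```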
[1] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London: SAGE Publications, 2012.
[2] R. A. Fisher, Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd, 1925.
[3] J. Cohen, “Things I have learned (so far),” American Psychologist, vol. 45, no. 12, pp. 1304–1312, 1990.
[4] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates, 1988.
[5] American Psychological Association, Publication Manual of the American Psychological Association, 7th ed. Washington, DC: APA, 2020.
[6] J. Cohen, “The earth is round (p < .05),” American Psychologist, vol. 49, no. 12, pp. 997–1003, 1994.
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MOTE_1.0.2 pwr_1.3-0 effectsize_1.0.1 moments_0.14.1
## [5] kableExtra_1.4.0 knitr_1.50 psych_2.5.6 lubridate_1.9.4
## [9] forcats_1.0.1 stringr_1.6.0 dplyr_1.1.4 purrr_1.2.0
## [13] readr_2.1.6 tidyr_1.3.1 tibble_3.3.0 ggplot2_4.0.1
## [17] tidyverse_2.0.0 seedhash_0.1.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.2 farver_2.1.2 S7_0.2.1
## [5] fastmap_1.2.0 reshape_0.8.10 bayestestR_0.17.0 digest_0.6.37
## [9] estimability_1.5.1 timechange_0.3.0 lifecycle_1.0.4 magrittr_2.0.4
## [13] compiler_4.5.2 rlang_1.1.6 sass_0.4.10 tools_4.5.2
## [17] yaml_2.3.10 labeling_0.4.3 mnormt_2.1.1 plyr_1.8.9
## [21] xml2_1.5.1 RColorBrewer_1.1-3 abind_1.4-8 withr_3.0.2
## [25] grid_4.5.2 datawizard_1.3.0 xtable_1.8-4 emmeans_2.0.0
## [29] scales_1.4.0 MASS_7.3-65 insight_1.4.3 cli_3.6.5
## [33] mvtnorm_1.3-3 rmarkdown_2.30 reformulas_0.4.2 generics_0.1.4
## [37] rstudioapi_0.17.1 reshape2_1.4.5 tzdb_0.5.0 parameters_0.28.3
## [41] minqa_1.2.8 cachem_1.1.0 splines_4.5.2 parallel_4.5.2
## [45] MBESS_4.9.41 vctrs_0.6.5 boot_1.3-32 Matrix_1.7-4
## [49] jsonlite_2.0.0 carData_3.0-5 car_3.1-3 hms_1.1.4
## [53] Formula_1.2-5 systemfonts_1.3.1 jquerylib_0.1.4 glue_1.8.0
## [57] nloptr_2.2.1 stringi_1.8.7 gtable_0.3.6 lme4_1.1-38
## [61] pillar_1.11.1 htmltools_0.5.8.1 R6_2.6.1 textshaping_1.0.4
## [65] Rdpack_2.6.4 evaluate_1.0.5 lattice_0.22-7 ez_4.4-0
## [69] rbibutils_2.4 bslib_0.9.0 Rcpp_1.1.0 svglite_2.2.2
## [73] coda_0.19-4.1 nlme_3.1-168 mgcv_1.9-3 xfun_0.54
## [77] pkgconfig_2.0.3
End of Week 04: R for Data Analytics Tutorial
ANLY 500 - Analytics I
Harrisburg University