This Week 03 tutorial builds on Week 02 and focuses on practical data analytics workflows in R.

Reference: We will draw conceptual guidance from Field, Miles, & Field (2012), *Discovering Statistics Using R*.

Upgrade note: This hands-on tutorial extends the Week 03 lecture slides in 03_introDA_2.rmd (frequency distributions, skew, kurtosis, central tendency, dispersion, z-scores, correlation) with code-first workflows and guidance aligned with Field et al. (2012).

Package reminder: run install.packages("X") once per machine, then library(X) in each session. By the end of this tutorial, you will be able to apply each of the workflows below.
``` r
# Install if needed
# install.packages(c("tidyverse", "readr", "readxl", "janitor", "skimr", "ggplot2", "dplyr", "tidyr", "kableExtra"))
library(tidyverse) # loads readr, dplyr, tidyr, ggplot2
library(readr)
library(readxl)
library(dplyr)
library(tidyr)
library(janitor) # clean_names, tabyl
library(skimr)   # quick EDA summaries
library(ggplot2)
library(knitr)
library(kableExtra)

## Optional packages for assumptions, effects, and reporting
# install.packages(c("car", "effsize", "broom", "ggpubr"))
if (requireNamespace("car", quietly = TRUE)) library(car)
if (requireNamespace("effsize", quietly = TRUE)) library(effsize)
if (requireNamespace("broom", quietly = TRUE)) library(broom)
if (requireNamespace("ggpubr", quietly = TRUE)) library(ggpubr)
```
``` r
# Seeds for reproducibility; the vector reproduces the first five values
# from the generator output printed below
seeds <- c(200897381, 888044585, -419672191, -725164097, 726789518)
cat("\n=== REPRODUCIBLE SEED INFORMATION ===")
```

## 
## === REPRODUCIBLE SEED INFORMATION ===
## 
## Generator Name: Week03 Reproducible Seeds - ANLY500
## 
## MD5 Hash: 1a420df35934daa2fe5722fb9b541eaa
## 
## Available Seeds: 200897381, 888044585, -419672191, -725164097, 726789518 ...
If you prefer automatic installs, define this helper and call ensure_pkg("packagename") in place of library(packagename).
``` r
ensure_pkg <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}
```
``` r
# Example usage:
# ensure_pkg("tidyverse"); ensure_pkg("janitor")
```

``` r
# CSV
# data <- read_csv("data/sales.csv")

# Excel
# data_xlsx <- read_excel("data/sales.xlsx", sheet = 1)
```
``` r
# Example: create a small demo dataset
sales <- tibble(
  id = 1:10,
  product = c("A","B","A","A","B","C","C","A","B","C"),
  region = c("NE","NE","SE","MW","SE","MW","NE","SE","MW","NE"),
  units = c(5, 7, 3, 6, 10, 2, 1, 8, 4, 6),
  price = c(10, 10, 12, 9, 10, 11, 10, 12, 9, 11)
)
sales %>% head() %>% kable() %>% kable_styling(full_width = FALSE)
```

| id | product | region | units | price |
|---|---|---|---|---|
| 1 | A | NE | 5 | 10 |
| 2 | B | NE | 7 | 10 |
| 3 | A | SE | 3 | 12 |
| 4 | A | MW | 6 | 9 |
| 5 | B | SE | 10 | 10 |
| 6 | C | MW | 2 | 11 |
``` r
names(sales)
```

## [1] "id" "product" "region" "units" "price"
Field et al. (2012, ch. 4) emphasize that the mechanism of missingness determines valid handling strategies. The three classical mechanisms:

- MCAR (Missing Completely At Random): missingness is unrelated to the data, observed or unobserved.
- MAR (Missing At Random): missingness depends only on other observed variables.
- MNAR (Missing Not At Random): missingness depends on the unobserved value itself.
Quick decision guide (intro context):
- If MCAR and the percentage missing is small (< 5%): listwise deletion is acceptable.
- If MAR: prefer multiple imputation (e.g., mice) or model-based methods.
- If MNAR is suspected: document the potential bias; consider collecting auxiliary data or performing a sensitivity analysis.
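Quantifying the missing percentage per column is the first step in applying this guide; a minimal base-R sketch (the data frame and its values are hypothetical):

``` r
# Hypothetical data frame with scattered NAs
df <- data.frame(
  units = c(5, NA, 3, 6, NA, 2),
  price = c(10, 10, 12, NA, 10, 11)
)
# Percent missing per column: 100 * proportion of NA values
pct_missing <- sapply(df, function(col) 100 * mean(is.na(col)))
round(pct_missing, 1) # units ~33.3, price ~16.7: well above the ~5% rule of thumb
```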
Why mean imputation is discouraged (Field et al., 2012): it shrinks variance, distorts correlations, and can bias test statistics. Use it here only to illustrate differences in downstream summaries, not for real analyses.
Below we artificially inject NA values and compare approaches. We also calculate missingness percentages to inform strategy selection.
For rigorous projects later in the course, explore mice (multiple imputation), naniar (visualizing missingness), and adding predictors of missingness to reduce MAR → MNAR risk.
``` r
set.seed(seeds[2])
sales_na <- sales_tidy %>%
  mutate(units = ifelse(row_number() %% 4 == 0, NA, units))

# Quick scan
skimr::skim(sales_na)
```

| Name | sales_na |
| Number of rows | 10 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| product | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
| region | 0 | 1 | 2 | 2 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.0 | 5.50 | 3.03 | 1 | 3.25 | 5.5 | 7.75 | 10 | ▇▇▇▇▇ |
| units | 2 | 0.8 | 4.75 | 2.92 | 1 | 2.75 | 4.5 | 6.25 | 10 | ▇▇▇▃▃ |
| price | 0 | 1.0 | 10.40 | 1.07 | 9 | 10.00 | 10.0 | 11.00 | 12 | ▃▇▁▃▃ |
``` r
# Simple approach: remove rows with NA in units
sales_complete <- sales_na %>% drop_na(units)

# Alternative: mean impute (demonstration only; not recommended for inference)
sales_mean_imp <- sales_na %>%
  mutate(units = ifelse(is.na(units), mean(units, na.rm = TRUE), units))

orig_mean <- mean(sales_tidy$units)
complete_mean <- mean(sales_complete$units)
imputed_mean <- mean(sales_mean_imp$units)
data.frame(
  scenario = c("Original", "Listwise Complete", "Mean Imputed"),
  mean_units = c(orig_mean, complete_mean, imputed_mean)
)
```

Notice that mean imputation simply reproduces the complete-case mean while silently shrinking the variance; when values are not MCAR, both of these summaries can be biased relative to the original mean.
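The variance shrinkage is easy to demonstrate directly; a base-R sketch on a hypothetical vector:

``` r
x <- c(5, 7, 3, NA, 10, 2, NA, 8)                       # toy data with two NAs
complete <- x[!is.na(x)]                                # listwise deletion
imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)   # mean imputation
c(mean_complete = mean(complete), mean_imputed = mean(imputed)) # identical means
c(sd_complete = sd(complete), sd_imputed = sd(imputed))         # imputed SD is smaller
```

Every imputed value sits exactly at the mean, so the spread of the imputed vector understates the true variability.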
---
# Part 2: dplyr Verbs and Grouped Summaries (tidy pipelines)
## 2.1 Filter, Select, Arrange
These three verbs are the foundation of subsetting and organizing data frames in tidy workflows (Field et al., 2012, ch. 3–4 conceptual data screening). Conceptually: `filter()` keeps rows that satisfy logical predicates (row-wise reduction), `select()` narrows the set of columns (variable-wise focus), and `arrange()` reorders rows to surface patterns (e.g., largest revenues). Always confirm results with small previews (`head()`) to avoid silent mistakes.
``` r
sales_NE <- sales_complete %>% filter(region == "NE")
sales_small <- sales_complete %>% select(id, product, units)
sales_sorted <- sales_complete %>% arrange(desc(units))
list(NE = sales_NE, small = sales_small, sorted = sales_sorted)
```
## $NE
## # A tibble: 4 × 5
## id product region units price
## <int> <chr> <chr> <dbl> <dbl>
## 1 1 A NE 5 10
## 2 2 B NE 7 10
## 3 7 C NE 1 10
## 4 10 C NE 6 11
##
## $small
## # A tibble: 8 × 3
## id product units
## <int> <chr> <dbl>
## 1 1 A 5
## 2 2 B 7
## 3 3 A 3
## 4 5 B 10
## 5 6 C 2
## 6 7 C 1
## 7 9 B 4
## 8 10 C 6
##
## $sorted
## # A tibble: 8 × 5
## id product region units price
## <int> <chr> <chr> <dbl> <dbl>
## 1 5 B SE 10 10
## 2 2 B NE 7 10
## 3 10 C NE 6 11
## 4 1 A NE 5 10
## 5 9 B MW 4 9
## 6 3 A SE 3 12
## 7 6 C MW 2 11
## 8 7 C NE 1 10
Use mutate() to engineer new features or transform existing variables in place, a key step before summaries or modeling (Field et al., 2012, ch. 3 feature creation). case_when() enables vectorized conditional recoding without iterative loops, producing clean categorical buckets. Thoughtful feature engineering often improves interpretability and downstream test power.
``` r
sales_feat <- sales_complete %>%
  mutate(
    revenue = units * price,
    bucket = case_when(
      units >= 8 ~ "High",
      units >= 4 ~ "Medium",
      TRUE ~ "Low"
    )
  )
table(sales_feat$bucket)
```

## 
## High Low Medium
## 1 3 4
group_by() establishes a grouping structure so that summarise() calculates statistics per group rather than globally. This mirrors the descriptive phase emphasized by Field et al. (2012, ch. 3–4) prior to formal tests: understanding subgroup counts, central tendency, and dispersion helps anticipate assumption issues (e.g., unequal variances) and guides hypothesis formulation.
``` r
summary_tbl <- sales_feat %>%
  group_by(region, product) %>%
  summarise(
    n = n(),
    avg_units = mean(units),
    total_rev = sum(revenue),
    .groups = "drop"
  )
summary_tbl %>%
  arrange(desc(total_rev)) %>%
  kable(caption = "Grouped Summary by Region and Product") %>%
  kable_styling(full_width = FALSE)
```

| region | product | n | avg_units | total_rev |
|---|---|---|---|---|
| SE | B | 1 | 10.0 | 100 |
| NE | C | 2 | 3.5 | 76 |
| NE | B | 1 | 7.0 | 70 |
| NE | A | 1 | 5.0 | 50 |
| MW | B | 1 | 4.0 | 36 |
| SE | A | 1 | 3.0 | 36 |
| MW | C | 1 | 2.0 | 22 |
Joins merge relational tables by matching key columns. A left_join() preserves all rows of the left (primary) table, appending columns from the right when keys match; unmatched keys yield NA for the new columns. Conversely, anti_join() returns rows in the left table with no match, which is useful for data integrity checks (e.g., products missing metadata). Thoughtful joining prevents inadvertent row duplication that can bias summaries (Field et al., 2012, ch. 3 data preparation principles).
``` r
products <- tibble(
  product = c("A","B","C","D"),
  category = c("Hardware","Hardware","Software","Hardware")
)
joined <- sales_feat %>% left_join(products, by = "product")
anti <- sales_feat %>% anti_join(products, by = "product") # rows with product not in products
list(joined_head = head(joined), not_matched = anti)
```

## $joined_head
## # A tibble: 6 × 8
## id product region units price revenue bucket category
## <int> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 A NE 5 10 50 Medium Hardware
## 2 2 B NE 7 10 70 Medium Hardware
## 3 3 A SE 3 12 36 Low Hardware
## 4 5 B SE 10 10 100 High Hardware
## 5 6 C MW 2 11 22 Low Software
## 6 7 C NE 1 10 10 Low Software
##
## $not_matched
## # A tibble: 0 × 7
## # ℹ 7 variables: id <int>, product <chr>, region <chr>, units <dbl>,
## # price <dbl>, revenue <dbl>, bucket <chr>
Reshaping converts data between “wide” (many measurement columns) and “long” (one column of values plus identifiers). Long format simplifies grouped operations and plotting (single value column with factor indicating measure), while wide format can be convenient for certain modeling interfaces or human inspection. Field et al. (2012, ch. 3–4) stress tidy representations to reduce manual errors—ensure each row is a single observational unit after reshaping.
``` r
# Wide -> Long
wide <- tibble(id = 1:3, q1 = c(3,4,5), q2 = c(2,4,5))
long <- wide %>% pivot_longer(cols = starts_with("q"), names_to = "question", values_to = "score")

# Long -> Wide
restored <- long %>% pivot_wider(names_from = question, values_from = score)
list(long = long, restored = restored)
```

## $long
## # A tibble: 6 × 3
## id question score
## <int> <chr> <dbl>
## 1 1 q1 3
## 2 1 q2 2
## 3 2 q1 4
## 4 2 q2 4
## 5 3 q1 5
## 6 3 q2 5
##
## $restored
## # A tibble: 3 × 3
## id q1 q2
## <int> <dbl> <dbl>
## 1 1 3 2
## 2 2 4 4
## 3 3 5 5
What is a univariate distribution? “Univariate” means looking at one variable at a time (uni = one). Before you compare groups or test relationships, you need to understand each variable individually. Think of it as getting to know each person before understanding how they interact.
Why does this matter? Looking at a single variable’s distribution tells you: - Shape: Is it bell-curved (normal), lopsided (skewed), or flat? - Center: Where do most values cluster? (mean, median) - Spread: Are values tightly packed or widely scattered? (standard deviation, range) - Outliers: Are there extreme or unusual values?
How to read a histogram: Each bar shows how many observations fall in that range. A tall bar = many cases with that value; a short bar = few cases. The overall shape matters—symmetric distributions allow different statistical tools than skewed ones.
Density curve (the red line): This smooth curve estimates the underlying probability distribution. It helps you see whether data is unimodal (one peak), bimodal (two peaks), or something else. Field et al. (2012, ch. 2–3) emphasize that recognizing severe skew or heavy tails early guides whether you should use means vs. medians, or consider data transformations.
Beginner tip: Always plot your data first! Numbers alone (mean = 5.2) don’t tell you if you have outliers or weird patterns.
``` r
ggplot(sales_feat, aes(units)) +
  # plot the histogram on the density scale so the density curve overlays correctly
  geom_histogram(aes(y = after_stat(density)), bins = 6, fill = "steelblue", color = "white", alpha = 0.8) +
  geom_density(color = "red", linewidth = 1) +
  labs(title = "Distribution of Units", x = "Units", y = "Density") +
  theme_minimal()
```

What is a bivariate relationship? “Bivariate” means examining two variables together (bi = two) to see if they’re related. For example: Does price affect revenue? Do taller people weigh more?
Why scatter plots? Each point represents one observation (one row of data). The x-axis shows one variable, the y-axis shows another. Patterns in the cloud of points reveal relationships:

- Positive slope: as X increases, Y tends to increase (upward trend)
- Negative slope: as X increases, Y tends to decrease (downward trend)
- No pattern: X and Y aren’t linearly related (random scatter)
The fitted line: This is a “best-fit” straight line through the points (using linear regression). If points cluster tightly around it, the relationship is strong and linear. If points are widely scattered, the relationship is weak or nonlinear.
What is heteroscedasticity? (Also called “unequal spread”) This fancy term means the scatter/variability changes across the range. Example: Low prices might have consistent revenue (tight cluster), but high prices show wild variability (wide scatter). This matters because some statistical tests assume constant variability.
Color by group: Adding color (e.g., different regions) reveals whether the relationship differs by group—maybe price predicts revenue well in the Northeast but not in the Southeast. Field et al. (2012, ch. 6) calls this an interaction effect: the relationship depends on which group you’re in.
Beginner tip: Always plot before calculating correlation! A curved relationship or outliers can fool correlation numbers.
``` r
ggplot(sales_feat, aes(price, revenue, color = region)) +
  geom_point(size = 3, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(title = "Price vs. Revenue by Region") +
  theme_minimal()
```

What are categorical comparisons? Here you’re comparing a numeric outcome (e.g., revenue, test scores) across categories (e.g., product types, regions, treatment groups). Question format: “Do products A, B, and C generate different average revenues?”
Why bar charts with error bars?

- Bar height: shows the average (mean) for each group
- Error bars: show variability/uncertainty. Short bars = consistent data; long bars = high variability
- Overlap: if error bars overlap a lot, group means might not be truly different; differences could be due to random chance
What is standard error (SE)? SE measures how much your sample mean would bounce around if you collected multiple samples. Smaller SE = more confident your sample mean is close to the true population mean. It shrinks with larger sample sizes.
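The definition above reduces to sd / sqrt(n); a quick base-R check (the scores are illustrative):

``` r
scores <- c(62, 65, 67, 70, 71, 74, 76, 79)    # illustrative sample
se <- sd(scores) / sqrt(length(scores))        # standard error of the mean
big <- rep(scores, 4)                          # four times the observations
se_big <- sd(big) / sqrt(length(big))          # roughly half the original SE
c(se = se, se_big = se_big)
```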
Preliminary hypotheses: This visual exploration (Field et al., 2012, ch. 3, 9) helps you form educated guesses before formal testing: “It looks like Product A earns more than B—let’s test if that’s statistically significant.”
Important caveat: Visual overlap is a hint, not proof. You need formal tests (like t-tests or ANOVA) to account for sample size and variability properly. Two groups might look different visually but not reach statistical significance, or vice versa.
Beginner tip: If bars are very different heights AND error bars don’t overlap, you likely have a real difference. If bars are similar or error bars overlap heavily, differences might be noise.
``` r
ggplot(sales_feat, aes(product, revenue, fill = region)) +
  stat_summary(fun = mean, geom = "col", position = position_dodge()) +
  stat_summary(fun.data = mean_se, geom = "errorbar", position = position_dodge(.9), width = .2) +
  labs(title = "Average Revenue by Product and Region") +
  theme_minimal()
```

Why do shape metrics matter? Many statistical tests (t-tests, ANOVA, regression) assume data follow a roughly “normal” (bell-shaped) distribution. If your data is severely skewed or has extreme outliers (“heavy tails”), those tests may give misleading results. Field et al. (2012) recommend checking shape before running formal tests—if assumptions are violated, use robust methods or transformations.
Plain English: Skewness measures asymmetry—whether your distribution leans to one side.
Rule of thumb (Field et al., 2012): - |skewness| < 0.5: approximately symmetric (safe for most tests) - |skewness| 0.5–1.0: moderate skew (check robustness) - |skewness| > 1.0: highly skewed (consider medians/IQR, transformations, or nonparametric tests)
Plain English: Kurtosis measures tail heaviness—how much probability sits in the extreme tails vs. the center/shoulders of the distribution.
Why it matters: High kurtosis means outliers are more common—this inflates variance and can distort means. Tests assuming normality may be unreliable.
Excess kurtosis: Most software reports “excess” = raw kurtosis − 3, so normal distribution has excess = 0. Easier to interpret: positive = heavy tails, negative = light tails.
Beginner takeaway: If skewness or kurtosis is large, your data isn’t bell-shaped. Use medians instead of means, or try log/sqrt transformations to reduce skew before testing.
``` r
if (!requireNamespace("moments", quietly = TRUE)) {
  # install.packages("moments")
}
if (!requireNamespace("psych", quietly = TRUE)) {
  # install.packages("psych")
}
library(moments)
library(psych)
units_vec <- sales_feat$units
data.frame(
  mean = mean(units_vec),
  median = median(units_vec),
  skewness = moments::skewness(units_vec),
  kurtosis = moments::kurtosis(units_vec)
)
```

Interpretation cues (Field et al., 2012):

- |skew| > 1 ≈ highly skewed; 0.5–1 ≈ moderate; < 0.5 ≈ approximately symmetric.
- Excess kurtosis (kurtosis − 3): > 0 leptokurtic (heavy tails); < 0 platykurtic (light tails).
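Note that moments::kurtosis() reports raw kurtosis, so you subtract 3 yourself for the excess version. A self-contained base-R check on simulated normal data (the helper name raw_kurtosis is ours, not from any package):

``` r
# Raw (Pearson) kurtosis: mean fourth moment divided by squared variance
raw_kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}
set.seed(1)
x <- rnorm(1e5)                  # large sample from a normal distribution
excess <- raw_kurtosis(x) - 3    # excess kurtosis: approximately 0 for normal data
round(excess, 2)
```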
What is a z-score? A z-score (also called a “standard score”) tells you how many standard deviations a value is away from the mean. It’s a way to standardize different measurements onto the same scale.
The formula: z = (value − mean) / SD
Real-world analogy: Imagine comparing student performance on two exams:

- Math test: mean = 70, SD = 10; you scored 85
- English test: mean = 80, SD = 5; you scored 90
Which score is better relative to classmates? Raw scores (85 vs 90) don’t tell you—they’re on different scales.
Convert to z-scores:

- Math: z = (85 − 70) / 10 = 1.5 (1.5 SDs above average)
- English: z = (90 − 80) / 5 = 2.0 (2 SDs above average)
You performed better in English relative to peers, even though the raw score gap was smaller!
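The exam comparison above, as code:

``` r
z_math <- (85 - 70) / 10    # 1.5 SDs above the class mean
z_english <- (90 - 80) / 5  # 2.0 SDs above the class mean
c(math = z_math, english = z_english)  # English wins relative to peers
```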
Why standardize?

1. Compare apples to oranges: variables measured in different units (dollars, kilograms, test scores) become comparable.
2. Spot outliers: values with |z| > 3 are extremely rare (about 99.7% of data falls within ±3 SD in a normal distribution). A z-score of 4 is highly unusual: a possible outlier or data error.
3. Prepare for modeling: many machine learning algorithms perform better when features are standardized.

Key properties after standardization:

- Mean of z-scores = 0
- SD of z-scores = 1
- Shape unchanged (skewed data stays skewed)
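These properties are easy to verify with R’s scale() (the vector values are illustrative):

``` r
x <- c(5, 7, 3, 10, 2, 1, 4, 6)  # illustrative raw values
z <- as.numeric(scale(x))        # (x - mean(x)) / sd(x)
mean(z)                          # ~0 (up to floating-point rounding)
sd(z)                            # 1
```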
Limitations (Field et al., 2012): Z-scores assume the mean and SD are meaningful summaries. If your data is heavily skewed or has extreme outliers, the mean/SD are distorted—z-scores won’t help much. Consider robust alternatives like median absolute deviation or transformations first.
``` r
# Summarize the z-scores of units: mean 0 and SD 1 by construction
summary(as.numeric(scale(units_vec)), digits = 6)
```

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.28624 -0.68599 -0.08575 0.00000 0.51450 1.80074
Interpreting z-scores:

- z = 0: exactly at the mean
- z = 1: one SD above the mean (better than ~84% of observations in normal data)
- z = −1: one SD below the mean
- z > 3 or z < −3: extreme value; a potential outlier (Field et al., 2012)
Beginner tip: After standardizing, check the summary—mean should be very close to 0, SD should be 1. If not, something went wrong!
What is correlation? Correlation measures the strength and direction of the linear relationship between two numeric variables. It answers: “When X goes up, does Y tend to go up (positive), go down (negative), or stay random (no correlation)?”
The correlation coefficient (r): A number between −1 and +1:

- r = +1: perfect positive relationship (all points on an upward line)
- r = +0.7: strong positive (as X increases, Y tends to increase)
- r = 0: no linear relationship (cloud of random points)
- r = −0.7: strong negative (as X increases, Y tends to decrease)
- r = −1: perfect negative relationship (all points on a downward line)

Rough interpretation guide:

- |r| < 0.3: weak correlation
- |r| 0.3–0.7: moderate correlation
- |r| > 0.7: strong correlation

(Context matters! In social sciences, r = 0.5 is often considered strong; in physics, you might expect r > 0.9.)
What it is: The standard correlation. It measures linear association assuming both variables are approximately normally distributed (Field et al., 2012, ch. 6).
When to use it:

- Both variables are continuous (not categorical)
- The relationship is linear (scatter plot shows a straight-line pattern)
- No severe outliers
- Roughly normal distributions (bell-shaped)
How it works: Pearson r uses the actual values of X and Y. It’s sensitive to outliers—one extreme point can inflate or deflate r dramatically.
Example: Height and weight usually have r ≈ 0.7–0.8 (strong positive)—taller people tend to weigh more.
What it is: A rank-based correlation. Instead of using raw values, it ranks the data (1st smallest, 2nd smallest, etc.) and correlates the ranks.
When to use it (Field et al., 2012):

- Data is ordinal (rankings, Likert scales: “strongly disagree” to “strongly agree”)
- Non-normal distributions or outliers are present
- The relationship is monotonic but not necessarily linear (e.g., curved but consistently increasing)
Advantage: Robust to outliers and skew. One extreme value won’t distort the rank much.
Example: Education level (high school, bachelor’s, master’s, PhD) vs. income. The relationship may not be perfectly linear (PhD income varies widely), but it’s monotonic (higher education → higher income on average). Spearman handles this better than Pearson.
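The monotonic-but-nonlinear case is easy to simulate; a base-R sketch:

``` r
x <- 1:20
y <- exp(x / 3)                  # monotonic but strongly curved
cor(x, y, method = "pearson")    # noticeably below 1: penalized for nonlinearity
cor(x, y, method = "spearman")   # 1 (to numerical precision): ranks agree perfectly
```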
Correlation ≠ Causation: r = 0.9 doesn’t mean X causes Y. Both could be driven by a third variable (ice cream sales and drownings both increase in summer—doesn’t mean ice cream causes drowning).
Nonlinear patterns: Pearson r can be near zero even with a strong curved relationship (e.g., U-shaped or exponential). Always plot your data! A scatter plot reveals patterns numbers can’t (Field et al., 2012, ch. 6).
Beginner tip: If your scatter plot looks like a random blob, correlation won’t be useful—there’s no systematic relationship to quantify.
``` r
num_df <- sales_feat %>% select(units, price, revenue)
cor(num_df, use = "pairwise.complete.obs", method = "pearson")
```

## units price revenue
## units 1.0000000 -0.2273142 0.9933054
## price -0.2273142 1.0000000 -0.1193118
## revenue 0.9933054 -0.1193118 1.0000000

``` r
# Parametric test with confidence interval for a single pair
cor.test(num_df$price, num_df$revenue, method = "pearson")
```

## 
## Pearson's product-moment correlation
##
## data: num_df$price and num_df$revenue
## t = -0.29436, df = 6, p-value = 0.7784
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7600804 0.6390934
## sample estimates:
## cor
## -0.1193118
``` r
# Nonparametric alternative (rank-based)
cor.test(num_df$price, num_df$revenue, method = "spearman")
```

## 
## Spearman's rank correlation rho
##
## data: num_df$price and num_df$revenue
## S = 94.791, p-value = 0.7618
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1284693
Reporting template (APA-style cues; the numbers are illustrative placeholders): r(8) = .52, p = .12, 95% CI [.10, .78] (Field et al., 2012).
Why check assumptions? Statistical tests like t-tests and ANOVA are based on mathematical models that assume your data has certain properties. If these assumptions are badly violated, your p-values and conclusions can be wrong—you might claim a “significant” difference that’s actually just noise, or miss a real difference.
Good news: Most tests are “robust” to minor violations—small departures from perfect normality usually don’t matter much. The goal is to catch severe violations.
What it means: Each observation is unrelated to others—no hidden pairing or clustering.
Examples of violations:

- Measuring the same person multiple times (repeated measures: use a paired t-test or mixed models instead)
- Students nested in classrooms (scores within a class may be correlated: use multilevel models)
- Time series data (today’s temperature affects tomorrow’s)
How to check: This comes from your study design, not the data. Ask: “Could one observation influence another?”
Fix if violated: Use appropriate models (paired tests, repeated measures ANOVA, mixed models).
What it means: Within each group, the data follows an approximately bell-shaped (normal/Gaussian) distribution.
Why it matters: t-tests use the mean and assume sampling distributions are normal. With severe skew or outliers, the mean is misleading and the test loses power or gives wrong error rates.
How to check:

- Visual: histogram (should look bell-shaped), Q–Q plot (points should follow the diagonal line)
- Formal test: Shapiro-Wilk test (p > 0.05 = consistent with normality; p < 0.05 = non-normal)

Caveat: With large samples, Shapiro-Wilk rejects normality for trivial deviations; trust plots more than p-values.

What to do if violated:

- Small deviations are OK (t-tests are robust with n > 30)
- Severe skew: use medians/nonparametric tests (Wilcoxon, Mann-Whitney), or transform the data (log, square root)
- Outliers: investigate (data error? real extreme case?) and consider robust methods
Q–Q Plot explained: “Quantile-Quantile” plot compares your data’s quantiles to a theoretical normal distribution. If points hug the diagonal line, data is normal. Curves or S-shapes = non-normality. Points peeling off at ends = heavy/light tails.
What it means: The spread (variance, SD) is similar across groups you’re comparing.
Example: You’re comparing test scores: Group A has scores 60–70 (tight cluster, SD = 3); Group B has scores 40–90 (wide spread, SD = 15). Variances are unequal—this violates the assumption for the classic Student’s t-test.
Why it matters: Unequal variances distort the t-test’s error rates, especially with unequal group sizes.
How to check:

- Visual: side-by-side boxplots or histograms; if one group’s box is much taller/wider, variances may differ
- Formal test: Levene’s test (p > 0.05 = no evidence of unequal variances; p < 0.05 = unequal)

Fix if violated: Use Welch’s t-test (the R default in t.test()); it doesn’t assume equal variances and adjusts the degrees of freedom. This is safer and recommended by Field et al. (2012, ch. 9).
Beginner tip: Don’t obsess over perfect normality. Field et al. (2012) emphasize that t-tests work well even with moderate violations, especially if sample sizes are balanced and n > 30. Focus on catching severe problems (extreme skew, huge outliers, vastly unequal variances).
``` r
# Example groups: compare revenue across three regions to inspect distributions
ggplot(sales_feat, aes(x = revenue)) +
  geom_histogram(bins = 8, fill = "gray70", color = "white") +
  facet_wrap(~ region, ncol = 3) +
  theme_minimal() +
  labs(title = "Revenue Histograms by Region")
```

``` r
# Q–Q plots (requires ggpubr); only for groups with n >= 3
if (requireNamespace("ggpubr", quietly = TRUE)) {
  df_q <- dplyr::group_by(sales_feat, region) |> dplyr::filter(dplyr::n() >= 3)
  if (nrow(df_q) > 0) ggpubr::ggqqplot(df_q, x = "revenue", facet.by = "region")
}
```

``` r
# Shapiro–Wilk per group (guard for n < 3; for large n, it over-rejects)
region_counts <- table(sales_feat$region)
region_counts
```

## 
## MW NE SE
## 2 4 2
``` r
shapiro_by_region <- lapply(split(sales_feat$revenue, sales_feat$region), function(x) {
  x <- x[!is.na(x)]
  if (length(x) >= 3) {
    tryCatch(stats::shapiro.test(x), error = function(e) list(error = e$message))
  } else {
    list(note = "n < 3; Shapiro-Wilk requires at least 3 observations")
  }
})
shapiro_by_region
```

## $MW
## $MW$note
## [1] "n < 3; Shapiro-Wilk requires at least 3 observations"
##
##
## $NE
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.85567, p-value = 0.2451
##
##
## $SE
## $SE$note
## [1] "n < 3; Shapiro-Wilk requires at least 3 observations"
``` r
# Levene's test for homogeneity (car)
if (requireNamespace("car", quietly = TRUE)) {
  car::leveneTest(revenue ~ region, data = sales_feat)
}
```

Guidance (Field et al., 2012):

- Normality: minor deviations are acceptable; focus on strong skew/outliers.
- Homogeneity: if variances differ, use Welch’s t-test (the default in t.test()).
- Independence: comes from design; ensure the grouping is not repeated measures.
What is a one-sample t-test? It tests whether your sample’s average differs from a specific value you care about (a benchmark, target, or theoretical expectation).
Real-world examples:

- A factory claims light bulbs last 1000 hours on average. You test 50 bulbs: is the true mean really 1000, or different?
- A school targets a 75% pass rate. This year’s sample shows 78%: a real improvement or just random variation?
- You expect healthy adults to have a resting heart rate of 70 bpm. Your sample’s mean is 68: significantly different?
When to use it: You have one sample and want to compare its mean to a known/hypothesized value.
Hypotheses:

- H₀ (null hypothesis): The population mean equals the target value (μ = μ₀). Any difference in your sample is just random chance.
- H₁ (alternative hypothesis): The population mean differs from the target (μ ≠ μ₀). The difference is real, not chance.

In our example:

- H₀: Mean units = 5 (no difference from target)
- H₁: Mean units ≠ 5 (sample is systematically higher or lower)
How the test works:

1. Calculate how far your sample mean is from the target (μ₀)
2. Account for sample size and variability (smaller samples and high variability = less certainty)
3. Convert this to a t-statistic: t = (sample mean − μ₀) / (SE of mean)
4. Compare t to a reference distribution (the t-distribution) to get a p-value

p-value interpretation:

- p < 0.05: “statistically significant”; the difference is unlikely to be due to chance alone (reject H₀)
- p ≥ 0.05: not enough evidence to claim a real difference (fail to reject H₀)

(The 0.05 cutoff is a convention; context matters. Medical studies often use 0.01.)
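The test mechanics above fit in a few lines; a base-R sketch against a target of mu0 = 5 (the sample values are illustrative):

``` r
x <- c(5, 7, 3, 10, 2, 1, 4, 6)                      # illustrative sample
mu0 <- 5                                             # hypothesized population mean
se <- sd(x) / sqrt(length(x))                        # step 2: standard error
t_stat <- (mean(x) - mu0) / se                       # step 3: t-statistic
p_val <- 2 * pt(-abs(t_stat), df = length(x) - 1)    # step 4: two-sided p-value
c(t = t_stat, p = p_val)  # matches t.test(x, mu = 5)
```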
Confidence Interval (CI): The 95% CI tells you the plausible range for the true population mean. If the target value (5) falls outside this interval, the test is significant at p < 0.05.
Assumptions (Field et al., 2012):

- Normality: data approximately bell-shaped (or n > 30, where the Central Limit Theorem helps)
- Independence: observations unrelated

If normality is badly violated (severe skew, small sample), use the Wilcoxon signed-rank test (the nonparametric alternative).
Field et al. (2012, ch. 9) recommend verifying assumptions (normality, independence) before t tests.
``` r
set.seed(seeds[3])
units_vec <- sales_feat$units
# Check: Shapiro-Wilk for small sample (illustrative)
shapiro.test(units_vec)
```

## 
## Shapiro-Wilk normality test
## 
## data: units_vec
## W = 0.97302, p-value = 0.9206

``` r
# One-sample t-test against the target mean of 5
t.test(units_vec, mu = 5)
```

## 
## One Sample t-test
##
## data: units_vec
## t = -0.24254, df = 7, p-value = 0.8153
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
## 2.312601 7.187399
## sample estimates:
## mean of x
## 4.75
What is effect size? Statistical significance (p-value) tells you whether an effect exists (is it real or just chance?). Effect size tells you how big that effect is—its practical importance.
Why does this matter? With huge samples, even tiny, meaningless differences become “statistically significant” (p < 0.05). Conversely, with tiny samples, large important differences might not reach significance. Effect size provides context.
Real-world analogy: Imagine two scenarios:

1. A drug lowers blood pressure by 0.5 mmHg on average, p = 0.001 (highly significant with 10,000 patients)
2. A drug lowers blood pressure by 15 mmHg on average, p = 0.08 (not significant with only 20 patients)
Which drug is more important clinically? #2! Despite lacking statistical significance (small sample), the effect is huge. Effect size captures this.
d = (sample mean − target value) / SD
Interpretation: Cohen’s d expresses the difference in standard deviation units—similar logic to z-scores.
Cohen’s benchmarks (rough guidelines, Field et al., 2012, ch. 9):

- d = 0.2: small effect (subtle, hard to notice)
- d = 0.5: medium effect (noticeable to a careful observer)
- d = 0.8: large effect (obvious, practically important)
Important caveat: These benchmarks are general rules. Context matters! In education, d = 0.4 might be considered large (hard to move learning outcomes). In physics, d = 1.5 might be small (equipment precision).
Example: If light bulbs last 1050 hours (target 1000) with SD = 100:
d = (1050 − 1000) / 100 = 0.5 (medium effect—half a standard deviation longer)
If SD = 500:
d = (1050 − 1000) / 500 = 0.1 (tiny effect—barely longer than expected variability)
Same 50-hour difference, but different practical meaning depending on variability!
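The light-bulb arithmetic above can be checked directly in R:

```r
# Same 50-hour difference, two different SDs (numbers from the example above)
d_sd100 <- (1050 - 1000) / 100  # 0.5: medium effect
d_sd500 <- (1050 - 1000) / 500  # 0.1: tiny effect
c(d_sd100, d_sd500)
```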
Why report both p-value and effect size? (Field et al., 2012)
- p-value: Is the effect real or chance?
- Effect size: Is the effect big enough to care about?
Beginner tip: A statistically significant result (p < 0.05) with tiny effect size (d = 0.1) might not be worth acting on. Look for results that are both significant and have meaningful effect sizes.
# Cohen's d for the one-sample test: (sample mean - target) / SD
(mean(units_vec) - 5) / sd(units_vec)
## [1] -0.08574929
Interpretation (rough guide): 0.2 small, 0.5 medium, 0.8 large (but always consider your domain!).
What is an independent samples t-test? It compares the means of two separate (independent) groups to see if they differ significantly.
Real-world examples:
- Do men and women have different average salaries?
- Is blood pressure lower in the treatment group vs. control group?
- Do students taught with Method A score higher than those taught with Method B?
Key word: “Independent” — The two groups contain different people/units. (If the same people are measured twice—before/after—you need a paired t-test instead.)
Hypotheses:
- H₀: Group means are equal (μ₁ = μ₂). Any difference is random sampling variation.
- H₁: Group means differ (μ₁ ≠ μ₂). There’s a real systematic difference.
In our example:
- H₀: Mean revenue in Northeast = Mean revenue in Southeast
- H₁: Mean revenue in Northeast ≠ Mean revenue in Southeast
How the test works:
1. Calculate the difference between group means
2. Estimate how much means would bounce around due to random chance (standard error of the difference)
3. t-statistic = (difference in means) / (standard error)
4. Look up t in a reference distribution to get the p-value
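The first three steps can be sketched by hand. The two vectors here are made-up illustrative data, not part of the tutorial’s dataset:

```r
# Hypothetical groups for illustration only
g1 <- c(42, 55, 48, 51)
g2 <- c(60, 58, 71, 65)

diff_means <- mean(g1) - mean(g2)                         # step 1: difference in means
se_diff <- sqrt(var(g1)/length(g1) + var(g2)/length(g2))  # step 2: Welch standard error
t_stat <- diff_means / se_diff                            # step 3: t-statistic
t_stat

# R performs steps 2-4 (including the p-value lookup) internally:
t.test(g1, g2)$statistic
```

The hand-computed t matches the statistic reported by t.test() with its Welch default.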
p-value interpretation:
- p < 0.05: “Significant difference”—groups genuinely differ
- p ≥ 0.05: No significant difference—observed gap could be chance
Confidence Interval: The 95% CI for the difference in means tells you the plausible range. If CI includes zero, groups aren’t significantly different.
Student’s t-test (classic version):
- Assumes equal variances (spread) in both groups
- If this assumption is violated, results are unreliable (especially with unequal sample sizes)
Welch’s t-test (R default):
- Does not assume equal variances
- Adjusts degrees of freedom to account for different spreads
- More robust and safer; Field et al. (2012, ch. 9) recommend it as the default
Practical advice: Always use Welch’s t-test (which R runs by default in t.test()) unless you have strong theoretical reasons to assume equal variances. It is safer and performs well even when variances are equal.
Assumptions:
1. Independence: Observations in Group 1 are unrelated to Group 2
2. Normality: Data in each group approximately bell-shaped (or n > 30 per group)
3. Equal variances: Only for Student’s t; Welch relaxes this
If normality is badly violated (severe skew, outliers) in small samples, use Mann-Whitney U / Wilcoxon rank-sum test (nonparametric alternative).
Beginner tip: Inspect boxplots or histograms for each group first—look for similar shapes and spreads. Large differences in spread suggest Welch is the right choice (which it already is by default!).
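In code, the Student/Welch distinction is a single argument: var.equal = TRUE switches t.test() to the classic Student version. A minimal sketch with made-up vectors (not the tutorial’s data):

```r
# Hypothetical groups with clearly unequal spreads
a <- c(10, 12, 11, 13, 12)
b <- c(20, 35, 15, 40, 25)

t.test(a, b)                    # Welch (default): degrees of freedom adjusted downward
t.test(a, b, var.equal = TRUE)  # Student's: pooled variance, df = n1 + n2 - 2
```

Compare the df lines in the two outputs: Welch reports fewer degrees of freedom here because the spreads differ sharply.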
sub2 <- sales_feat %>% filter(region %in% c("NE","SE"))
t.test(revenue ~ region, data = sub2) # Welch by default
##
## Welch Two Sample t-test
##
## data: revenue by region
## t = -0.54584, df = 1.3846, p-value = 0.66
## alternative hypothesis: true difference in means between group NE and group SE is not equal to 0
## 95 percent confidence interval:
## -254.7101 216.7101
## sample estimates:
## mean in group NE mean in group SE
## 49 68
What is Cohen’s d for two groups? It measures how far apart the two group means are, expressed in standard deviation units. This standardization makes effect sizes comparable across different studies and measures.
Formula:
d = (Mean₁ − Mean₂) / pooled SD
The “pooled SD” is a weighted average of the two groups’ standard deviations—it represents typical within-group variability.
Intuition: Cohen’s d answers: “How many standard deviations apart are these groups?”
Example: Compare test scores:
- Group A: mean = 75, SD = 10
- Group B: mean = 85, SD = 10
- Difference = 10 points
- d = 10 / 10 = 1.0 (large effect—groups are one full SD apart)
Now compare salaries:
- Group A: mean = $50,000, SD = $20,000
- Group B: mean = $55,000, SD = $20,000
- Difference = $5,000
- d = 5,000 / 20,000 = 0.25 (small effect—only a quarter SD apart)
The $5,000 salary difference sounds big in absolute terms, but relative to variability it’s modest.
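The pooled-SD formula can be made concrete with the test-score numbers above. The group sizes (n = 30 each) are an added assumption, since the text gives only means and SDs:

```r
# Summary stats from the test-score example; n = 30 per group is assumed
m1 <- 75; s1 <- 10; n1 <- 30
m2 <- 85; s2 <- 10; n2 <- 30

# Pooled SD: weighted average of the two group variances
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
d  <- (m1 - m2) / sp
c(pooled_sd = sp, cohens_d = d)  # d = -1: one full SD apart, as in the text
```

With equal group SDs the pooled SD is simply 10, so d = -10/10 = -1; the sign only reflects the direction of subtraction.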
Cohen’s Benchmarks (Field et al., 2012, ch. 9):
- d ≈ 0.2: Small effect (subtle difference)
- d ≈ 0.5: Medium effect (noticeable to a trained observer)
- d ≈ 0.8: Large effect (obvious, substantial difference)
Critical context (Field et al., 2012): These are rough guidelines. What counts as “large” depends on your field:
- Education: d = 0.4 is often considered large (hard to improve learning outcomes)
- Psychology: d = 0.5–0.8 is typical for interventions
- Medicine: d = 0.2 might be clinically important if it’s a life-threatening condition
- Physics: d = 1.5 might be small (high precision expected)
Always interpret effect sizes in the context of your domain, prior research, and practical importance. A statistically significant result with d = 0.1 might not be worth implementing; a non-significant result with d = 0.7 (due to small sample) might warrant further investigation.
Hedges’ g: A refined version of Cohen’s d with a small-sample correction, reported by some packages (like effsize). Differences are trivial unless n < 20 per group.
Beginner tip: When reporting results, always include both:
1. Statistical significance (p-value)—is the effect real?
2. Effect size (Cohen’s d)—is the effect important?
A significant p-value with tiny d suggests the effect is real but trivial. Large d with non-significant p suggests promising but underpowered (need larger sample).
if (requireNamespace("effsize", quietly = TRUE)) {
effsize::cohen.d(revenue ~ region, data = sub2, hedges.correction = TRUE)
}
##
## Hedges's g
##
## g estimate: -0.4635957 (small)
## 95 percent confidence interval:
## lower upper
## -2.410005 1.482814
A proportion test checks whether an observed proportion (success rate) aligns with a hypothesized benchmark; conceptually it parallels the one-sample mean test, but for binary outcomes (Field et al., 2012, ch. 9). When sample sizes are small or expected counts are low, an exact binomial test may be preferable to the large-sample approximation.
# One-sample proportion test: high_n successes out of N against p = 0.3
prop.test(high_n, N, p = 0.3)
##
## 1-sample proportions test with continuity correction
##
## data: high_n out of N, null probability 0.3
## X-squared = 0.48214, df = 1, p-value = 0.4875
## alternative hypothesis: true p is not equal to 0.3
## 95 percent confidence interval:
## 0.006559917 0.533210814
## sample estimates:
## p
## 0.125
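For the exact alternative, binom.test() takes raw counts. The counts here (1 “high” out of 8) are inferred from the estimated proportion of 0.125 in the output above; verify them against your own data:

```r
# Exact binomial test against the same benchmark of 0.3
# (counts 1 of 8 inferred from the prop.test output above)
binom.test(1, 8, p = 0.3)
```

With expected counts this low (8 × 0.3 = 2.4), the exact test is the safer choice.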
Report test statistic, df, p-value, and effect size where possible (Field et al., 2012). Example: “The mean units did not differ from 5, t(8) = 0.87, p = .41, 95% CI [4.1, 6.2].”
Automating reporting reduces manual transcription errors (incorrect df, p). The template enforces inclusion of central elements (test statistic, df, p, CI). Always verify direction and context to avoid ambiguous statements (Field et al., 2012, ch. 9 recommendations).
if (requireNamespace("broom", quietly = TRUE)) {
tt <- t.test(units_vec, mu = 5)
tb <- broom::tidy(tt)
sprintf(
"The mean units %s differ from 5, t(%0.0f) = %0.2f, p = %0.3f, 95%% CI [%0.2f, %0.2f].",
ifelse(tb$p.value < 0.05, "significantly", "did not"),
tb$parameter, tb$statistic, tb$p.value, tb$conf.low, tb$conf.high
)
}
## [1] "The mean units did not differ from 5, t(7) = -0.24, p = 0.815, 95% CI [2.31, 7.19]."
These recurring issues often degrade validity or interpretability. Systematically scanning for them before formal testing (pre-flight checklist) lowers the chance of post-hoc rationalizations and improves reproducibility of your analysis narrative.
# Nonparametric alternative to independent t test
wilcox.test(revenue ~ region, data = sub2, exact = FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: revenue by region
## W = 3, p-value = 0.817
## alternative hypothesis: true location shift is not equal to 0
Exercises reinforce workflow fluency: importing, wrangling, exploring, testing, and interpreting. Prioritize clear rationale for each decision (choice of summary, test selection) to build habits consistent with transparent statistical reporting (Field et al., 2012).
Import a CSV of your choice (or create one) and:
- clean the column names with janitor::clean_names()
- add at least one derived variable with mutate()
- produce grouped summaries with group_by() and summarise()
Create a plot that compares a numeric outcome across groups with mean and error bars.
Run an independent samples t test for a meaningful comparison in your data. State H0 and H1 explicitly.
Briefly interpret your EDA plots and test results in plain language.
Compute and interpret an effect size for one of your tests (Cohen’s d for t tests). Explain what “small/medium/large” means in your context.
If assumptions are not met, rerun the analysis with an appropriate robust or nonparametric alternative and compare conclusions.
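For the error-bar exercise, one common pattern is to pre-compute group means and standard errors, then layer geom_col() with geom_errorbar(). The data frame below is a made-up stand-in; swap in your own outcome and grouping columns:

```r
library(dplyr)
library(ggplot2)

# Hypothetical data: replace with your own numeric outcome and grouping variable
set.seed(123)
df <- data.frame(
  group   = rep(c("A", "B"), each = 10),
  outcome = c(rnorm(10, 50, 5), rnorm(10, 55, 5))
)

# Summarise to one row per group: mean and standard error
summ <- df %>%
  group_by(group) %>%
  summarise(m = mean(outcome), se = sd(outcome) / sqrt(n()), .groups = "drop")

ggplot(summ, aes(group, m)) +
  geom_col(fill = "grey80") +
  geom_errorbar(aes(ymin = m - se, ymax = m + se), width = 0.2) +
  labs(x = "Group", y = "Mean outcome (±1 SE)")
```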
This section consolidates learner support materials: common error fixes, a quick glossary of R/data analytics terms, and reference mapping to Field et al. (2012).
- Package not installed: run install.packages("X") once, then library(X).
- “could not find function”: load the package first (library(pkg)) or check whether you misspelled the function.
- Object not found: run the chunk that creates sales (the import or demo data block) or check spelling.
- File path errors: build paths with here::here() and ensure the file exists.
- Type errors in calculations: convert with as.numeric() or fix the data type earlier.
Below are concise explanations of each statistical term or function used in this tutorial, paraphrased and aligned with guidance from Field, Miles, & Field (2012). Citations indicate the relevant chapter (ch.).
- filter(): Returns rows matching logical conditions (subset operation).
- select(): Chooses columns by name or helper functions; does not modify data content.
- mutate(): Adds or transforms columns by applying calculations row-wise.
- summarise(): Reduces many rows to summary statistics (e.g., mean, count); usually used with group_by().
- group_by(): Defines grouping structure so subsequent summarise or mutate operations act per group.
- pivot_longer() / pivot_wider(): Reshape data between long format (one row per measurement) and wide format (one row per entity, multiple measurement columns).
- set.seed(): Fixes the random number generator state to ensure reproducible simulation or sampling.
Tip: Print intermediate results (head(df), str(df), summary(df)) to validate the workflow at each stage; this reduces debugging time.
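A tiny end-to-end pipeline showing most of the verbs above on throwaway data:

```r
library(dplyr)
library(tidyr)

d <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3))

d %>%
  filter(x > 1) %>%                                # keep rows where x > 1
  mutate(x2 = x^2) %>%                             # add a transformed column
  group_by(g) %>%                                  # operate per group
  summarise(mean_x2 = mean(x2), .groups = "drop")  # one summary row per group

# Reshaping: wide -> long
wide <- data.frame(id = 1:2, q1 = c(5, 3), q2 = c(4, 2))
pivot_longer(wide, cols = c(q1, q2), names_to = "item", values_to = "score")
```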
Source reference for definitions unless otherwise noted: Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. Chapters indicated above.
Local reference notes:
ANLY500-Analytics-I/Knowledge/Field_ea_2012_Discovering_Statistics_using_R_normalized.txt
(or absolute path
D:\Github\data_sciences\ANLY500-Analytics-I\Knowledge\Field_ea_2012_Discovering_Statistics_using_R_normalized.txt).
Quick one-line reminders for fast scanning (see Part 7.2 for extended definitions):
End of Week 03 tutorial.