This Week 05 tutorial builds on Weeks 02-04 and focuses on data visualization using R’s ggplot2 package. It turns the lecture concepts from 05_graphs.rmd into hands-on, code-first workflows.
Reference: This tutorial draws heavily from Field, Miles, and Field [1], Discovering Statistics Using R, Chapter 4 on data visualization, with IEEE-style citations throughout.
Upgrade note: This hands-on tutorial extends the Week 05 lecture slides in
05_graphs.rmd with code-first workflows, beginner-friendly explanations, and theoretical grounding from Field et al. [1].
Run install.packages("X") once, then load the package with library(X) in each session.
By the end of this tutorial, you will be able to:
# Install if needed
# install.packages(c("tidyverse", "ggplot2", "dplyr", "tidyr", "rio", "reshape", "Hmisc", "knitr", "kableExtra", "GGally", "RColorBrewer", "moments"))
library(tidyverse)
library(ggplot2)
library(dplyr)
library(tidyr)
library(rio)
library(reshape) # For melt() - note: modern alternative is tidyr::pivot_longer()
library(Hmisc) # For confidence interval error bars
library(knitr)
library(kableExtra)
cat("\n=== REPRODUCIBLE SEED INFORMATION ===")
##
## === REPRODUCIBLE SEED INFORMATION ===
##
## Generator Name: Week05 Data Visualization - ANLY500
##
## MD5 Hash: 1684addbea1f102090becd04b4f27699
##
## Available Seeds: -328206224, -138599026, 433056330, 714983507, 289841207 ...
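The chunks below call set.seed(seeds[k]), but the code that creates seeds is not shown in this rendering. A minimal stand-in, assuming you only need the later chunks to run: the values are hypothetical placeholders, not the seeds listed above, so simulated numbers will differ from the printed tables.

```r
# Hypothetical stand-in for the tutorial's seed vector.
# These are placeholder values, NOT the seeds printed above,
# so simulated data will not match the tables in this tutorial.
seeds <- 1001:1008  # later chunks index seeds[2] through seeds[8]

# Same seed, same random draws: this is what makes a chunk reproducible
set.seed(seeds[2])
x1 <- rnorm(3)
set.seed(seeds[2])
x2 <- rnorm(3)
identical(x1, x2)
```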
What is data visualization? Field et al. [1, Ch. 4] emphasize that graphs are not merely decorative—they are essential tools for understanding data. Before running any statistical test, you should always plot your data to:
“The human visual system is extremely good at detecting patterns, and a well-designed graph can reveal features of the data that would be difficult to detect from tables of numbers alone.” [1, Ch. 4]
Tufte [2] established foundational principles for data visualization that Field et al. [1, Ch. 4] adopt:
A good graph should:
| Principle | Good_Example | Bad_Example |
|---|---|---|
| Show the data | Scatterplot showing all individual points | Only showing summary statistics |
| Focus on substance | Clear axis labels, minimal decoration | 3D effects, excessive colors, chartjunk |
| Avoid distortion | Y-axis starts at zero for bar charts | Truncated Y-axis to exaggerate differences |
| Maximize data-ink ratio | Remove unnecessary gridlines and borders | Heavy gridlines, decorative backgrounds |
| Make data coherent | Group related data with color or facets | Too many categories without grouping |
| Encourage comparison | Side-by-side plots or overlaid lines | Separate plots on different scales |
| Reveal multiple levels | Interactive plots or small multiples | Single aggregated view only |
Field et al. [1, Ch. 4] highlight common pitfalls:
1. Misleading Scales
# Example: Truncated Y-axis exaggerates differences
set.seed(seeds[2])
sales_data <- data.frame(
Quarter = c("Q1", "Q2", "Q3", "Q4"),
Sales = c(95, 97, 96, 98)
)
par(mfrow = c(1, 2))
# BAD: Truncated axis (xpd = FALSE clips the bars at the axis instead of letting them spill below it)
barplot(sales_data$Sales, names.arg = sales_data$Quarter,
main = "BAD: Truncated Y-axis\n(Exaggerates differences)",
ylab = "Sales (millions)", ylim = c(94, 99), xpd = FALSE,
col = "lightcoral", border = "darkred")
# GOOD: Full scale
barplot(sales_data$Sales, names.arg = sales_data$Quarter,
main = "GOOD: Full Y-axis\n(Shows true scale)",
ylab = "Sales (millions)", ylim = c(0, 100),
col = "lightgreen", border = "darkgreen")
par(mfrow = c(1, 1)) # Reset the plotting layout
Interpretation: The left plot (truncated Y-axis) makes Q4 sales appear dramatically higher than Q1, when the actual difference is only about 3% (95 vs. 98) [1, Ch. 4].
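One way to quantify this distortion is Tufte's lie factor [2]: the size of the effect shown in the graphic divided by the size of the effect in the data. The sketch below applies it to the Q1 and Q4 values from the example, assuming the bars are drawn upward from the truncated axis start of 94. The variable names are illustrative.

```r
# Sales values from the example above
sales <- c(Q1 = 95, Q2 = 97, Q3 = 96, Q4 = 98)

# Change in the data from Q1 to Q4, relative to Q1
data_effect <- (sales[["Q4"]] - sales[["Q1"]]) / sales[["Q1"]]

# Bar heights as drawn when the axis starts at 94 instead of 0
axis_start <- 94
drawn <- sales - axis_start
graphic_effect <- (drawn[["Q4"]] - drawn[["Q1"]]) / drawn[["Q1"]]

# Lie factor: effect shown in the graphic divided by effect in the data
lie_factor <- graphic_effect / data_effect

round(c(data = data_effect, graphic = graphic_effect, lie_factor = lie_factor), 2)
```

A lie factor near 1 means the graphic represents the data faithfully. Here the drawn bars differ by 300% while the data differ by about 3%, giving a lie factor of 95.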
2. 3D Charts and Chartjunk
Field et al. [1, Ch. 4] warn against unnecessary 3D effects, which:
- Distort perception of magnitudes
- Make precise value reading difficult
- Add visual clutter without information
3. Overlays and Pattern Confusion
When building complex ggplot2 visualizations,
stack your code across multiple lines [1, Ch. 4]. This
improves readability and makes troubleshooting easier.
# BAD: Everything on one line (hard to read and debug)
ggplot(data, aes(x, y, color = group)) + geom_point() + geom_line() + xlab("X Label") + ylab("Y Label") + theme_minimal()
# GOOD: Stacked code (easy to read and modify)
ggplot(data, aes(x, y, color = group)) +
geom_point() +
geom_line() +
xlab("X Label") +
ylab("Y Label") +
theme_minimal()
Why this matters: When you get an error, stacked code lets you comment out layers one at a time to isolate the problem [1, Ch. 4].
What is the grammar of graphics? Wilkinson [3]
introduced a systematic way to describe the components of a graph.
Wickham [4] implemented this in ggplot2, which Field et
al. [1, Ch. 4] recommend as the standard for R visualization.
Every ggplot2 graph has three essential components: data, aesthetics, and geometries.
ggplot2 builds graphs in layers [1, Ch. 4]:
- ggplot()
- geom_*()
- stat_*()
- scale_*()
- coord_*()
- theme_*()
# Demonstrate layered approach
data(mtcars)
# Start with base
p <- ggplot(mtcars, aes(x = wt, y = mpg))
# Layer 1: Just the base (empty plot)
p
# Layer 4: Add labels
p +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Weight vs. Fuel Efficiency",
x = "Weight (1000 lbs)",
y = "Miles per Gallon")
Key insight [1, Ch. 4]: By separating data, aesthetics, and geometries, ggplot2 allows you to quickly try different visualizations of the same data by changing only the geom_*() layer.
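A short sketch of that idea, using the built-in mtcars data: the data and aesthetic mappings are defined once, and only the geometry changes. The object names (base, p_points, and so on) are illustrative.

```r
library(ggplot2)

# Shared base: data and aesthetic mappings are defined once
base <- ggplot(mtcars, aes(x = factor(cyl), y = mpg))

# Same data and aesthetics, three different geometries
p_points <- base + geom_point()
p_box    <- base + geom_boxplot()
p_violin <- base + geom_violin()

# Print any of the three objects to draw it
p_box
```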
Field et al. [1, Ch. 4] recommend creating a custom theme for consistency across all your plots.
# Create a clean, professional theme
cleanup <- theme(
panel.grid.major = element_blank(), # Remove major gridlines
panel.grid.minor = element_blank(), # Remove minor gridlines
panel.background = element_blank(), # Remove background
axis.line.x = element_line(color = 'black'), # Black x-axis
axis.line.y = element_line(color = 'black'), # Black y-axis
legend.key = element_rect(fill = 'white'), # White legend background
text = element_text(size = 15) # Larger text for readability
)
# Test the theme
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(size = 3, color = "steelblue") +
labs(title = "Custom Theme Example",
x = "Weight (1000 lbs)",
y = "Miles per Gallon") +
cleanup
Practical tip: Save this cleanup theme in your R script and add it to every plot with + cleanup. This ensures all your visualizations have a consistent, professional appearance [1, Ch. 4].
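If adding + cleanup to every plot becomes repetitive, one alternative is to make it the session default with theme_set(). This is a sketch rather than the approach used in the rest of this tutorial; it repeats the cleanup definition so that it runs on its own.

```r
library(ggplot2)

# Same settings as the cleanup object defined above
cleanup <- theme(
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  panel.background = element_blank(),
  axis.line.x = element_line(color = "black"),
  axis.line.y = element_line(color = "black"),
  legend.key = element_rect(fill = "white"),
  text = element_text(size = 15)
)

# Make it the session default; theme_set() returns the previous theme
old_theme <- theme_set(theme_gray() + cleanup)

# No explicit + cleanup is needed while the default is active
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p

# Restore the previous default when finished
theme_set(old_theme)
```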
Definition [1, Ch. 4]: A histogram plots:
- X-axis: Continuous score bins
- Y-axis: Frequency (count) of observations in each bin
Why histograms matter: They help you identify:
1. Shape of the distribution (normal, skewed, bimodal)
2. Spread or variation in scores
3. Outliers or unusual values
4. Gaps in the data
Field et al. [1, Ch. 4] use a whimsical example to test the Disney philosophy: “Wishing upon a star makes all your dreams come true.”
Study design:
- 250 participants randomly assigned to either:
  - Wish upon a star for 5 years
  - Work as hard as they can for 5 years
- Outcome: Success measured on a 0–4 scale (0 = complete failure, 4 = complete success)
- Measurement: Success assessed at baseline (pre) and after 5 years (post)
# Create example data (simulating the Jiminy Cricket study)
set.seed(seeds[3])
cricket <- data.frame(
ID = 1:250,
Strategy = rep(c("Wish upon a star", "Work hard"), each = 125),
Success_Pre = c(
rnorm(125, mean = 2.0, sd = 0.8), # Wish group baseline
rnorm(125, mean = 2.0, sd = 0.8) # Work group baseline
),
Success_Post = c(
rnorm(125, mean = 2.1, sd = 0.9), # Wish group post (minimal change)
rnorm(125, mean = 3.2, sd = 0.7) # Work group post (substantial improvement)
)
)
# Constrain to 0-4 scale
cricket$Success_Pre <- pmin(pmax(cricket$Success_Pre, 0), 4)
cricket$Success_Post <- pmin(pmax(cricket$Success_Post, 0), 4)
head(cricket) %>%
kable(caption = "Jiminy Cricket Dataset (First 6 Rows)", digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| ID | Strategy | Success_Pre | Success_Post |
|---|---|---|---|
| 1 | Wish upon a star | 1.01 | 2.08 |
| 2 | Wish upon a star | 1.86 | 1.65 |
| 3 | Wish upon a star | 1.74 | 2.41 |
| 4 | Wish upon a star | 1.79 | 3.21 |
| 5 | Wish upon a star | 0.59 | 2.70 |
| 6 | Wish upon a star | 2.14 | 3.09 |
# Step 1: Define the base plot
crickethist <- ggplot(data = cricket, aes(x = Success_Pre))
# Step 2: Add histogram layer
crickethist +
geom_histogram()
What happened? ggplot2 defaulted to 30 bins and printed a message suggesting we pick a better value with binwidth. Let’s customize this [1, Ch. 4].
# Better: Specify bin width explicitly
crickethist +
geom_histogram(binwidth = 0.5, color = 'purple', fill = 'magenta') +
xlab("Success Pre-Test Score") +
ylab("Frequency") +
cleanup
Interpretation: Most participants started with success scores around 2.0 (the middle of the scale), with a roughly symmetric distribution [1, Ch. 4].
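The bin width of 0.5 above was chosen by eye. As a rough cross-check, the Freedman-Diaconis rule suggests a width of 2 × IQR / n^(1/3). The sketch below applies it to freshly simulated scores (an arbitrary seed, not the tutorial's), so treat the result as a starting point rather than the "right" answer.

```r
set.seed(42)  # arbitrary seed for this illustration only

# Scores simulated like Success_Pre: normal, then constrained to the 0-4 scale
scores <- pmin(pmax(rnorm(250, mean = 2, sd = 0.8), 0), 4)

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
fd_width <- 2 * IQR(scores) / length(scores)^(1/3)

# Number of bins that width implies over the observed range
n_bins <- ceiling(diff(range(scores)) / fd_width)

c(fd_width = round(fd_width, 2), n_bins = n_bins)
```

You could pass the resulting width to geom_histogram(binwidth = ...) and compare the plot with the hand-picked version.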
Field et al. [1, Ch. 4] use a memorable example: festival attendees rating their hygiene on each day (0 = “eau de toilet”, 4 = “eau de toilette”).
# Create festival hygiene data
set.seed(seeds[4])
festival <- data.frame(
ticknumb = 1:200,
day1 = rnorm(200, mean = 3.5, sd = 0.5),
day2 = rnorm(200, mean = 2.8, sd = 0.7),
day3 = rnorm(200, mean = 1.8, sd = 0.9)
)
# Constrain to 0-4 scale
festival$day1 <- pmin(pmax(festival$day1, 0), 4)
festival$day2 <- pmin(pmax(festival$day2, 0), 4)
festival$day3 <- pmin(pmax(festival$day3, 0), 4)
head(festival) %>%
kable(caption = "Festival Hygiene Dataset (First 6 Rows)", digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| ticknumb | day1 | day2 | day3 |
|---|---|---|---|
| 1 | 2.93 | 3.22 | 1.87 |
| 2 | 3.11 | 3.22 | 1.07 |
| 3 | 4.00 | 1.41 | 2.82 |
| 4 | 2.73 | 3.59 | 2.67 |
| 5 | 3.39 | 1.40 | 2.45 |
| 6 | 4.00 | 2.10 | 0.83 |
# Histogram of Day 1 hygiene
festivalhist <- ggplot(data = festival, aes(x = day1))
festivalhist +
geom_histogram(binwidth = 0.5, color = 'blue', fill = 'lightblue') +
xlab("Day 1 Hygiene Score") +
ylab("Frequency") +
cleanup
Interpretation: On Day 1, most festival-goers rated their hygiene highly (scores around 3.5), suggesting they arrived clean [1, Ch. 4].
# Add density curve to see distribution shape more clearly
festivalhist +
geom_histogram(aes(y = after_stat(density)), binwidth = 0.5, # after_stat() replaces the deprecated ..density.. notation
color = 'blue', fill = 'lightblue', alpha = 0.7) +
geom_density(color = 'darkblue', linewidth = 1.5) +
xlab("Day 1 Hygiene Score") +
ylab("Density") +
cleanup
Why density curves? They smooth out the histogram’s jaggedness and make it easier to assess normality [1, Ch. 4].
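A Q-Q plot is a complementary visual check of normality. The sketch below uses ggplot2's geom_qq() and geom_qq_line() on scores simulated like day1 (an arbitrary seed, not the tutorial's); the data frame name hygiene is illustrative.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed for this illustration only

# Scores simulated like day1: normal, then constrained to the 0-4 scale
hygiene <- data.frame(
  day1 = pmin(pmax(rnorm(200, mean = 3.5, sd = 0.5), 0), 4)
)

# Points that follow the reference line are consistent with normality;
# the ceiling at 4 shows up as a flat run of points in the upper tail
qq <- ggplot(hygiene, aes(sample = day1)) +
  geom_qq() +
  geom_qq_line() +
  xlab("Theoretical Quantiles") +
  ylab("Sample Quantiles")
qq
```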
Definition [1, Ch. 4]: A scatterplot displays:
- X-axis: One continuous variable (predictor)
- Y-axis: Another continuous variable (outcome)
- Points: Each observation plotted at (x, y) coordinates
Why scatterplots matter: They reveal:
1. Direction of relationship (positive, negative, none)
2. Strength of relationship (tight cluster vs. wide scatter)
3. Form of relationship (linear, curved, no pattern)
4. Outliers that might distort correlations or regressions
Field et al. [1, Ch. 4] use a dataset of 103 students who rated their exam anxiety, time spent revising, exam performance, and gender.
# Create exam anxiety dataset
set.seed(seeds[5])
exam <- data.frame(
Code = 1:103,
Revise = runif(103, min = 5, max = 30), # Hours of revision
Exam = rnorm(103, mean = 60, sd = 15), # Exam score
Anxiety = rnorm(103, mean = 50, sd = 20), # Anxiety score
Gender = sample(1:2, 103, replace = TRUE)
)
# Create negative correlation between anxiety and exam score
exam$Exam <- 80 - 0.4 * exam$Anxiety + rnorm(103, mean = 0, sd = 10)
exam$Exam <- pmin(pmax(exam$Exam, 0), 100) # Constrain to 0-100
# Factor gender
exam$Gender <- factor(exam$Gender, levels = c(1, 2), labels = c("Male", "Female"))
head(exam) %>%
kable(caption = "Exam Anxiety Dataset (First 6 Rows)", digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Code | Revise | Exam | Anxiety | Gender |
|---|---|---|---|---|
| 1 | 25.3 | 62.0 | 60.7 | Female |
| 2 | 11.2 | 52.4 | 54.8 | Female |
| 3 | 20.8 | 69.6 | 36.8 | Female |
| 4 | 14.0 | 74.7 | 8.1 | Female |
| 5 | 17.3 | 60.9 | 45.1 | Male |
| 6 | 29.8 | 55.6 | 74.4 | Male |
# Basic scatterplot
scatter <- ggplot(exam, aes(x = Anxiety, y = Exam))
scatter +
geom_point(size = 3, color = "steelblue", alpha = 0.7) +
xlab("Anxiety Score") +
ylab("Exam Score (%)") +
cleanup
Interpretation: There appears to be a negative relationship: as anxiety increases, exam scores tend to decrease [1, Ch. 4].
# Add linear regression line with confidence band
scatter +
geom_point(size = 3, color = "steelblue", alpha = 0.7) +
geom_smooth(method = 'lm', formula = y ~ x,
color = 'black', fill = 'lightblue') +
xlab('Anxiety Score') +
ylab('Exam Score (%)') +
cleanup
Interpretation: The regression line confirms the negative trend. The shaded band shows the 95% confidence interval for the fitted line; it is narrowest near the mean anxiety score and widens toward the extremes, where the line is estimated less precisely [1, Ch. 4].
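To see where the band's shape comes from, the same interval can be computed with lm() and predict(). This sketch simulates data with the same structure as the exam dataset (an arbitrary seed, not the tutorial's) and compares the band width at the minimum, mean, and maximum anxiety values.

```r
set.seed(42)  # arbitrary seed for this illustration only

# Simulated data with the same structure as the exam dataset
anxiety <- rnorm(103, mean = 50, sd = 20)
exam_score <- 80 - 0.4 * anxiety + rnorm(103, mean = 0, sd = 10)
fit <- lm(exam_score ~ anxiety)

# 95% confidence interval for the fitted mean at three anxiety values
new_x <- data.frame(anxiety = c(min(anxiety), mean(anxiety), max(anxiety)))
ci <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

# Band width (upper minus lower) is smallest at the mean of x
band_width <- ci[, "upr"] - ci[, "lwr"]
names(band_width) <- c("min", "mean", "max")
round(band_width, 2)
```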
Question: Does the anxiety-performance relationship differ by gender?
# Scatterplot with groups
scatter2 <- ggplot(exam, aes(x = Anxiety, y = Exam,
color = Gender, fill = Gender))
scatter2 +
geom_point(size = 3, alpha = 0.7) +
geom_smooth(method = "lm", formula = y ~ x, alpha = 0.2) +
xlab("Anxiety Score") +
ylab("Exam Score (%)") +
cleanup +
scale_fill_manual(name = "Gender of Participant",
labels = c("Men", "Women"),
values = c("purple", "grey")) +
scale_color_manual(name = "Gender of Participant",
labels = c("Men", "Women"),
values = c("purple", "grey10"))
Interpretation: Both genders show negative anxiety-performance relationships, with similar slopes. This suggests the effect of anxiety on exam performance is consistent across genders [1, Ch. 4].
For exploring multiple relationships simultaneously, Field et al. [1, Ch. 4] recommend scatterplot matrices.
library(GGally)
ggpairs(data = exam[, c("Revise", "Anxiety", "Exam", "Gender")],
title = "Exam Anxiety: Scatterplot Matrix",
mapping = aes(color = Gender, alpha = 0.5))
Interpretation:
- Diagonal: Distributions of each variable by gender
- Lower triangle: Scatterplots showing bivariate relationships
- Upper triangle: Correlation coefficients
This reveals a clear negative correlation between anxiety and exam scores [1, Ch. 4]. Because Revise is generated independently of Exam in this simulated dataset, any correlation between revision time and exam scores here is sampling noise rather than a built-in effect.
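The coefficients in the upper triangle can be checked directly with cor(). A minimal sketch on data simulated the same way as the exam dataset (an arbitrary seed, not the tutorial's), so the exact values will differ from the matrix above:

```r
set.seed(42)  # arbitrary seed for this illustration only

# Simulated the same way as the exam dataset
revise <- runif(103, min = 5, max = 30)
anxiety <- rnorm(103, mean = 50, sd = 20)
exam_score <- 80 - 0.4 * anxiety + rnorm(103, mean = 0, sd = 10)

# Pairwise Pearson correlations
cor_matrix <- cor(cbind(revise, anxiety, exam_score))
round(cor_matrix, 2)

# 95% confidence interval for the anxiety-exam correlation
cor.test(anxiety, exam_score)$conf.int
```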
Wide format: Each participant gets one row; repeated measures are in separate columns [1, Ch. 4].
Long format: Each measurement gets one row; a variable indicates which measure it is [1, Ch. 4].
# Wide format example
wide_data <- data.frame(
ID = 1:5,
Baseline = c(10, 12, 11, 13, 10),
Month_6 = c(15, 17, 14, 18, 16),
Month_12 = c(20, 22, 19, 23, 21)
)
wide_data %>%
kable(caption = "Wide Format: One Row per Participant") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| ID | Baseline | Month_6 | Month_12 |
|---|---|---|---|
| 1 | 10 | 15 | 20 |
| 2 | 12 | 17 | 22 |
| 3 | 11 | 14 | 19 |
| 4 | 13 | 18 | 23 |
| 5 | 10 | 16 | 21 |
# Convert to long format
long_data <- wide_data %>%
pivot_longer(cols = c(Baseline, Month_6, Month_12),
names_to = "Time",
values_to = "Score")
long_data %>%
head(10) %>%
kable(caption = "Long Format: One Row per Measurement") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| ID | Time | Score |
|---|---|---|
| 1 | Baseline | 10 |
| 1 | Month_6 | 15 |
| 1 | Month_12 | 20 |
| 2 | Baseline | 12 |
| 2 | Month_6 | 17 |
| 2 | Month_12 | 22 |
| 3 | Baseline | 11 |
| 3 | Month_6 | 14 |
| 3 | Month_12 | 19 |
| 4 | Baseline | 13 |
Why reshape? ggplot2 requires long
format for most visualizations because it maps variables to
aesthetics—you need a single column for the outcome and a factor column
to indicate groups [1, Ch. 4].
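Reshaping is reversible. If another function needs the wide layout back, tidyr::pivot_wider() undoes the pivot. A small sketch with illustrative object names (wide, long, back):

```r
library(tidyr)

# A small wide table: one row per participant
wide <- data.frame(
  ID = 1:3,
  Baseline = c(10, 12, 11),
  Month_6 = c(15, 17, 14)
)

# Wide to long: one row per measurement
long <- pivot_longer(wide, cols = c(Baseline, Month_6),
                     names_to = "Time", values_to = "Score")

# Long back to wide: pivot_wider() reverses the reshape
back <- pivot_wider(long, names_from = Time, values_from = Score)

dim(long)
all.equal(as.data.frame(back), wide)
```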
# Reshape to long format
longcricket <- cricket %>%
pivot_longer(cols = c(Success_Pre, Success_Post),
names_to = "Time",
values_to = "Score") %>%
mutate(Time = factor(Time,
levels = c("Success_Pre", "Success_Post"),
labels = c("Baseline", "5 Years")))
longcricket %>%
head(10) %>%
kable(caption = "Long Format Cricket Data", digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| ID | Strategy | Time | Score |
|---|---|---|---|
| 1 | Wish upon a star | Baseline | 1.01 |
| 1 | Wish upon a star | 5 Years | 2.08 |
| 2 | Wish upon a star | Baseline | 1.86 |
| 2 | Wish upon a star | 5 Years | 1.65 |
| 3 | Wish upon a star | Baseline | 1.74 |
| 3 | Wish upon a star | 5 Years | 2.41 |
| 4 | Wish upon a star | Baseline | 1.79 |
| 4 | Wish upon a star | 5 Years | 3.21 |
| 5 | Wish upon a star | Baseline | 0.59 |
| 5 | Wish upon a star | 5 Years | 2.70 |
Definition [1, Ch. 4]: A bar graph displays:
- X-axis: Categorical groups
- Y-axis: Mean of a continuous outcome
- Bars: Height represents the group mean
- Error bars: Show precision of the mean (confidence interval, standard error, or standard deviation)
Why bar graphs matter: They allow quick visual comparison of means across groups, with error bars indicating whether differences are likely meaningful [1, Ch. 4].
Field et al. [1, Ch. 4] discuss three common types of error bar: standard deviation (SD), standard error (SE), and confidence interval (CI).
Recommendation: Use 95% confidence intervals for inferential purposes—non-overlapping CIs suggest significant differences [1, Ch. 4].
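The three quantities are related by simple arithmetic. The sketch below computes them by hand for mtcars$mpg using a t-based 95% interval; the plots in this tutorial get their intervals from mean_cl_normal, which should give the same t-based result.

```r
x <- mtcars$mpg
n <- length(x)

m  <- mean(x)
s  <- sd(x)          # Standard deviation: spread of the raw scores
se <- s / sqrt(n)    # Standard error: precision of the sample mean

# 95% confidence interval: mean +/- critical t value * SE
t_crit <- qt(0.975, df = n - 1)
ci <- c(lower = m - t_crit * se, upper = m + t_crit * se)

round(c(mean = m, sd = s, se = se, ci), 2)
```

Because SE divides SD by the square root of n, SE bars are always shorter than SD bars, and 95% CI bars are roughly twice the length of SE bars for moderate sample sizes.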
Field et al. [1, Ch. 4] describe a study testing whether “chick flicks” exist:
Study design:
- 20 men and 20 women randomly assigned to watch either:
  - Bridget Jones’s Diary (presumed “chick flick”)
  - Memento (control movie)
- Outcome: Physiological arousal (higher = more enjoyment)
# Create chick flick dataset
set.seed(seeds[6])
chickflick <- data.frame(
gender = rep(1:2, each = 20),
film = rep(rep(1:2, each = 10), 2),
arousal = c(
rnorm(10, mean = 15, sd = 5), # Men watching Bridget Jones
rnorm(10, mean = 18, sd = 5), # Men watching Memento
rnorm(10, mean = 30, sd = 6), # Women watching Bridget Jones
rnorm(10, mean = 17, sd = 5) # Women watching Memento
)
)
# Factor variables
chickflick$gender <- factor(chickflick$gender,
levels = c(1, 2),
labels = c("Male", "Female"))
chickflick$film <- factor(chickflick$film,
levels = c(1, 2),
labels = c("Bridget Jones", "Memento"))
head(chickflick) %>%
kable(caption = "Chick Flick Dataset (First 6 Rows)", digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| gender | film | arousal |
|---|---|---|
| Male | Bridget Jones | 17.0 |
| Male | Bridget Jones | 13.9 |
| Male | Bridget Jones | 15.4 |
| Male | Bridget Jones | 10.2 |
| Male | Bridget Jones | 13.2 |
| Male | Bridget Jones | 26.8 |
# Bar chart showing mean arousal by film
chickbar <- ggplot(chickflick, aes(x = film, y = arousal))
chickbar +
stat_summary(fun = mean,
geom = "bar",
fill = "White",
color = "Black") +
stat_summary(fun.data = mean_cl_normal,
geom = "errorbar",
width = 0.2) +
xlab("Film Watched") +
ylab("Arousal Level") +
cleanup
Interpretation: On average, participants showed similar arousal levels for both films when gender was ignored. The error bars overlap substantially, suggesting no significant difference [1, Ch. 4].
# Bar chart with gender as second factor
chickbar2 <- ggplot(chickflick, aes(x = film, y = arousal, fill = gender))
chickbar2 +
stat_summary(fun = mean,
geom = "bar",
position = "dodge") +
stat_summary(fun.data = mean_cl_normal,
geom = "errorbar",
position = position_dodge(width = 0.90),
width = 0.2) +
xlab("Film Watched") +
ylab("Arousal Level") +
cleanup +
scale_fill_manual(name = "Gender of Participant",
labels = c("Men", "Women"),
values = c("gray30", "gray70"))
Interpretation: Now the pattern is clear! Women showed much higher arousal watching Bridget Jones than Memento, while men showed similar arousal for both films. This supports the “chick flick” hypothesis [1, Ch. 4].
Key insight: The interaction between gender and film was hidden when we ignored gender. This demonstrates why visualizing data by subgroups is crucial [1, Ch. 4].
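The same contrast can be read from a table of means. This sketch simulates data with the same design as chickflick (an arbitrary seed, not the tutorial's) and uses base R's aggregate() to compute means with and without gender; the object names are illustrative.

```r
set.seed(42)  # arbitrary seed for this illustration only

# Same 2 x 2 design as the chickflick data, 10 participants per cell
demo <- data.frame(
  gender = rep(c("Male", "Female"), each = 20),
  film = rep(rep(c("Bridget Jones", "Memento"), each = 10), times = 2),
  arousal = c(rnorm(10, mean = 15, sd = 5), rnorm(10, mean = 18, sd = 5),
              rnorm(10, mean = 30, sd = 6), rnorm(10, mean = 17, sd = 5))
)

# Means ignoring gender (what the first bar chart shows)
film_means <- aggregate(arousal ~ film, data = demo, FUN = mean)
film_means

# Means for each gender-by-film cell (what the grouped bar chart shows)
cell_means <- aggregate(arousal ~ gender + film, data = demo, FUN = mean)
cell_means
```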
Field et al. [1, Ch. 4] recommend line graphs when:
Why lines instead of bars? Lines emphasize continuity and trends, making them ideal for time-series or dose-response data [1, Ch. 4].
Field et al. [1, Ch. 4] describe a study testing three hiccup cures against a no-treatment baseline:
Study design:
- 15 participants were measured under four conditions:
  - Baseline (no treatment)
  - Tongue pulling for 1 minute
  - Carotid artery massage
  - Digital rectal massage (yes, really!)
- Outcome: Number of hiccups in the minute after each procedure
# Create hiccups dataset
set.seed(seeds[7])
hiccups <- data.frame(
Participant = 1:15,
Baseline = rpois(15, lambda = 15),
Tongue = rpois(15, lambda = 12),
Carotid = rpois(15, lambda = 8),
Other = rpois(15, lambda = 5)
)
head(hiccups) %>%
kable(caption = "Hiccups Dataset (First 6 Rows)") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Participant | Baseline | Tongue | Carotid | Other |
|---|---|---|---|---|
| 1 | 15 | 9 | 6 | 4 |
| 2 | 20 | 17 | 8 | 4 |
| 3 | 17 | 10 | 6 | 4 |
| 4 | 12 | 17 | 9 | 6 |
| 5 | 16 | 7 | 10 | 4 |
| 6 | 12 | 10 | 7 | 5 |
# Reshape to long format
longhiccups <- hiccups %>%
pivot_longer(cols = c(Baseline, Tongue, Carotid, Other),
names_to = "Intervention",
values_to = "Hiccups") %>%
mutate(Intervention = factor(Intervention,
levels = c("Baseline", "Tongue", "Carotid", "Other")))
longhiccups %>%
head(10) %>%
kable(caption = "Long Format Hiccups Data") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Participant | Intervention | Hiccups |
|---|---|---|
| 1 | Baseline | 15 |
| 1 | Tongue | 9 |
| 1 | Carotid | 6 |
| 1 | Other | 4 |
| 2 | Baseline | 20 |
| 2 | Tongue | 17 |
| 2 | Carotid | 8 |
| 2 | Other | 4 |
| 3 | Baseline | 17 |
| 3 | Tongue | 10 |
# Line graph of mean hiccups by intervention
hiccupline <- ggplot(longhiccups, aes(x = Intervention, y = Hiccups))
hiccupline +
stat_summary(fun = mean, geom = "point", size = 4) +
stat_summary(fun = mean, geom = "line", aes(group = 1), linewidth = 1) +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2) +
xlab("Intervention Type") +
ylab("Number of Hiccups") +
cleanup
Interpretation: All three interventions reduced hiccups compared to baseline, with the “Other” intervention (digital rectal massage) being most effective. Because Intervention is a factor, ggplot2 would otherwise treat each level as its own group and draw no connecting line; aes(group = 1) tells it to connect all points with a single line [1, Ch. 4].
Field et al. [1, Ch. 4] describe a study testing whether text messaging harms grammar:
Study design:
- 50 children randomly assigned to:
  - Text messaging allowed (n = 25)
  - Text messaging forbidden (n = 25)
- Outcome: Grammar test score (% correct) at baseline and 6 months
# Create texting dataset
set.seed(seeds[8])
texting <- data.frame(
Group = rep(1:2, each = 25),
Baseline = c(
rnorm(25, mean = 65, sd = 10), # Texting allowed
rnorm(25, mean = 65, sd = 10) # No texting
),
Six_months = c(
rnorm(25, mean = 60, sd = 12), # Texting allowed (decline)
rnorm(25, mean = 70, sd = 10) # No texting (improvement)
)
)
# Factor group
texting$Group <- factor(texting$Group,
levels = c(1, 2),
labels = c("Texting Allowed", "No Texting Allowed"))
head(texting) %>%
kable(caption = "Texting Dataset (First 6 Rows)", digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Group | Baseline | Six_months |
|---|---|---|
| Texting Allowed | 67.2 | 76.0 |
| Texting Allowed | 64.8 | 67.8 |
| Texting Allowed | 57.0 | 63.5 |
| Texting Allowed | 89.0 | 64.7 |
| Texting Allowed | 68.4 | 49.2 |
| Texting Allowed | 43.4 | 45.7 |
# Reshape to long format
longtexting <- texting %>%
pivot_longer(cols = c(Baseline, Six_months),
names_to = "Time",
values_to = "Grammar_Score") %>%
mutate(Time = factor(Time,
levels = c("Baseline", "Six_months"),
labels = c("Baseline", "Six Months")))
longtexting %>%
head(10) %>%
kable(caption = "Long Format Texting Data", digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Group | Time | Grammar_Score |
|---|---|---|
| Texting Allowed | Baseline | 67.2 |
| Texting Allowed | Six Months | 76.0 |
| Texting Allowed | Baseline | 64.8 |
| Texting Allowed | Six Months | 67.8 |
| Texting Allowed | Baseline | 57.0 |
| Texting Allowed | Six Months | 63.5 |
| Texting Allowed | Baseline | 89.0 |
| Texting Allowed | Six Months | 64.7 |
| Texting Allowed | Baseline | 68.4 |
| Texting Allowed | Six Months | 49.2 |
# Line graph with two factors
textline <- ggplot(longtexting, aes(x = Time, y = Grammar_Score, color = Group))
textline +
stat_summary(fun = mean, geom = "point", size = 4) +
stat_summary(fun = mean, geom = "line", aes(group = Group), linewidth = 1) +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2) +
xlab("Measurement Time") +
ylab("Mean Grammar Score (%)") +
cleanup +
scale_color_manual(name = "Texting Option",
labels = c("All the texts", "None of the texts"),
values = c("black", "grey50")) +
scale_x_discrete(labels = c("Baseline", "Six Months"))
Interpretation: This reveals a clear interaction effect [1, Ch. 4]:
- Texting allowed group: Grammar scores declined from baseline to 6 months
- No texting group: Grammar scores improved over the same period
The diverging lines indicate that the effect of time depends on the texting condition. This pattern is consistent with the hypothesis that text messaging harms grammar development, although the plot alone does not establish it; a formal test of the interaction is still needed [1, Ch. 4].
Field et al. [1, Ch. 4] emphasize choosing colors that:
1. Are distinguishable by colorblind individuals
2. Print well in grayscale
3. Have sufficient contrast
# Compare color schemes
library(RColorBrewer)
# Display colorblind-friendly palettes
display.brewer.all(colorblindFriendly = TRUE)
# Example with colorblind-friendly palette
ggplot(chickflick, aes(x = film, y = arousal, fill = gender)) +
stat_summary(fun = mean, geom = "bar", position = "dodge") +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar",
position = position_dodge(0.9), width = 0.2) +
scale_fill_brewer(palette = "Set2", name = "Gender") +
xlab("Film") +
ylab("Arousal Level") +
cleanup
What is faceting? Creating multiple subplots for different groups [1, Ch. 4].
# Faceted histogram by gender
ggplot(exam, aes(x = Exam, fill = Gender)) +
geom_histogram(binwidth = 5, color = "white", alpha = 0.8) +
facet_wrap(~ Gender, ncol = 1) +
xlab("Exam Score (%)") +
ylab("Frequency") +
cleanup +
theme(legend.position = "none")
Why faceting? It avoids overplotting and makes comparisons clearer when you have many groups [1, Ch. 4].
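facet_wrap() splits on one variable. When two grouping variables matter, facet_grid() arranges the panels in rows and columns. A sketch using the built-in mtcars data, with am (transmission) as rows and cyl (cylinders) as columns; these variable choices are illustrative.

```r
library(ggplot2)

# Rows = transmission type (am), columns = number of cylinders (cyl)
p_grid <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 3, color = "white") +
  facet_grid(am ~ cyl, labeller = label_both) +
  xlab("Miles per Gallon") +
  ylab("Frequency")
p_grid
```

The label_both labeller prints the variable name alongside its value in each strip, which helps when the facet values are bare numbers.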
# Save as PNG (for presentations)
ggsave("figure1.png", width = 8, height = 5, dpi = 300)
# Save as PDF (for publications)
ggsave("figure1.pdf", width = 8, height = 5)
# Save as SVG (for vector editing; requires the svglite package)
ggsave("figure1.svg", width = 8, height = 5)
Recommendation [1, Ch. 4]: Use 300 dpi for print publications, 150 dpi for web/presentations.
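By default, ggsave() saves the last plot that was displayed, which can be the wrong one in a long script. A safer pattern is to store the plot in an object and pass it through the plot argument. This sketch writes to a temporary directory so it leaves no files in your project; the object names are illustrative.

```r
library(ggplot2)

# Store the plot in an object instead of relying on "the last plot displayed"
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Write to a temporary directory so the example leaves nothing behind
out_file <- file.path(tempdir(), "figure1.png")

# Passing plot = p makes it explicit which figure is saved
ggsave(out_file, plot = p, width = 8, height = 5, dpi = 300)

file.exists(out_file)
```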
# Add annotations to highlight key findings
ggplot(exam, aes(x = Anxiety, y = Exam)) +
geom_point(size = 3, alpha = 0.6, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "black") +
annotate("text", x = 80, y = 80,
label = "High anxiety\n→ Lower scores",
size = 5, color = "darkred") +
annotate("segment", x = 75, xend = 85, y = 75, yend = 55,
arrow = arrow(length = unit(0.3, "cm")), color = "darkred") +
xlab("Anxiety Score") +
ylab("Exam Score (%)") +
cleanup
Field et al. [1, Ch. 4] recommend this workflow:
| Step | Action | Key_Function |
|---|---|---|
| 1 | Explore data with summary statistics | summary(), str(), head() |
| 2 | Check for missing values and outliers | is.na(), boxplot() |
| 3 | Choose appropriate plot type for your question | See Table 9.1 |
| 4 | Create basic plot with ggplot2 | ggplot(data, aes(…)) |
| 5 | Add layers (error bars, regression lines, etc.) | |
| 6 | Customize aesthetics (colors, labels, theme) | |
| 7 | Check assumptions visually (normality, linearity) | Q-Q plots, residual plots |
| 8 | Save high-quality figure with ggsave() | ggsave("file.png", dpi = 300) |
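Steps 1 and 2 of the workflow can be sketched with base R alone. The example uses the built-in airquality dataset because it contains missing values; it is not one of this tutorial's datasets.

```r
df <- airquality

# Step 1: structure and summary statistics
str(df)
summary(df$Ozone)

# Step 2a: count missing values in each column
na_counts <- colSums(is.na(df))
na_counts

# Step 2b: flag outliers using the boxplot rule
# (points beyond 1.5 * IQR from the box)
outliers <- boxplot.stats(df$Ozone)$out
outliers

# The same rule, shown as a plot
boxplot(df$Ozone, main = "Ozone", ylab = "Ozone (ppb)")
```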
| Data_Structure | Research_Question | Recommended_Plot | ggplot2_Geom |
|---|---|---|---|
| One continuous variable | What is the distribution shape? | Histogram + density curve | geom_histogram() + geom_density() |
| One continuous variable | Are there outliers? | Boxplot | geom_boxplot() |
| One categorical variable | What are the frequencies? | Bar chart (counts) | geom_bar() |
| Two continuous variables | Is there a relationship? | Scatterplot ± regression line | geom_point() + geom_smooth() |
| Continuous outcome by categorical group | Do group means differ? | Bar chart with error bars | stat_summary(geom = "bar") |
| Continuous outcome by categorical group | Do group means differ? | Boxplot or violin plot | geom_boxplot() / geom_violin() |
| Repeated measures / time series | How do values change over time? | Line graph with error bars | stat_summary(geom = "line") |
| Multiple continuous variables | What are pairwise relationships? | Scatterplot matrix (GGally::ggpairs) | GGally::ggpairs() |
# ❌ MISTAKE 1: Forgetting aes() for variable mappings
ggplot(data, x = variable, y = variable2) + geom_point()
# ✅ FIX: Wrap variables in aes()
ggplot(data, aes(x = variable, y = variable2)) + geom_point()
# ❌ MISTAKE 2: Using = instead of == in filters
data %>% filter(group = "A")
# ✅ FIX: Use == for logical comparison
data %>% filter(group == "A")
# ❌ MISTAKE 3: Forgetting group aesthetic for line plots
ggplot(data, aes(x = time, y = score, color = group)) +
stat_summary(fun = mean, geom = "line")
# ✅ FIX: Add group aesthetic
ggplot(data, aes(x = time, y = score, color = group)) +
stat_summary(fun = mean, geom = "line", aes(group = group))
# ❌ MISTAKE 4: Not reshaping data to long format
# Wide format won't work for most ggplot2 visualizations
# ✅ FIX: Use pivot_longer() or melt()
long_data <- wide_data %>%
pivot_longer(cols = c(var1, var2), names_to = "variable", values_to = "value")
Based on Field et al. [1, Ch. 4]:
Using the mtcars dataset:
- Histogram of mpg with appropriate bin width
# Your code here
library(moments)
# Histogram with density
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 2, # after_stat() replaces the deprecated ..density.. notation
fill = "lightblue", color = "white") +
geom_density(color = "darkblue", linewidth = 1.5) +
xlab("Miles per Gallon") +
ylab("Density") +
cleanup
# Distribution statistics
cat("Skewness:", skewness(mtcars$mpg), "\n")
cat("Kurtosis:", kurtosis(mtcars$mpg), "\n")
Using the iris dataset:
- Sepal.Length vs. Sepal.Width
- Species
Using the chickflick dataset:
# Your code here
ggplot(chickflick, aes(x = film, y = arousal)) +
stat_summary(fun = mean, geom = "bar", fill = "steelblue") +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2) +
scale_x_discrete(labels = c("Chick Flick", "Action Film")) +
xlab("Film Type") +
ylab("Mean Arousal Level") +
cleanup
Using the longtexting dataset:
# Your code here
ggplot(longtexting, aes(x = Time, y = Grammar_Score, color = Group)) +
stat_summary(fun = mean, geom = "point", size = 4) +
stat_summary(fun = mean, geom = "line", aes(group = Group), linewidth = 1.5) +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2) +
scale_color_brewer(palette = "Set1") +
labs(title = "Text Messaging Harms Grammar Development",
subtitle = "Interaction between time and texting condition",
x = "Time Point",
y = "Grammar Score (%)") +
cleanup
Aesthetic (aes): Mapping from data variables to visual properties (x, y, color, size, shape) [1, Ch. 4].
Bar chart: Graph displaying means of a continuous outcome across categorical groups, typically with error bars [1, Ch. 4].
Binwidth: Width of bins in a histogram; smaller values show more detail but may appear noisy [1, Ch. 4].
Chartjunk: Unnecessary decorative elements that distract from data (3D effects, excessive colors, patterns) [2].
Confidence interval (CI): Range of values likely containing the true population parameter; 95% CIs are standard [1, Ch. 2].
Density curve: Smoothed representation of a distribution’s shape, useful for assessing normality [1, Ch. 4].
Error bar: Visual representation of uncertainty (SD, SE, or CI) displayed on bar charts and line graphs [1, Ch. 4].
Faceting: Creating multiple subplots for different
groups using facet_wrap() or facet_grid() [1,
Ch. 4].
Geometry (geom): The type of visual representation (points, lines, bars) in a ggplot2 layer [1, Ch. 4].
Grammar of graphics: Systematic framework for describing graph components (data, aesthetics, geometries) [3, 4].
Histogram: Graph showing the frequency distribution of a continuous variable [1, Ch. 4].
Interaction effect: When the effect of one variable depends on the level of another variable; revealed by non-parallel lines [1, Ch. 4].
Layer: Building block of ggplot2 graphs; each layer adds a visual or statistical element [1, Ch. 4].
Line graph: Graph connecting means over time or ordered conditions, emphasizing trends [1, Ch. 4].
Long format: Data structure where each measurement gets its own row; required for most ggplot2 visualizations [1, Ch. 4].
Overplotting: When many points overlap, obscuring the true density; solved with transparency (alpha) or jittering [1, Ch. 4].
Scatterplot: Graph showing the relationship between two continuous variables [1, Ch. 4].
Scatterplot matrix: Grid of scatterplots showing pairwise relationships among multiple variables [1, Ch. 4].
Standard error (SE): Standard deviation of the sampling distribution of a statistic (here, the mean); measures how precisely the sample mean estimates the population mean [1, Ch. 2].
Theme: Non-data elements of a plot (fonts,
gridlines, backgrounds); customized with theme() [1, Ch.
4].
Wide format: Data structure where each participant gets one row and repeated measures are in separate columns [1, Ch. 4].
[1] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London: SAGE Publications, 2012.
[2] E. R. Tufte, The Visual Display of Quantitative Information, 2nd ed. Cheshire, CT: Graphics Press, 2001.
[3] L. Wilkinson, The Grammar of Graphics, 2nd ed. New York: Springer, 2005.
[4] H. Wickham, ggplot2: Elegant Graphics for Data Analysis, 2nd ed. New York: Springer, 2016.
[5] W. S. Cleveland, The Elements of Graphing Data, 2nd ed. Summit, NJ: Hobart Press, 1994.
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-3 GGally_2.4.0 kableExtra_1.4.0 knitr_1.50
## [5] Hmisc_5.2-4 reshape_0.8.10 rio_1.2.4 lubridate_1.9.4
## [9] forcats_1.0.1 stringr_1.6.0 dplyr_1.1.4 purrr_1.2.0
## [13] readr_2.1.6 tidyr_1.3.1 tibble_3.3.0 ggplot2_4.0.1
## [17] tidyverse_2.0.0 seedhash_0.1.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.54 bslib_0.9.0 htmlwidgets_1.6.4
## [5] lattice_0.22-7 tzdb_0.5.0 vctrs_0.6.5 tools_4.5.2
## [9] generics_0.1.4 cluster_2.1.8.1 pkgconfig_2.0.3 Matrix_1.7-4
## [13] data.table_1.17.8 checkmate_2.3.3 S7_0.2.1 lifecycle_1.0.4
## [17] compiler_4.5.2 farver_2.1.2 textshaping_1.0.4 htmltools_0.5.8.1
## [21] sass_0.4.10 yaml_2.3.10 htmlTable_2.4.3 Formula_1.2-5
## [25] pillar_1.11.1 jquerylib_0.1.4 cachem_1.1.0 rpart_4.1.24
## [29] nlme_3.1-168 ggstats_0.11.0 tidyselect_1.2.1 digest_0.6.37
## [33] stringi_1.8.7 splines_4.5.2 labeling_0.4.3 fastmap_1.2.0
## [37] grid_4.5.2 colorspace_2.1-2 cli_3.6.5 magrittr_2.0.4
## [41] base64enc_0.1-3 foreign_0.8-90 withr_3.0.2 scales_1.4.0
## [45] backports_1.5.0 timechange_0.3.0 rmarkdown_2.30 nnet_7.3-20
## [49] gridExtra_2.3 hms_1.1.4 evaluate_1.0.5 viridisLite_0.4.2
## [53] mgcv_1.9-3 rlang_1.1.6 Rcpp_1.1.0 glue_1.8.0
## [57] xml2_1.5.1 svglite_2.2.2 rstudioapi_0.17.1 jsonlite_2.0.0
## [61] R6_2.6.1 plyr_1.8.9 systemfonts_1.3.1
End of Week 05: R for Data Analytics Tutorial
ANLY 500 - Analytics I
Harrisburg University