This Week 06 tutorial builds on our previous data analytics foundations and focuses on the critical first step of any analysis: Data Screening. Before running statistical models, we must ensure our data is clean, accurate, and meets the necessary assumptions. This tutorial turns the lecture concepts from 06_datascreen_1.rmd into hands-on, code-first workflows, with beginner-friendly explanations and theoretical grounding from Field et al. [1].
Reference: This tutorial draws heavily from Field, Miles, and Field [1], Discovering Statistics Using R, particularly the chapters on data screening and cleaning, with IEEE-style citations throughout.
If a package is not yet installed, run install.packages("X") once, then load it in each session with library(X).
By the end of this tutorial, you will be able to:
- Check data for accuracy and fix typos and out-of-range values.
- Identify patterns of missing data and replace missing values using multiple imputation with mice.
- Detect univariate and multivariate outliers using Z-scores and Mahalanobis distance.
# Install if needed
# install.packages(c("tidyverse", "rio", "mice", "VIM", "knitr", "kableExtra", "ggplot2", "gridExtra"))
library(tidyverse)
library(ggplot2) # For data visualization
library(gridExtra) # For arranging multiple plots
library(rio) # For data import
library(mice) # For multiple imputation
library(VIM) # For visualizing missing data
library(knitr)
library(kableExtra)
cat("\n=== REPRODUCIBLE SEED INFORMATION ===")
##
## === REPRODUCIBLE SEED INFORMATION ===
##
## Generator Name: Week06 Data Screening - ANLY500
##
## MD5 Hash: 4476fe6d5a95542e0ac57b95797b6709
##
## Available Seeds: -803686227, -78650771, -915116207, -338502455, 574023479 ...
Data screening is the process of inspecting a dataset for errors, missing values, and outliers before analysis. As Field et al. note, screening is crucial because issues like outliers can “bias the mean and inflate the standard deviation” [1], and missing data can arise from various sources (e.g., participants accidentally missing questions) [1]. Field explicitly refers to Tabachnick and Fidell as the “definitive guide to screening data” [1].
Data screening is a key facet of any analysis: if the data are incorrect, the output will be useless. Screening ensures results are interpretable and valid. Our checks should always include accuracy (typos and out-of-range values), missing data, and outliers.
Normally, our \(\alpha\) value is set to 0.05 (Type 1 Error), meaning we use \(p < .05\) as a criterion for statistical significance. However, in data screening, we want to use a stricter criterion to avoid over-cleaning or incorrectly flagging data. Therefore, we use \(p < .001\) to denote problematic issues [1].
We follow this specific order (accuracy, then missing data, then outliers) because each step can affect the next: fixing typos creates new missing values, and replacing missing values changes which cases look extreme.
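The \(p < .001\) screening criterion corresponds to the ±3.29 Z cutoff used later for outliers. A quick base-R check (a sketch; this calculation is not part of the original code):

```r
# Two-tailed critical z value for alpha = .001
alpha <- 0.001
z_cut <- qnorm(1 - alpha / 2)
round(z_cut, 2)  # 3.29
```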
We will use a dataset that examines the resiliency (RS columns) of teenagers after a natural disaster. While we haven’t covered specific analyses yet, we can screen this data overall.
# Import the dataset
# Note: Ensure "data/data_screening.csv" exists in your working directory or adjust the path
# If the file is not available, we will simulate a similar structure for this tutorial
if(file.exists("data/data_screening.csv")) {
master <- import("data/data_screening.csv")
} else {
# Simulation for tutorial purposes if the file is missing
set.seed(seeds[2]) # 'seeds' is the vector of reproducible seeds listed above
n <- 100
master <- data.frame(
Sex = sample(c(1, 2, 3), n, replace = TRUE, prob = c(0.45, 0.45, 0.1)), # 3 is typo
SES = sample(c(1, 2, 3), n, replace = TRUE),
Grade = sample(8:14, n, replace = TRUE), # >12 is typo
Absences = sample(0:20, n, replace = TRUE), # >15 is typo
RS1 = sample(1:7, n, replace = TRUE),
RS2 = sample(1:7, n, replace = TRUE),
RS3 = sample(c(1:7, 99), n, replace = TRUE, prob = c(rep(0.14, 7), 0.02)) # 99 is outlier/typo
)
# Introduce some missingness
master[sample(1:n, 10), "RS1"] <- NA
master[sample(1:n, 5), "SES"] <- NA
}
str(master)
## 'data.frame': 137 obs. of 20 variables:
## $ Sex : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : int 17 16 13 15 16 15 12 14 13 18 ...
## $ SES : int 2 3 2 3 2 3 3 3 1 3 ...
## $ Grade : int 11 7 5 6 11 7 4 6 4 9 ...
## $ Absences: int 2 2 2 2 6 2 1 2 2 5 ...
## $ RS1 : int 6 7 5 5 7 7 1 2 6 7 ...
## $ RS2 : int 4 1 NA 5 4 7 6 3 6 6 ...
## $ RS3 : int 2 1 5 5 4 7 NA 2 7 5 ...
## $ RS4 : int 2 5 7 7 7 7 1 3 6 NA ...
## $ RS5 : int 4 7 4 5 4 7 4 2 1 6 ...
## $ RS6 : int 7 7 6 6 4 7 7 3 4 6 ...
## $ RS7 : int 7 7 7 5 7 7 7 3 6 10 ...
## $ RS8 : int 4 7 6 7 7 7 NA 3 6 7 ...
## $ RS9 : int 5 4 6 6 4 7 NA 2 2 7 ...
## $ RS10 : int 7 7 7 7 7 7 4 2 5 7 ...
## $ RS11 : int 4 1 NA 6 7 7 3 3 6 5 ...
## $ RS12 : int 7 7 5 6 4 7 7 3 6 6 ...
## $ RS13 : int 4 4 7 6 7 7 7 2 6 6 ...
## $ RS14 : int 7 7 6 6 4 7 7 3 2 6 ...
## $ Health : int 6 6 2 2 6 4 3 6 1 5 ...
We need to check for factor variables incorrectly imported as numbers
and out-of-range values. Let’s check Sex and
SES.
Before fixing, let’s visualize the Sex variable to see
the error. We expect only two categories (e.g., 1 and 2), but a third
bar might appear.
p1 <- ggplot(master, aes(x = factor(Sex))) +
geom_bar(fill = "steelblue", color = "black") +
labs(title = "Before Cleaning: Sex",
subtitle = "Note the invalid category '3'",
x = "Sex Code",
y = "Count") +
theme_minimal()
p1
Interpretation: You can clearly see a bar for “3”. If our codebook says 1 = Women and 2 = Men, then “3” is a typo! This visual confirmation is faster than reading raw tables.
# Count each value in the Sex and SES columns
apply(master[ , c("Sex", "SES")], 2, table)
## Sex SES
## 1 64 9
## 2 72 56
## 3 1 72
Explanation:
- apply(X, 2, table): this function takes our data (X), looks at columns (2), and runs the table() function on each one.
- It’s a quick way to get counts for multiple columns at once.
We can fix this by using factor() and excluding the bad
label. Values not listed in levels will automatically
become NA.
# Work on a copy so the raw data stays intact
notypos <- master

# Fix categorical labels and drop typos
notypos$Sex <- factor(notypos$Sex,
levels = c(1,2), # We intentionally omit 3
labels = c("Women", "Men"))
notypos$SES <- factor(notypos$SES,
levels = c(1,2, 3),
labels = c("Low", "Medium", "High"))
# Let's verify with a plot
p2 <- ggplot(notypos, aes(x = Sex)) +
geom_bar(fill = "forestgreen", color = "black") +
labs(title = "After Cleaning: Sex",
subtitle = "Invalid categories removed (became NA)",
x = "Sex",
y = "Count") +
theme_minimal()
grid.arrange(p1, p2, ncol = 2)
Visual Check: The “Before” plot shows the error (3). The “After” plot shows only valid categories (Women, Men). The typos didn’t disappear; they became NA (missing values), which we will deal with later.
Use summary() to check min/max values. For example, if
the RS scale runs from 1 to 7, any value outside this range (like 99) is
a typo.
## Sex Age SES Grade Absences
## Women:64 Min. :11.00 Low : 9 Min. : 1.000 Min. : 1.000
## Men :72 1st Qu.:13.00 Medium:56 1st Qu.: 4.000 1st Qu.: 2.000
## NA's : 1 Median :15.50 High :72 Median : 6.000 Median : 2.000
## Mean :15.05 Mean : 5.883 Mean : 3.358
## 3rd Qu.:17.00 3rd Qu.: 8.000 3rd Qu.: 5.000
## Max. :18.00 Max. :35.000 Max. :35.000
## NA's :5
## RS1 RS2 RS3 RS4 RS5
## Min. :1.000 Min. :1.000 Min. : 1.00 Min. :1.00 Min. :1.000
## 1st Qu.:4.000 1st Qu.:4.000 1st Qu.: 3.00 1st Qu.:3.00 1st Qu.:3.000
## Median :5.000 Median :5.000 Median : 4.00 Median :5.00 Median :4.000
## Mean :4.858 Mean :4.962 Mean : 4.42 Mean :4.55 Mean :4.448
## 3rd Qu.:6.000 3rd Qu.:6.000 3rd Qu.: 6.00 3rd Qu.:7.00 3rd Qu.:6.750
## Max. :7.000 Max. :7.000 Max. :18.00 Max. :7.00 Max. :7.000
## NA's :3 NA's :5 NA's :6 NA's :6 NA's :3
## RS6 RS7 RS8 RS9 RS10
## Min. :1 Min. : 1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4 1st Qu.: 4.000 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:4.000
## Median :5 Median : 5.000 Median :5.000 Median :5.000 Median :5.500
## Mean :5 Mean : 5.082 Mean :4.764 Mean :4.692 Mean :5.313
## 3rd Qu.:7 3rd Qu.: 7.000 3rd Qu.:7.000 3rd Qu.:6.000 3rd Qu.:7.000
## Max. :7 Max. :10.000 Max. :7.000 Max. :7.000 Max. :7.000
## NA's :7 NA's :3 NA's :10 NA's :7 NA's :3
## RS11 RS12 RS13 RS14
## Min. :1.000 Min. :1.000 Min. : 1.000 Min. :1.00
## 1st Qu.:3.000 1st Qu.:5.000 1st Qu.: 4.000 1st Qu.:4.00
## Median :5.000 Median :6.000 Median : 6.000 Median :5.00
## Mean :4.641 Mean :5.552 Mean : 5.507 Mean :4.97
## 3rd Qu.:6.500 3rd Qu.:7.000 3rd Qu.: 7.000 3rd Qu.:7.00
## Max. :7.000 Max. :7.000 Max. :15.000 Max. :7.00
## NA's :6 NA's :3 NA's :3 NA's :5
## Health
## Min. :1.000
## 1st Qu.:2.000
## Median :4.000
## Mean :3.839
## 3rd Qu.:6.000
## Max. :7.000
##
Let’s look at RS3. We suspect a typo: its maximum (18) is well outside the valid 1-7 range.
# Histogram to see distribution shape
p3 <- ggplot(notypos, aes(x = RS3)) +
geom_histogram(fill = "orange", color = "black", bins = 30) +
labs(title = "Histogram of RS3",
subtitle = "Note the gap and the out-of-range value",
x = "RS3 Score", y = "Frequency") +
theme_minimal()
# Boxplot to see outliers clearly
p4 <- ggplot(notypos, aes(y = RS3)) +
geom_boxplot(fill = "orange", alpha = 0.5) +
labs(title = "Boxplot of RS3",
subtitle = "The dot at top is the extreme value",
y = "RS3 Score") +
theme_minimal()
grid.arrange(p3, p4, ncol = 2)
Interpretation:
- Histogram (Left): Most of the data is clustered between 1 and 7, with an isolated bar far to the right. This is suspicious.
- Boxplot (Right): The box is “squashed” at the bottom, and a single dot sits far above it. This confirms we have an out-of-range value that needs fixing.
How do we fix these?
1. Find the original data and figure out what the point should be.
2. Delete that data point (set it to NA) but not the entire participant.

Example: Using Logical Subsetting. Suppose Grade cannot be greater than 12 and Absences cannot be greater than 15.
summary(notypos$Grade)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 6.000 5.883 8.000 35.000
# Code Explanation: "In the Grade column, find values > 12, and set them to NA"
notypos$Grade[ notypos$Grade > 12 ] <- NA
summary(notypos$Grade)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 4.000 6.000 5.669 8.000 11.000 1
# Same approach for Absences: values > 15 are typos
summary(notypos$Absences)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.000 3.358 5.000 35.000
notypos$Absences[ notypos$Absences > 15 ] <- NA
summary(notypos$Absences)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 2.000 2.000 3.125 5.000 7.000 1
Example: Multiple Columns If multiple columns (e.g., RS items) have the same range (1-7), we can clean them all at once.
# Find all columns whose names contain "RS"
# (in the imported data these are RS1 through RS14)
rs_cols <- grep("RS", names(notypos))
# Check before
summary(notypos[ , rs_cols])
## RS1 RS2 RS3 RS4 RS5
## Min. :1.000 Min. :1.000 Min. : 1.00 Min. :1.00 Min. :1.000
## 1st Qu.:4.000 1st Qu.:4.000 1st Qu.: 3.00 1st Qu.:3.00 1st Qu.:3.000
## Median :5.000 Median :5.000 Median : 4.00 Median :5.00 Median :4.000
## Mean :4.858 Mean :4.962 Mean : 4.42 Mean :4.55 Mean :4.448
## 3rd Qu.:6.000 3rd Qu.:6.000 3rd Qu.: 6.00 3rd Qu.:7.00 3rd Qu.:6.750
## Max. :7.000 Max. :7.000 Max. :18.00 Max. :7.00 Max. :7.000
## NA's :3 NA's :5 NA's :6 NA's :6 NA's :3
## RS6 RS7 RS8 RS9 RS10
## Min. :1 Min. : 1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4 1st Qu.: 4.000 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:4.000
## Median :5 Median : 5.000 Median :5.000 Median :5.000 Median :5.500
## Mean :5 Mean : 5.082 Mean :4.764 Mean :4.692 Mean :5.313
## 3rd Qu.:7 3rd Qu.: 7.000 3rd Qu.:7.000 3rd Qu.:6.000 3rd Qu.:7.000
## Max. :7 Max. :10.000 Max. :7.000 Max. :7.000 Max. :7.000
## NA's :7 NA's :3 NA's :10 NA's :7 NA's :3
## RS11 RS12 RS13 RS14
## Min. :1.000 Min. :1.000 Min. : 1.000 Min. :1.00
## 1st Qu.:3.000 1st Qu.:5.000 1st Qu.: 4.000 1st Qu.:4.00
## Median :5.000 Median :6.000 Median : 6.000 Median :5.00
## Mean :4.641 Mean :5.552 Mean : 5.507 Mean :4.97
## 3rd Qu.:6.500 3rd Qu.:7.000 3rd Qu.: 7.000 3rd Qu.:7.00
## Max. :7.000 Max. :7.000 Max. :15.000 Max. :7.00
## NA's :6 NA's :3 NA's :3 NA's :5
# Apply fix: any value > 7 in these columns becomes NA
notypos[ , rs_cols][ notypos[ , rs_cols] > 7 ] <- NA
# Check after
summary(notypos[ , rs_cols])
## RS1 RS2 RS3 RS4 RS5
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:4.000 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:3.00 1st Qu.:3.000
## Median :5.000 Median :5.000 Median :4.000 Median :5.00 Median :4.000
## Mean :4.858 Mean :4.962 Mean :4.315 Mean :4.55 Mean :4.448
## 3rd Qu.:6.000 3rd Qu.:6.000 3rd Qu.:6.000 3rd Qu.:7.00 3rd Qu.:6.750
## Max. :7.000 Max. :7.000 Max. :7.000 Max. :7.00 Max. :7.000
## NA's :3 NA's :5 NA's :7 NA's :6 NA's :3
## RS6 RS7 RS8 RS9 RS10
## Min. :1 Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:4.000
## Median :5 Median :5.000 Median :5.000 Median :5.000 Median :5.500
## Mean :5 Mean :5.008 Mean :4.764 Mean :4.692 Mean :5.313
## 3rd Qu.:7 3rd Qu.:7.000 3rd Qu.:7.000 3rd Qu.:6.000 3rd Qu.:7.000
## Max. :7 Max. :7.000 Max. :7.000 Max. :7.000 Max. :7.000
## NA's :7 NA's :5 NA's :10 NA's :7 NA's :3
## RS11 RS12 RS13 RS14
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.00
## 1st Qu.:3.000 1st Qu.:5.000 1st Qu.:4.000 1st Qu.:4.00
## Median :5.000 Median :6.000 Median :6.000 Median :5.00
## Mean :4.641 Mean :5.552 Mean :5.364 Mean :4.97
## 3rd Qu.:6.500 3rd Qu.:7.000 3rd Qu.:7.000 3rd Qu.:7.00
## Max. :7.000 Max. :7.000 Max. :7.000 Max. :7.00
## NA's :6 NA's :3 NA's :5 NA's :5
Use summary() or apply() to count NA values in each column.
# Count missing values per column
apply(notypos, 2, function(x){ sum(is.na(x)) })
## Sex Age SES Grade Absences RS1 RS2 RS3
## 1 5 0 1 1 3 5 7
## RS4 RS5 RS6 RS7 RS8 RS9 RS10 RS11
## 6 3 7 5 10 7 3 6
## RS12 RS13 RS14 Health
## 3 5 5 0
Understanding why data is missing is crucial [1]:
- MCAR (Missing Completely at Random): missingness is unrelated to any variable (e.g., a participant accidentally skips a question).
- MAR (Missing at Random): missingness is related to other observed variables, but not to the missing value itself.
- MNAR (Missing Not at Random): missingness is related to the value that is missing (e.g., people avoiding a sensitive question).

Quick Fix: For simple statistics, you can often use na.rm = TRUE (e.g., mean(x, na.rm = TRUE)) to ignore missing values without permanently deleting them [1].
The VIM package helps visualize patterns.
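The plots interpreted below are the kind produced by VIM’s aggr() function. A minimal sketch (the specific arguments here are illustrative, not from the original code):

```r
library(VIM)

# Left panel: proportion of missing values per variable
# Right panel: combinations of observed (blue) vs. missing (red) values
aggr(notypos, numbers = TRUE, sortVars = TRUE)
```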
Interpretation:
- Left Bar Chart: Shows the proportion of missing data in each variable.
- Right Grid: Shows the combinations of missingness. Blue is observed, red is missing.
- If red blocks line up across multiple variables, people who missed Question A also tended to miss Question B.
Check the percentage of missing data per participant.
# Create a custom function to calculate % missing
percentmiss <- function(x){ sum(is.na(x))/length(x) * 100 }
# Apply it to rows (margin = 1)
missing <- apply(notypos, 1, percentmiss)
table(missing)
## missing
## 0 5 10 15 20 70
## 108 17 4 4 1 3
We can create two subsets:
- Replaceable: Participants with \(\le 5\%\) missing data.
- Not Replaceable: Participants with \(> 5\%\) missing data (often excluded or analyzed separately).
replace_rows <- subset(notypos, missing <= 5)
noreplace_rows <- subset(notypos, missing > 5)
cat("Original N:", nrow(notypos), "\n")
## Original N: 137
cat("Rows to replace:", nrow(replace_rows), "\n")
## Rows to replace: 125
cat("Rows dropped/kept separate:", nrow(noreplace_rows), "\n")
## Rows dropped/kept separate: 12
Replacing Missing Values with mice
We do not replace missing data for categorical/demographic variables (Sex, SES, Grade). We use them to help estimate missing values in other variables.
What is MICE? Multivariate Imputation by Chained Equations. It creates multiple “complete” datasets by guessing missing values based on relationships with other variables. It’s better than just plugging in the mean because it preserves the variance in the data.
# Run mice on our replaceable rows
# printFlag = FALSE keeps the output clean
temp_no_miss <- mice(replace_rows, printFlag = FALSE, seed = 123)
Now, complete the dataset (pick one of the imputations, e.g., #1).
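The completed data set is extracted with mice’s complete() function. A sketch producing the all_columns object used below (the object name comes from the later code; the exact call is an assumption):

```r
# Extract one completed dataset (imputation #1) from the mice object
all_columns <- complete(temp_no_miss, 1)

# Verify that no missing values remain
sum(is.na(all_columns))
```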
It is critical to verify that imputing data didn’t distort our variables. The distribution should look roughly the same before and after.
# Combine data for plotting
before <- data.frame(Value = replace_rows$RS1, Type = "Original (with NA)")
after <- data.frame(Value = all_columns$RS1, Type = "Imputed (Complete)")
comparison <- rbind(before, after)
ggplot(comparison, aes(x = Value, fill = Type)) +
geom_density(alpha = 0.5) +
labs(title = "Density Plot: RS1 Before vs. After Imputation",
subtitle = "Curves should be similar. Drastic changes indicate bad imputation.",
x = "RS1 Score",
y = "Density") +
theme_minimal() +
scale_fill_manual(values = c("blue", "red"))
Interpretation: If the red and blue curves overlap significantly, your imputation is good! It means you filled in the holes without changing the “story” of the data.
Definition: A case with an extreme value on one or multiple variables. Why remove them? The mean is sensitive to outliers. We use a strict criterion (\(p < .001\)) to remove only very extreme scores.
For a single variable, we can use Z-scores: \(Z = \frac{X - \bar{X}}{s}\).
- Concept: A Z-score tells you how many standard deviations a score is from the mean.
- \(Z = 0\): Exactly average.
- \(Z = +1\): One standard deviation above average.
- \(Z = +3.29\): Extremely high (less than a 0.1% chance).
- Criterion: \(Z > 3.29\) or \(Z < -3.29\) (corresponding to \(p < .001\)) [1].
# Calculate Z-scores for RS1 (as an example)
# scale() function calculates Z-scores
all_columns$RS1_z <- scale(all_columns$RS1)
# Check for outliers
summary(all_columns$RS1_z)
## V1
## Min. :-2.15130
## 1st Qu.:-0.50153
## Median : 0.04839
## Mean : 0.00000
## 3rd Qu.: 1.14824
## Max. : 1.14824
ggplot(all_columns, aes(x = RS1_z)) +
geom_histogram(bins = 20, fill = "purple", color = "black") +
geom_vline(xintercept = c(-3.29, 3.29), linetype = "dashed", color = "red", linewidth = 1) +
labs(title = "Histogram of Z-scores (RS1)",
subtitle = "Red dashed lines indicate the +/- 3.29 cutoff",
x = "Z-score",
y = "Count") +
theme_minimal()
Interpretation: Any bars outside the red dashed lines represent outliers that should be removed or investigated.
For multiple continuous variables, we measure the distance of a case from the centroid (mean of means).
Concept: Imagine a cloud of points in 3D space. The “centroid” is the middle of the cloud. Mahalanobis distance measures how far a point is from this middle, accounting for the shape of the cloud.
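The distances and cutoff reported below are typically computed with base R’s mahalanobis() and qchisq(). A sketch, where df_cont is a hypothetical data frame holding the continuous columns being screened (four columns here, matching the df = 4 output):

```r
# Mahalanobis distance of each case from the centroid
# 'df_cont' is a stand-in name for the continuous columns being screened
mahal <- mahalanobis(df_cont,
                     center = colMeans(df_cont),
                     cov = cov(df_cont))

# Chi-square cutoff at p < .001, with df = number of variables
cutoff <- qchisq(1 - .001, df = ncol(df_cont))
cat("Degrees of Freedom:", ncol(df_cont), "\n")
cat("Cutoff Value:", cutoff, "\n")
```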
## Degrees of Freedom: 4
## Cutoff Value: 18.46683
# Check how many cases fall below the cutoff
# TRUE means the case is within the cutoff (good); FALSE flags an outlier
summary(mahal < cutoff)
## Mode TRUE
## logical 125
Let’s plot the Mahalanobis distances to see who the “troublemakers” are.
# Create a data frame for plotting
outlier_plot <- data.frame(
ID = 1:length(mahal),
Distance = mahal,
Outlier = mahal > cutoff
)
ggplot(outlier_plot, aes(x = ID, y = Distance, color = Outlier)) +
geom_point(size = 3, alpha = 0.7) +
geom_hline(yintercept = cutoff, linetype = "dashed", color = "red") +
geom_text(aes(label = ifelse(Outlier, ID, "")), vjust = -0.5, show.legend = FALSE) +
labs(title = "Mahalanobis Distances",
subtitle = paste("Points above the red line (Cutoff =", round(cutoff, 2), ") are multivariate outliers"),
y = "Distance (D^2)",
x = "Participant ID") +
theme_minimal() +
scale_color_manual(values = c("black", "red"))
Interpretation:
- Points below the red line are safe.
- Points above the red line are multivariate outliers; their IDs are labeled so you can find them. (In this run, no case exceeds the cutoff.)
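The screened data set named in the summary below (noout) comes from dropping any flagged cases. A minimal sketch, assuming the mahal and cutoff objects from above:

```r
# Keep only cases whose Mahalanobis distance is below the cutoff
noout <- subset(all_columns, mahal < cutoff)
nrow(noout)
```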
In this tutorial, we have covered the essential first steps of data analysis:
1. Accuracy: Checked for and fixed typos and impossible values.
2. Missing Data: Identified missingness patterns and used multiple imputation.
3. Outliers: Applied a strict statistical criterion (\(p < .001\)) to identify univariate and multivariate outliers.
Your dataset noout is now screened and ready for
assumption checking and analysis!
[1] A. Field, J. Miles, and Z. Field, Discovering Statistics Using R. London: SAGE Publications, 2012.