Welcome to your first week learning R, a powerful programming language for data analysis and statistical computing. This course is designed to help you become comfortable with R even if you’ve never programmed before. We’ll start with the fundamentals and build your skills step by step.
Key term – R: R is both a programming language and an environment for statistical computing and graphics. Originally developed by Ross Ihaka and Robert Gentleman at the University of Auckland, R has become one of the most widely used tools in data analytics (R Core Team, 2024).
By the end of this week, you will be able to perform basic arithmetic in R, create and name objects, distinguish R's fundamental data types and structures, import and explore datasets, and identify and handle missing values.
R offers several advantages for data analytics: it is free and open source, it has thousands of contributed packages covering nearly every statistical method, it produces publication-quality graphics, and because analyses are written as scripts, they are fully reproducible.
Key concept – Reproducible Research: Field, Miles, and Field (2012, ch. 3) emphasize that reproducible research allows others to verify your findings and build upon your work. Using R scripts instead of point-and-click software creates an auditable trail of your analytical decisions.
RStudio has four main panes: the Source editor (where you write and save scripts), the Console (where code runs), the Environment/History pane (which lists the objects you have created), and the Files/Plots/Packages/Help pane.
In Plain English: Think of R as a super-powered calculator. Before we analyze data, let’s see how R handles basic math - the same operations you’d use on a regular calculator, just typed as text.
R can perform all basic arithmetic operations:
# Inputs reconstructed to match the outputs shown
5 + 3 # Addition
## [1] 8
10 - 4 # Subtraction
## [1] 6
6 * 7 # Multiplication
## [1] 42
10 / 2 # Division
## [1] 5
2^8 # Exponentiation
## [1] 256
17 %% 5 # Modulo (remainder)
## [1] 2
2^5 # Exponentiation again
## [1] 32
Why This Matters: Understanding these basic operations is essential because all statistical calculations (means, standard deviations, correlations) use these same mathematical operations behind the scenes.
Key concept – Order of Operations: Like in mathematics, R follows PEMDAS (Parentheses, Exponents, Multiplication/Division, Addition/Subtraction). Use parentheses to control calculation order (Field et al., 2012, ch. 3).
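The difference parentheses make is easy to see directly in the console (a minimal illustration; these particular expressions are not from the lesson):

```r
# Without parentheses, R multiplies before adding (PEMDAS)
2 + 3 * 4      # evaluates as 2 + (3 * 4)
## [1] 14

# Parentheses force the addition to happen first
(2 + 3) * 4
## [1] 20

# Exponents bind tighter than unary minus, a common surprise
-2^2           # evaluates as -(2^2)
## [1] -4
(-2)^2
## [1] 4
```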
In Plain English: Instead of getting results that disappear, we can save them with names (called “objects” or “variables”). Think of it like putting your calculator result in a labeled box so you can use it later. The arrow <- means “put this value into that box.”
In R, we store values in objects using the assignment operator <-:
# Code reconstructed to match the outputs shown
my_number <- 15
my_number
## [1] 15
my_total <- 50
my_total
## [1] 50
my_labels <- c("A", "B", "name")
my_labels
## [1] "A" "B" "name"
count <- 5
count
## [1] 5
course <- "R Programming"
course
## [1] "R Programming"
Visual Example: Let’s see how objects work with a simple visual:
# Create some objects
apples <- 5
oranges <- 3
# Combine them
total_fruit <- apples + oranges
cat("Apples:", apples, "\n")
## Apples: 5
cat("Oranges:", oranges, "\n")
## Oranges: 3
cat("Total fruit:", total_fruit, "\n")
## Total fruit: 8
# Create a simple bar chart to visualize
barplot(c(apples, oranges, total_fruit),
names.arg = c("Apples", "Oranges", "Total"),
col = c("red", "orange", "purple"),
main = "Fruit Counts (Stored in Objects)",
ylab = "Number of Items",
ylim = c(0, 10))

Follow these rules for naming objects:

- Use letters, numbers, periods (.), and underscores (_)
- Choose descriptive names: student_age is better than x
- Pick one style and stick with it, such as snake_case or camelCase
- Avoid special characters (@, #, $, etc.)
- Remember that R is case-sensitive: data and Data are different!

In Plain English: Just like we categorize things in real life (counting numbers, names, yes/no questions), R recognizes different types of data. The type tells R how to handle and display the data. Getting the type right is crucial: you can’t calculate an average of names or sort numbers alphabetically (well, you can, but it won’t make sense!).
R recognizes several fundamental data types:
# Check types with class(); calls reconstructed to match the outputs
class(42.5)
## [1] "numeric"
class(10)
## [1] "numeric"
class("Hello")
## [1] "character"
class(TRUE)
## [1] "logical"
class(100L)
## [1] "integer"
Visual Comparison of Data Types:
# Create examples of each type
my_number <- 42.5
my_text <- "Hello"
my_logical <- TRUE
my_integer <- 100L
# Display them in a comparison table
comparison_data <- data.frame(
Type = c("Numeric", "Character", "Logical", "Integer"),
Example = c("42.5", '"Hello"', "TRUE", "100L"),
UsedFor = c("Measurements", "Names/Labels", "Yes/No", "Counts"),
CanDoMath = c("Yes", "No", "Sort of", "Yes")
)
knitr::kable(comparison_data,
caption = "Understanding R's Basic Data Types",
align = c('l', 'l', 'l', 'c'))

| Type | Example | UsedFor | CanDoMath |
|---|---|---|---|
| Numeric | 42.5 | Measurements | Yes |
| Character | “Hello” | Names/Labels | No |
| Logical | TRUE | Yes/No | Sort of |
| Integer | 100L | Counts | Yes |
Why Data Types Matter:
# This works (both values are numeric; inputs reconstructed to match the outputs)
5 + 3
## [1] 8
# This doesn't work as expected (mixing types)
tryCatch({
  "5" + 3 # Text "5" plus number 3
}, error = function(e) {
  cat("ERROR:", conditionMessage(e), "\n\n")
})
## ERROR: non-numeric argument to binary operator
# Fix: convert the text to a number first
as.numeric("5") + 3
## [1] 8
# Visual: Show what happens with wrong types
par(mfrow = c(1, 2))
# Correct: Numeric data in histogram
correct_data <- c(5, 10, 15, 20, 25)
hist(correct_data,
main = "Correct: Numeric Data",
xlab = "Values",
col = "lightgreen",
border = "white")
# Wrong type creates problems
mixed_data <- c("5", "10", "15", "20", "25")
cat("Trying to plot text data doesn't work well:\n")
## Trying to plot text data doesn't work well:
cat("Data type:", class(mixed_data), "\n") # call reconstructed to match the output
## Data type: character
You can convert between types using as.*() functions:
# Calls reconstructed to match the outputs shown
class(as.character(42.5))
## [1] "character"
class(as.numeric("3.14"))
## [1] "numeric"
as.logical(1)
## [1] TRUE
as.logical(0)
## [1] FALSE
R has special values for unusual situations:
# Examples reconstructed to match the outputs shown
is.na(NA) # NA marks a missing value
## [1] TRUE
is.null(NULL) # NULL is the empty object
## [1] TRUE
1 / 0 # Division by zero gives infinity
## [1] Inf
is.nan(0 / 0) # 0/0 is "not a number"
## [1] TRUE
Factors represent categorical data:
# Create a factor
species <- factor(c("Adelie", "Gentoo", "Chinstrap", "Adelie", "Gentoo"))
print(species)
## [1] Adelie Gentoo Chinstrap Adelie Gentoo
## Levels: Adelie Chinstrap Gentoo
# Show the levels (call reconstructed to match the output)
levels(species)
## [1] "Adelie" "Chinstrap" "Gentoo"
# Specify level order (useful for ordinal data)
satisfaction <- factor(
c("Low", "High", "Medium", "Low", "High"),
levels = c("Low", "Medium", "High"),
ordered = TRUE
)
print(satisfaction)
## [1] Low High Medium Low High
## Levels: Low < Medium < High
Key concept – Factors: Field, Miles, and Field (2012, ch. 3) explain that factors are essential for categorical variables in R. Unlike character strings, factors store levels, which is memory-efficient and necessary for many statistical models.
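One classic factor pitfall is worth a quick sketch (example values are illustrative, not from the lesson): calling as.numeric() directly on a factor returns the internal level codes, not the original numbers.

```r
# A factor that *looks* numeric
f <- factor(c("10", "20", "30", "20"))

as.numeric(f)                # returns the level codes, not the values!
## [1] 1 2 3 2

as.numeric(as.character(f))  # convert to character first to recover the numbers
## [1] 10 20 30 20
```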
In Plain English: Data structures are different ways to organize information, like different types of containers. You wouldn’t store soup in a paper bag or carry rocks in a colander - similarly, we choose the right data structure for our data. Vectors are like a row of boxes, matrices are like spreadsheets, data frames are like tables where columns can hold different kinds of things, and lists are like filing cabinets that can hold anything.
What is a Vector? A vector is the simplest data structure - think of it as a single row or column of values, where all values are the same type (all numbers, or all text, etc.).
Vectors are collections of elements of the same type:
# Code reconstructed to match the outputs shown
ages <- c(23, 25, 31, 28, 26)
ages
## [1] 23 25 31 28 26
names_vec <- c("Alice", "Bob", "Carol", "David", "Eve")
names_vec
## [1] "Alice" "Bob" "Carol" "David" "Eve"
passed <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
passed
## [1] TRUE TRUE FALSE TRUE FALSE
# Sequences
seq1 <- 1:10 # 1 to 10
seq2 <- seq(0, 100, by = 10) # 0 to 100 by 10
rep1 <- rep(5, times = 3) # Repeat 5 three times
print(seq1)
## [1] 1 2 3 4 5 6 7 8 9 10
print(seq2)
## [1] 0 10 20 30 40 50 60 70 80 90 100
print(rep1)
## [1] 5 5 5
Visualizing Vectors:
# Create a numeric vector
test_scores <- c(85, 92, 78, 90, 88, 95, 82, 87, 91, 86)
# Show multiple views
par(mfrow = c(2, 2))
# 1. Simple plot
plot(test_scores,
type = "b", # both points and lines
main = "Test Scores (as a Sequence)",
xlab = "Student Number",
ylab = "Score",
col = "blue",
pch = 19,
ylim = c(75, 100))
# 2. Histogram
hist(test_scores,
main = "Distribution of Scores",
xlab = "Score",
col = "lightblue",
border = "white",
breaks = 5)
# 3. Boxplot
boxplot(test_scores,
main = "Score Summary (Boxplot)",
ylab = "Score",
col = "lightgreen",
horizontal = FALSE)
# 4. Barplot
barplot(test_scores,
names.arg = 1:length(test_scores),
main = "Individual Student Scores",
xlab = "Student",
ylab = "Score",
col = rainbow(length(test_scores)))
# Print summary statistics (cat calls reconstructed to match the output)
cat("\nSummary Statistics:\n")
cat("Mean:", mean(test_scores), "\n")
cat("Median:", median(test_scores), "\n")
cat("Minimum:", min(test_scores), "\n")
cat("Maximum:", max(test_scores), "\n")
##
## Summary Statistics:
## Mean: 87.4
## Median: 87.5
## Minimum: 78
## Maximum: 95
# Vectorized arithmetic (inputs reconstructed to match the outputs)
v1 <- c(1, 2, 3, 4, 5)
v2 <- c(10, 20, 30, 40, 50)
v1 + v2 # Element-wise addition
## [1] 11 22 33 44 55
v1 * v2 # Element-wise multiplication
## [1] 10 40 90 160 250
v1^2 # Element-wise squaring
## [1] 1 4 9 16 25
mean(v1)
## [1] 3
sum(v1)
## [1] 15
sd(v1)
## [1] 1.581139
length(v1)
## [1] 5
In Plain English: Subsetting means “picking out specific pieces.” Imagine you have a row of 10 boxes numbered 1-10. Subsetting lets you say “give me box 3” or “give me all boxes with values greater than 50.” This is one of the most powerful features in R!
# Subsetting with [ ] (inputs reconstructed to match the outputs)
scores <- c(85, 92, 78, 90, 88)
scores[1] # First element
## [1] 85
scores[3] # Third element
## [1] 78
scores[c(1, 3, 5)] # Elements 1, 3, and 5
## [1] 85 78 88
scores[c(2, 4, 5)] # Elements 2, 4, and 5
## [1] 92 90 88
scores[-c(2, 3)] # Everything except elements 2 and 3
## [1] 85 90 88
scores[-2] # Everything except element 2
## [1] 85 78 90 88
Visualizing Subsetting:
# Create example data
all_scores <- c(72, 85, 92, 78, 90, 88, 95, 82)
student_names <- c("Amy", "Ben", "Chris", "Dana", "Eve", "Frank", "Grace", "Henry")
# Show all scores
par(mfrow = c(2, 2))
# 1. All data
barplot(all_scores,
names.arg = student_names,
main = "All Students",
ylab = "Score",
col = "lightgray",
las = 2) # Rotate labels
abline(h = 85, col = "red", lty = 2, lwd = 2)
text(4, 87, "Pass threshold = 85", col = "red")
# 2. Subset: Scores > 85
high_scores <- all_scores[all_scores > 85]
high_names <- student_names[all_scores > 85]
barplot(high_scores,
names.arg = high_names,
main = "Only High Scorers (> 85)",
ylab = "Score",
col = "lightgreen",
las = 2)
# 3. Subset: First 3 students
barplot(all_scores[1:3],
names.arg = student_names[1:3],
main = "First 3 Students",
ylab = "Score",
col = "lightblue",
las = 2)
# 4. Subset: Specific positions
selected <- c(2, 5, 7)
barplot(all_scores[selected],
names.arg = student_names[selected],
main = "Selected Students (positions 2, 5, 7)",
ylab = "Score",
col = "lightyellow",
las = 2)

In Plain English: A matrix is like a spreadsheet where EVERY cell contains the same type of data (all numbers, for example). Think of it as multiple vectors stacked together in rows and columns. Matrices are great for mathematical operations but limited because you can’t mix types.
Matrices are 2-dimensional arrays with rows and columns:
# Create a matrix
my_matrix <- matrix(
data = 1:12,
nrow = 4,
ncol = 3,
byrow = FALSE # Fill by column (default)
)
print(my_matrix)
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
# Matrix by row
matrix_byrow <- matrix(
data = 1:12,
nrow = 4,
ncol = 3,
byrow = TRUE
)
print(matrix_byrow)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## [4,] 10 11 12
# Dimensions (calls reconstructed to match the outputs)
dim(my_matrix)
## [1] 4 3
nrow(my_matrix)
## [1] 4
ncol(my_matrix)
## [1] 3
Visualizing Matrix Structure:
# Create a matrix with meaningful data
grades_matrix <- matrix(
c(85, 90, 78, # Row 1: Test scores for Student 1
92, 88, 95, # Row 2: Test scores for Student 2
78, 85, 82, # Row 3: Test scores for Student 3
90, 87, 91), # Row 4: Test scores for Student 4
nrow = 4,
ncol = 3,
byrow = TRUE
)
# Add labels
rownames(grades_matrix) <- c("Student1", "Student2", "Student3", "Student4")
colnames(grades_matrix) <- c("Test1", "Test2", "Test3")
print(grades_matrix)
## Test1 Test2 Test3
## Student1 85 90 78
## Student2 92 88 95
## Student3 78 85 82
## Student4 90 87 91
# Visualize the matrix
par(mfrow = c(1, 2))
# 1. Heatmap view
image(t(grades_matrix),
main = "Grade Matrix (Heatmap)",
xlab = "Tests",
ylab = "Students",
col = heat.colors(20),
axes = FALSE)
axis(1, at = seq(0, 1, length.out = 3), labels = colnames(grades_matrix))
axis(2, at = seq(0, 1, length.out = 4), labels = rownames(grades_matrix), las = 2)
# 2. Grouped barplot
barplot(t(grades_matrix),
beside = TRUE,
main = "Grades by Student and Test",
xlab = "Student",
ylab = "Score",
col = c("skyblue", "lightgreen", "salmon"),
legend.text = colnames(grades_matrix),
args.legend = list(x = "topright", cex = 0.8))
# Matrix subsetting (calls reconstructed to match the outputs)
my_matrix[2, 3] # Row 2, column 3
## [1] 10
my_matrix[1, ] # Entire first row
## [1] 1 5 9
my_matrix[, 2] # Entire second column
## [1] 5 6 7 8
my_matrix[1:2, 2:3] # Rows 1-2, columns 2-3
## [,1] [,2]
## [1,] 5 9
## [2,] 6 10
# Matrix arithmetic
mat1 <- matrix(1:4, nrow = 2)
mat2 <- matrix(5:8, nrow = 2)
mat1 + mat2 # Element-wise addition
## [,1] [,2]
## [1,] 6 10
## [2,] 8 12
mat1 * mat2 # Element-wise multiplication
## [,1] [,2]
## [1,] 5 21
## [2,] 12 32
mat1 %*% mat2 # True matrix multiplication
## [,1] [,2]
## [1,] 23 31
## [2,] 34 46
t(mat1) # Transpose
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
In Plain English: Data frames are the MOST IMPORTANT data structure you’ll use! Think of them like Excel spreadsheets - they have rows and columns, BUT unlike matrices, each column can be a different type (one column can be names, another can be ages, another can be yes/no values). Most real-world data comes in data frames.
Data frames are the most common structure for datasets:
# Create a data frame
students <- data.frame(
Name = c("Alice", "Bob", "Carol", "David"),
Age = c(20, 22, 21, 23),
Grade = c(85, 92, 78, 88),
Passed = c(TRUE, TRUE, TRUE, TRUE)
)
print(students)
## Name Age Grade Passed
## 1 Alice 20 85 TRUE
## 2 Bob 22 92 TRUE
## 3 Carol 21 78 TRUE
## 4 David 23 88 TRUE
# Inspect the structure and dimensions (calls reconstructed to match the outputs)
str(students)
## 'data.frame': 4 obs. of 4 variables:
## $ Name : chr "Alice" "Bob" "Carol" "David"
## $ Age : num 20 22 21 23
## $ Grade : num 85 92 78 88
## $ Passed: logi TRUE TRUE TRUE TRUE
dim(students)
## [1] 4 4
nrow(students)
## [1] 4
ncol(students)
## [1] 4
names(students)
## [1] "Name" "Age" "Grade" "Passed"
colnames(students)
## [1] "Name" "Age" "Grade" "Passed"
Visualizing Data Frame Concepts:
# Create a more detailed data frame
class_data <- data.frame(
Student = c("Alice", "Bob", "Carol", "David", "Eve", "Frank"),
Age = c(20, 22, 21, 23, 20, 22),
Grade = c(85, 92, 78, 88, 95, 82),
StudyHours = c(15, 20, 10, 18, 25, 12),
PassFail = c("Pass", "Pass", "Fail", "Pass", "Pass", "Pass")
)
# Show it as a nice table
knitr::kable(class_data,
caption = "Example Data Frame: Student Information",
align = 'c')

| Student | Age | Grade | StudyHours | PassFail |
|---|---|---|---|---|
| Alice | 20 | 85 | 15 | Pass |
| Bob | 22 | 92 | 20 | Pass |
| Carol | 21 | 78 | 10 | Fail |
| David | 23 | 88 | 18 | Pass |
| Eve | 20 | 95 | 25 | Pass |
| Frank | 22 | 82 | 12 | Pass |
# Multiple visualizations from ONE data frame
par(mfrow = c(2, 2))
# 1. Grades distribution
hist(class_data$Grade,
main = "Grade Distribution",
xlab = "Grade",
col = "lightblue",
border = "white",
breaks = 5)
abline(v = mean(class_data$Grade), col = "red", lwd = 2, lty = 2)
legend("topright", "Mean", col = "red", lty = 2, lwd = 2)
# 2. Study hours vs Grade
plot(class_data$StudyHours, class_data$Grade,
main = "Study Hours vs Grade",
xlab = "Study Hours per Week",
ylab = "Grade",
pch = 19,
col = "darkgreen",
cex = 1.5)
abline(lm(Grade ~ StudyHours, data = class_data), col = "red", lwd = 2)
# 3. Age distribution
barplot(table(class_data$Age),
main = "Age Distribution",
xlab = "Age",
ylab = "Number of Students",
col = "salmon")
# 4. Pass/Fail summary
pass_fail_count <- table(class_data$PassFail)
barplot(pass_fail_count,
main = "Pass/Fail Summary",
ylab = "Count",
col = c("red", "green"),
ylim = c(0, max(pass_fail_count) + 1))
# Print a quick summary (cat calls reconstructed to match the output)
cat("\n=== DATA FRAME SUMMARY ===\n")
cat("Number of students:", nrow(class_data), "\n")
cat("Number of variables:", ncol(class_data), "\n")
cat("Average grade:", round(mean(class_data$Grade), 1), "\n")
cat("Average study hours:", round(mean(class_data$StudyHours), 1), "\n")
##
## === DATA FRAME SUMMARY ===
## Number of students: 6
## Number of variables: 5
## Average grade: 86.7
## Average study hours: 16.7
# Accessing columns (calls reconstructed to match the outputs)
students$Name
## [1] "Alice" "Bob" "Carol" "David"
students$Grade
## [1] 85 92 78 88
students[["Grade"]]
## [1] 85 92 78 88
students$Grade[2]
## [1] 92
students[2, "Grade"]
## [1] 92
In Plain English: Filtering means “show me only the rows that meet certain conditions” - like using a filter in Excel. This is incredibly powerful for analyzing specific subsets of your data!
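As a minimal sketch of the idea before the larger example below (hypothetical data, not the lesson's dataset): build a logical condition on a column, then use it to pick rows.

```r
# A tiny example data frame
df <- data.frame(name = c("A", "B", "C"), score = c(70, 90, 85))

# Keep only rows where score is at least 85
df[df$score >= 85, ]
##   name score
## 2    B    90
## 3    C    85
```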
Visualizing Filtering Effects:
# Create sample data
all_students <- data.frame(
Name = c("Amy", "Ben", "Chris", "Dana", "Eve", "Frank", "Grace", "Henry"),
Score = c(72, 85, 92, 78, 90, 88, 95, 82),
Attendance = c(75, 90, 95, 80, 92, 88, 98, 85)
)
# Define filtering criteria
high_performers <- all_students[all_students$Score >= 85 & all_students$Attendance >= 88, ]
# Show the filtering visually
par(mfrow = c(1, 2))
# Before filtering
plot(all_students$Attendance, all_students$Score,
main = "All Students",
xlab = "Attendance %",
ylab = "Score",
pch = 19,
cex = 2,
col = "lightgray",
xlim = c(70, 100),
ylim = c(70, 100))
abline(h = 85, col = "red", lty = 2, lwd = 2)
abline(v = 88, col = "blue", lty = 2, lwd = 2)
text(75, 97, "Score ≥ 85", col = "red")
text(95, 73, "Attendance ≥ 88", col = "blue", srt = 90)
legend("bottomright",
legend = c("All students", "Criteria lines"),
col = c("lightgray", "red"),
pch = c(19, NA),
lty = c(NA, 2))
# After filtering
plot(high_performers$Attendance, high_performers$Score,
main = "High Performers Only\n(Score ≥ 85 AND Attendance ≥ 88)",
xlab = "Attendance %",
ylab = "Score",
pch = 19,
cex = 2,
col = "darkgreen",
xlim = c(70, 100),
ylim = c(70, 100))
text(high_performers$Attendance, high_performers$Score,
labels = high_performers$Name,
pos = 3,
cex = 0.8)
# Report the filtering results (cat calls reconstructed to match the output)
cat("\n=== FILTERING RESULTS ===\n")
cat("Original:", nrow(all_students), "students\n")
cat("After filtering:", nrow(high_performers), "students\n")
cat("Filtered out:", nrow(all_students) - nrow(high_performers), "students\n")
##
## === FILTERING RESULTS ===
## Original: 8 students
## After filtering: 5 students
## Filtered out: 3 students
Important: The subset() function automatically removes rows with NA values. Use bracket notation [ ] if you want to keep NAs.
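A short sketch of that difference, using made-up values:

```r
scores_df <- data.frame(id = 1:4, score = c(90, NA, 70, 85))

# subset() silently drops the row whose score is NA
subset(scores_df, score > 80)
##   id score
## 1  1    90
## 4  4    85

# Bracket notation keeps a placeholder row where the condition is NA,
# because NA > 80 is itself NA (neither TRUE nor FALSE)
scores_df[scores_df$score > 80, ]   # row 2 appears as a row of NAs
```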
Lists can contain different types and structures:
# Create a list
my_list <- list(
numbers = 1:5,
text = c("a", "b", "c"),
logical = c(TRUE, FALSE),
matrix = matrix(1:4, nrow = 2),
dataframe = students
)
# View structure
str(my_list)
## List of 5
## $ numbers : int [1:5] 1 2 3 4 5
## $ text : chr [1:3] "a" "b" "c"
## $ logical : logi [1:2] TRUE FALSE
## $ matrix : int [1:2, 1:2] 1 2 3 4
## $ dataframe:'data.frame': 4 obs. of 4 variables:
## ..$ Name : chr [1:4] "Alice" "Bob" "Carol" "David"
## ..$ Age : num [1:4] 20 22 21 23
## ..$ Grade : num [1:4] 85 92 78 88
## ..$ Passed: logi [1:4] TRUE TRUE TRUE TRUE
# Accessing list elements (calls reconstructed to match the outputs)
my_list$numbers
## [1] 1 2 3 4 5
my_list[["numbers"]]
## [1] 1 2 3 4 5
my_list$text
## [1] "a" "b" "c"
my_list$dataframe$Name
## [1] "Alice" "Bob" "Carol" "David"
Before importing data, know where R is looking for files:
# Check the current working directory
getwd()
## [1] "D:/Github/data_sciences/ANLY500-Analytics-I/Week01"
# Set working directory (use forward slashes or double backslashes)
# setwd("C:/Users/YourName/Documents/R_Projects")
# setwd("C:\\Users\\YourName\\Documents\\R_Projects")

Tips:
- Use relative paths such as "data/myfile.csv" instead of full paths
- Avoid setwd() in scripts: it breaks when you share code

The rio package handles many file formats automatically:
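For instance (file names here are hypothetical; rio picks the reader from the file extension):

```r
# install.packages("rio")   # one-time install
library(rio)

# import() infers the format from the extension
survey <- import("data/survey.csv")     # CSV
grades <- import("data/grades.xlsx")    # Excel
old_study <- import("data/study.sav")   # SPSS

# export() works the same way in reverse
export(survey, "output/survey_clean.rds")
```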
Once data is loaded, explore it:
# Inspect the structure (call reconstructed to match the output)
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
summary(airquality)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
dim(airquality)
## [1] 153 6
nrow(airquality)
## [1] 153
ncol(airquality)
## [1] 6
names(airquality)
## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
In Plain English: Before we get to statistics, you need to understand how R compares values and makes logical decisions. These operations are the foundation of filtering data (“show me all students who scored > 85”) and conditional calculations (“if age < 18, then minor”).
# Stored values and math functions (inputs reconstructed to match the outputs)
x <- 10
y <- 5
x + y
## [1] 15
x - y
## [1] 5
x * y
## [1] 50
x / y
## [1] 2
x^2
## [1] 100
sum(c(5, 10, 15))
## [1] 30
mean(c(10, 20, 30))
## [1] 20
sqrt(16)
## [1] 4
abs(-10)
## [1] 10
log(10) # Natural logarithm
## [1] 2.302585
log10(100) # Base-10 logarithm
## [1] 2
exp(1) # Euler's number e
## [1] 2.718282
In Plain English: Comparison operators ask questions and return TRUE or FALSE. Think of them like a quality control inspector: “Is this product weight > 100 grams?” YES (TRUE) or NO (FALSE). These TRUE/FALSE answers then let you filter and select data.
Comparisons return TRUE or FALSE:
# Inputs reconstructed to match the outputs
5 > 3
## [1] TRUE
5 >= 5
## [1] TRUE
3 < 5
## [1] TRUE
5 < 3
## [1] FALSE
5 == 5
## [1] TRUE
5 != 5
## [1] FALSE
# Comparisons are vectorized: one answer per element
x <- c(1, 2, 3, 4, 5)
x > 3
## [1] FALSE FALSE FALSE TRUE TRUE
x == 3
## [1] FALSE FALSE TRUE FALSE FALSE
Visualizing Comparison Operations:
test_scores <- c(72, 85, 91, 68, 95, 78, 88, 82)
threshold <- 80
par(mfrow = c(2, 2))
# 1. Visual representation of scores vs threshold
barplot(test_scores,
names.arg = paste0("S", 1:length(test_scores)),
main = "Scores vs Threshold (80)",
ylab = "Score",
col = ifelse(test_scores >= threshold, "green", "red"),
ylim = c(0, 100))
abline(h = threshold, col = "blue", lwd = 3, lty = 2)
text(4, threshold + 3, "Threshold = 80", col = "blue", font = 2)
legend("bottomright",
legend = c("Passed (≥80)", "Failed (<80)"),
fill = c("green", "red"))
# 2. Truth table for ==, >, <
comparison_results <- data.frame(
Value = test_scores,
"Equal_80" = test_scores == 80,
"Greater_80" = test_scores > 80,
"Less_80" = test_scores < 80,
"GreaterEqual_80" = test_scores >= 80
)
# Show counts
counts <- c(sum(test_scores == 80),
sum(test_scores > 80),
sum(test_scores < 80),
sum(test_scores >= 80))
barplot(counts,
names.arg = c("== 80", "> 80", "< 80", "≥ 80"),
main = "Comparison Operations Count",
ylab = "Number of Students",
col = rainbow(4),
ylim = c(0, max(counts) + 1))
text(x = 1:4 * 1.2 - 0.5, y = counts + 0.3, labels = counts, font = 2)
# 3. Element-wise comparison visualization
x_demo <- c(1, 2, 3, 4, 5)
compare_3 <- x_demo > 3
barplot(rbind(x_demo, compare_3 * 5), # Scale TRUE/FALSE to 5 for visibility
beside = TRUE,
names.arg = paste0("x[", 1:5, "]"),
main = "Element-wise: x > 3",
ylab = "Value",
col = c("lightblue", "salmon"),
legend.text = c("Original Value", "x > 3 (TRUE=5, FALSE=0)"),
args.legend = list(x = "topleft", cex = 0.8))
# 4. Multiple comparisons on same data
age_data <- c(15, 18, 22, 17, 25, 30, 16, 21)
comparisons_matrix <- rbind(
"< 18" = sum(age_data < 18),
"18-21" = sum(age_data >= 18 & age_data <= 21),
"21-25" = sum(age_data > 21 & age_data <= 25),
"> 25" = sum(age_data > 25)
)
barplot(comparisons_matrix,
main = "Age Groups Using Comparisons",
xlab = "Age Range",
ylab = "Count",
col = c("lightblue", "lightgreen", "yellow", "salmon"),
ylim = c(0, max(comparisons_matrix) + 1))
text(x = 0.7, y = comparisons_matrix + 0.2, labels = comparisons_matrix, font = 2)
##
## === COMPARISON OPERATIONS DEMONSTRATION ===
##
## Test Scores: 72 85 91 68 95 78 88 82
## Threshold: 80
## Comparison Results:
## Scores == 80: 0 students
## Scores > 80: 5 students
## Scores < 80: 3 students
## Scores >= 80: 5 students (PASSED)
## Scores <= 79: 3 students (FAILED)
##
## Which students passed (>=80)?
## Students: S2, S3, S5, S7, S8
## Their scores: 85 91 95 88 82
In Plain English: Logical operators combine multiple TRUE/FALSE questions. Think of airport security: “Do you have a ticket AND a passport?” Both must be TRUE. Or think of discounts: “Students OR seniors get 20% off” - either one being TRUE gets the discount. These operators let you create complex filters like “students who scored > 85 AND attended > 90% of classes.”
Combine conditions:
# Inputs reconstructed to match the outputs
TRUE & TRUE # AND: both must be TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
TRUE | FALSE # OR: at least one must be TRUE
## [1] TRUE
FALSE | FALSE
## [1] FALSE
!TRUE # NOT: negates the value
## [1] FALSE
!FALSE
## [1] TRUE
xor(TRUE, FALSE) # Exclusive OR: exactly one TRUE
## [1] TRUE
xor(TRUE, TRUE)
## [1] FALSE
Visualizing Logical Operators:
par(mfrow = c(2, 2))
# 1. AND operator truth table visualization
and_results <- c(
"T & T" = 1, # TRUE
"T & F" = 0, # FALSE
"F & T" = 0, # FALSE
"F & F" = 0 # FALSE
)
barplot(and_results,
main = "AND Operator (&)\nBoth Must Be TRUE",
ylab = "Result (1=TRUE, 0=FALSE)",
col = ifelse(and_results == 1, "green", "red"),
ylim = c(0, 1.2))
text(x = 1:4 * 1.2 - 0.5, y = and_results + 0.1,
labels = ifelse(and_results == 1, "TRUE", "FALSE"),
font = 2)
# 2. OR operator truth table visualization
or_results <- c(
"T | T" = 1, # TRUE
"T | F" = 1, # TRUE
"F | T" = 1, # TRUE
"F | F" = 0 # FALSE
)
barplot(or_results,
main = "OR Operator (|)\nAt Least One Must Be TRUE",
ylab = "Result (1=TRUE, 0=FALSE)",
col = ifelse(or_results == 1, "green", "red"),
ylim = c(0, 1.2))
text(x = 1:4 * 1.2 - 0.5, y = or_results + 0.1,
labels = ifelse(or_results == 1, "TRUE", "FALSE"),
font = 2)
# 3. Real-world example: filtering students
student_data <- data.frame(
Name = c("Alice", "Bob", "Carol", "David", "Eve", "Frank", "Grace", "Henry"),
Grade = c(92, 78, 85, 68, 95, 72, 88, 82),
Attendance = c(95, 85, 92, 75, 98, 70, 90, 88)
)
# Different filtering criteria
high_grade <- student_data$Grade >= 85
high_attendance <- student_data$Attendance >= 90
filter_counts <- c(
"Only High Grade" = sum(high_grade),
"Only High Attend" = sum(high_attendance),
"BOTH (AND)" = sum(high_grade & high_attendance),
"EITHER (OR)" = sum(high_grade | high_attendance)
)
barplot(filter_counts,
main = "Filtering Students\n(Grade≥85, Attendance≥90)",
ylab = "Number of Students",
col = c("lightblue", "lightgreen", "purple", "orange"),
ylim = c(0, max(filter_counts) + 1),
las = 2)
text(x = 1:4 * 1.2 - 0.5, y = filter_counts + 0.3,
labels = filter_counts, font = 2)
# 4. Venn diagram-style visualization
plot(0:10, 0:10, type = "n", axes = FALSE, xlab = "", ylab = "",
main = "Logical Operators Visualized")
# Draw circles
symbols(3, 5, circles = 2.5, inches = FALSE, add = TRUE,
fg = "blue", lwd = 3)
symbols(7, 5, circles = 2.5, inches = FALSE, add = TRUE,
fg = "red", lwd = 3)
# Label regions
text(2, 5, "Grade≥85\nOnly", col = "blue", font = 2)
text(8, 5, "Attend≥90\nOnly", col = "red", font = 2)
text(5, 5, "BOTH\n(AND)", col = "purple", font = 2, cex = 1.2)
text(5, 1, "EITHER (OR) = everything inside either circle",
col = "orange", font = 2, cex = 0.9)
##
## === LOGICAL OPERATORS DEMONSTRATION ===
##
## Student Data:
## Name Grade Attendance
## 1 Alice 92 95
## 2 Bob 78 85
## 3 Carol 85 92
## 4 David 68 75
## 5 Eve 95 98
## 6 Frank 72 70
## 7 Grace 88 90
## 8 Henry 82 88
##
## 1. AND Operator (&) - BOTH conditions must be TRUE:
excellence <- student_data[high_grade & high_attendance, ]
cat(" Students with Grade≥85 AND Attendance≥90:\n")
## Students with Grade≥85 AND Attendance≥90:
## Name Grade Attendance
## 1 Alice 92 95
## 3 Carol 85 92
## 5 Eve 95 98
## 7 Grace 88 90
##
## 2. OR Operator (|) - AT LEAST ONE condition must be TRUE:
doing_well <- student_data[high_grade | high_attendance, ]
cat(" Students with Grade≥85 OR Attendance≥90:\n")
## Students with Grade≥85 OR Attendance≥90:
## Name Grade Attendance
## 1 Alice 92 95
## 3 Carol 85 92
## 5 Eve 95 98
## 7 Grace 88 90
##
## 3. NOT Operator (!) - NEGATES the condition:
struggling <- student_data[!high_grade & !high_attendance, ]
cat(" Students with Grade<85 AND Attendance<90:\n")
## Students with Grade<85 AND Attendance<90:
## Name Grade Attendance
## 2 Bob 78 85
## 4 David 68 75
## 6 Frank 72 70
## 8 Henry 82 88
##
## 4. Complex Combination:
## Students with (Grade≥85 AND Attendance≥90) OR Grade≥95:
special <- student_data[(high_grade & high_attendance) | (student_data$Grade >= 95), ]
print(special)
## Name Grade Attendance
## 1 Alice 92 95
## 3 Carol 85 92
## 5 Eve 95 98
## 7 Grace 88 90
##
## Summary Counts:
## Total students: 8
## High grade (≥85): 4
## High attendance (≥90): 4
## Both (AND): 4
## Either (OR): 4
## Neither: 4
Common Beginner Mistake: Using = instead of == for comparison! Remember: x = 5 ASSIGNS 5 to x, while x == 5 CHECKS if x equals 5. Also, remember that & and | work element-wise on vectors, while && and || expect single TRUE/FALSE values (used in if-statements, which you’ll learn later).
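A quick sketch of both traps (values are illustrative):

```r
x <- 5
x == 5         # comparison: asks a question
## [1] TRUE
# x = 5 would instead assign 5 to x

# & is element-wise: one TRUE/FALSE per position
c(TRUE, FALSE) & c(TRUE, TRUE)
## [1]  TRUE FALSE

# && expects single values, as used in if () conditions
if (x > 0 && x < 10) cat("x is a single digit\n")
## x is a single digit
```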
In Plain English: Real-world data is messy! Sometimes values are missing - maybe someone didn’t answer a survey question, or a sensor failed to record a measurement. R represents missing data as NA (Not Available). Learning to identify and handle missing data is CRUCIAL because ignoring it can lead to wrong conclusions.
# Detecting missing values (inputs reconstructed to match the outputs)
x <- c(1, 2, NA, 4, 5, NA)
is.na(x)
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
sum(is.na(x)) # How many values are missing?
## [1] 2
which(is.na(x)) # Where are they?
## [1] 3 6
sum(is.na(airquality)) # Total NAs in the airquality dataset
## [1] 44
colSums(is.na(airquality)) # NAs per column
## Ozone Solar.R Wind Temp Month Day
## 37 7 0 0 0 0
Visualizing Missing Data:
# Create example dataset with missing values
set.seed(123)
example_data <- data.frame(
Student = paste0("S", 1:20),
Test1 = c(85, 90, NA, 78, 92, 88, NA, 76, 95, 82,
87, NA, 91, 84, 89, 93, 81, NA, 86, 88),
Test2 = c(82, NA, 88, 76, 90, NA, 85, 79, NA, 84,
86, 89, 92, NA, 87, 91, 83, 85, NA, 87),
Test3 = c(NA, 92, 86, NA, 94, 87, 83, NA, 96, 85,
88, 90, NA, 86, NA, 94, 84, 86, 88, NA)
)
# Visualize the pattern of missing data
par(mfrow = c(2, 2))
# 1. Missing data pattern
missing_matrix <- is.na(example_data[, -1]) # Exclude Student column
image(1:ncol(missing_matrix), 1:nrow(missing_matrix),
t(missing_matrix),
col = c("lightblue", "red"),
main = "Missing Data Pattern\n(Red = Missing)",
xlab = "Test",
ylab = "Student",
axes = FALSE)
axis(1, at = 1:3, labels = c("Test1", "Test2", "Test3"))
axis(2, at = seq(1, 20, 5), labels = seq(1, 20, 5), las = 2)
# 2. Count of missing per test
missing_counts <- colSums(is.na(example_data[, -1]))
barplot(missing_counts,
main = "Missing Values per Test",
ylab = "Number Missing",
col = "salmon",
ylim = c(0, max(missing_counts) + 2))
text(x = 1:length(missing_counts) * 1.2 - 0.5,
y = missing_counts + 0.5,
labels = missing_counts)
# 3. Complete vs incomplete cases
complete_status <- ifelse(complete.cases(example_data), "Complete", "Has Missing")
status_table <- table(complete_status)
barplot(status_table,
main = "Students: Complete vs Incomplete Data",
ylab = "Count",
col = c("green", "red"))
# 4. Comparison: with and without NAs
test1_with_na <- example_data$Test1
test1_without_na <- test1_with_na[!is.na(test1_with_na)]
boxplot(test1_with_na, test1_without_na,
names = c("With NAs\n(mean returns NA)", "NAs Removed\n(works)"),
main = "Effect of NA on Analysis",
ylab = "Test1 Score",
col = c("pink", "lightgreen"))
# Summarize the extent of missingness (cat calls reconstructed to match the output)
cat("\n=== MISSING DATA SUMMARY ===\n")
cat("Total students:", nrow(example_data), "\n")
cat("Students with complete data:", sum(complete.cases(example_data)), "\n")
cat("Students with missing data:", sum(!complete.cases(example_data)), "\n")
cat("Total missing values:", sum(is.na(example_data)), "\n")
##
## === MISSING DATA SUMMARY ===
## Total students: 20
## Students with complete data: 5
## Students with missing data: 15
## Total missing values: 15
Key concept – Missing Data: Field, Miles, and Field (2012, ch. 5) discuss three types of missing data mechanisms: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Understanding why data is missing determines appropriate handling strategies.
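One informal way to probe the mechanism (a rough sketch, not a formal test): check whether other variables differ between rows where a value is missing and rows where it is recorded. Using the built-in airquality data:

```r
# Compare temperature on days where Ozone is missing vs. recorded
missing_ozone <- is.na(airquality$Ozone)

mean(airquality$Temp[missing_ozone])    # days with Ozone missing
mean(airquality$Temp[!missing_ozone])   # days with Ozone recorded

# A large gap between these means would suggest the Ozone values
# are not missing completely at random (MCAR)
```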
In Plain English: Sometimes the best solution is to simply remove rows with missing data. Think of it like throwing away a damaged product on an assembly line. BUT be careful - if you throw away too much data, you might not have enough left for reliable analysis!
# Removing missing values (inputs reconstructed to match the outputs)
x <- c(1, 2, NA, 4, 5, NA)
x[!is.na(x)] # Keep only non-missing values
## [1] 1 2 4 5
na.omit(x) # Same result, with bookkeeping attributes
## [1] 1 2 4 5
## attr(,"na.action")
## [1] 3 6
## attr(,"class")
## [1] "omit"
# Apply to a whole data frame
nrow(airquality)
## [1] 153
airquality_complete <- na.omit(airquality)
nrow(airquality_complete)
## [1] 111
# Use complete.cases()
complete_rows <- complete.cases(airquality)
airquality_filtered <- airquality[complete_rows, ]

Visualizing the Impact of Removing NAs:
par(mfrow = c(2, 2))
# 1. Show data loss
data_summary <- c(Original = nrow(airquality),
Complete = nrow(airquality_complete))
barplot(data_summary,
main = "Data Loss from NA Removal",
ylab = "Number of Rows",
col = c("lightblue", "lightgreen"),
ylim = c(0, max(data_summary) + 20))
text(x = 1:2 * 1.2 - 0.5,
y = data_summary + 5,
labels = paste0(data_summary, " rows\n",
round(data_summary/nrow(airquality)*100, 1), "%"))
# 2. Compare distributions: Original Ozone
hist(airquality$Ozone,
main = "Original Ozone Data\n(with NAs)",
xlab = "Ozone Level",
col = "pink",
breaks = 15)
abline(v = mean(airquality$Ozone, na.rm = TRUE),
col = "red", lwd = 2, lty = 2)
text(x = mean(airquality$Ozone, na.rm = TRUE), y = 10,
labels = paste0("Mean = ", round(mean(airquality$Ozone, na.rm = TRUE), 1)),
pos = 4, col = "red")
# 3. Compare distributions: After NA removal
hist(airquality_complete$Ozone,
main = "Complete Cases Only\n(NAs removed)",
xlab = "Ozone Level",
col = "lightgreen",
breaks = 15)
abline(v = mean(airquality_complete$Ozone),
col = "darkgreen", lwd = 2, lty = 2)
text(x = mean(airquality_complete$Ozone), y = 10,
labels = paste0("Mean = ", round(mean(airquality_complete$Ozone), 1)),
pos = 4, col = "darkgreen")
# 4. Side-by-side boxplot comparison
boxplot(airquality$Ozone, airquality_complete$Ozone,
names = c("With NAs\n(n=153)",
"Complete Cases\n(n=111)"),
main = "Distribution Comparison",
ylab = "Ozone Level",
col = c("pink", "lightgreen"))
# Quantify the impact (cat calls reconstructed to match the output)
cat("\n=== IMPACT OF NA REMOVAL ===\n")
cat("Original rows:", nrow(airquality), "\n")
cat("Rows with complete data:", nrow(airquality_complete), "\n")
cat("Rows removed:", nrow(airquality) - nrow(airquality_complete), "\n")
##
## === IMPACT OF NA REMOVAL ===
## Original rows: 153
## Rows with complete data: 111
## Rows removed: 42
cat("Percentage of data retained:",
round(nrow(airquality_complete)/nrow(airquality)*100, 1), "%\n")
## Percentage of data retained: 72.5 %
##
## Ozone Mean (with na.rm=TRUE): 42.13
## Ozone Mean (complete cases only): 42.1
cat("Difference:",
round(mean(airquality$Ozone, na.rm = TRUE) - mean(airquality_complete$Ozone), 2), "\n")
## Difference: 0.03
Important: Notice how removing NAs can change your dataset! In this example, we lost 42 rows (27% of data). Always report how much data was removed due to missing values, as this affects the generalizability of your results.
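A compact way to generate that report (a sketch using the same airquality data):

```r
# Percentage of missing values per column
round(colMeans(is.na(airquality)) * 100, 1)
##   Ozone Solar.R    Wind    Temp   Month     Day
##    24.2     4.6     0.0     0.0     0.0     0.0

# Percentage of rows lost when keeping complete cases only
round(mean(!complete.cases(airquality)) * 100, 1)
## [1] 27.5
```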
In Plain English: Here’s a trap beginners often fall into - if your data has even ONE missing value, functions like mean() will return NA instead of a number! It’s like asking “What’s the average height?” when one person didn’t report their height - R says “I can’t tell you an ‘average’ because I don’t have all the data.” The solution? Use na.rm = TRUE (NA remove = TRUE) to tell R “ignore the missing values and calculate anyway.”
# Functions return NA if any input is missing (calls reconstructed to match the outputs)
mean(airquality$Ozone)
## [1] NA
mean(airquality$Ozone, na.rm = TRUE)
## [1] 42.12931
median(airquality$Ozone, na.rm = TRUE)
## [1] 31.5
sd(airquality$Ozone, na.rm = TRUE)
## [1] 32.98788
summary(airquality$Ozone) # summary() reports NAs automatically
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 18.00 31.50 42.13 63.25 168.00 37
Visualizing Why na.rm = TRUE Matters:
# Create example with and without NAs
clean_scores <- c(85, 90, 78, 92, 88, 76, 95, 82)
dirty_scores <- c(85, 90, NA, 92, 88, NA, 95, 82)
par(mfrow = c(2, 2))
# 1. Comparison bar chart
methods <- c("Without na.rm", "With na.rm=TRUE")
results <- c(NA, mean(dirty_scores, na.rm = TRUE))
barplot(results,
names.arg = methods,
main = "Effect of na.rm Parameter",
ylab = "Mean Score",
col = c("red", "green"),
ylim = c(0, 100))
text(x = c(0.7, 1.9), y = c(5, results[2] + 3),
labels = c("Returns NA!", round(results[2], 1)))
# 2. Show clean vs dirty data distribution
boxplot(clean_scores, dirty_scores,
names = c("Clean Data", "Data with NAs"),
main = "Comparing Clean vs. Dirty Data",
ylab = "Score",
col = c("lightgreen", "pink"))
abline(h = mean(clean_scores), col = "darkgreen", lty = 2, lwd = 2)
abline(h = mean(dirty_scores, na.rm = TRUE), col = "red", lty = 2, lwd = 2)
legend("bottomright",
legend = c("Mean (clean)", "Mean (dirty, na.rm=T)"),
col = c("darkgreen", "red"), lty = 2, lwd = 2)
# 3. Multiple statistics comparison
stats_clean <- c(Mean = mean(clean_scores),
Median = median(clean_scores),
SD = sd(clean_scores))
stats_dirty <- c(Mean = mean(dirty_scores, na.rm = TRUE),
Median = median(dirty_scores, na.rm = TRUE),
SD = sd(dirty_scores, na.rm = TRUE))
barplot(rbind(stats_clean, stats_dirty),
beside = TRUE,
main = "Clean vs Dirty: Multiple Statistics",
ylab = "Value",
col = c("lightgreen", "pink"),
legend.text = c("Clean Data", "With NAs (na.rm=T)"),
args.legend = list(x = "topright"))
# 4. Show how many observations were used
n_clean <- length(clean_scores)
n_dirty_total <- length(dirty_scores)
n_dirty_complete <- sum(!is.na(dirty_scores))
counts <- rbind(c(n_clean, n_clean),
c(n_dirty_total, n_dirty_complete))
barplot(counts,
beside = TRUE,
main = "Sample Size: Total vs. Used",
names.arg = c("Total N", "Used in Calculation"),
ylab = "Count",
col = c("lightgreen", "pink"),
legend.text = c("Clean Data", "With NAs"),
args.legend = list(x = "topleft"))
##
## === COMPARISON: CLEAN VS. DIRTY DATA ===
##
## CLEAN DATA (no missing):
## N: 8
## Mean: 85.75
## Median: 86.5
## SD: 6.73
##
## DIRTY DATA (with NAs):
## Total N: 8
## Missing: 2
## Complete: 6
## Mean (without na.rm): NA
## Mean (with na.rm=TRUE): 88.67
## Median (with na.rm=TRUE): 89
## SD (with na.rm=TRUE): 4.72
Pro Tip: Always use na.rm = TRUE with statistical functions when you have missing data. But also REPORT how many values were missing! Saying “mean = 87” is misleading if it is based on 6 observations when you started with 8.
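One way to follow this advice is a small helper that returns the mean together with how many values were missing, so the caveat travels with the number (report_mean is a hypothetical name, not a base R function):

``` r
# Hypothetical helper: mean plus a missing-data report
report_mean <- function(x) {
  list(mean      = mean(x, na.rm = TRUE),
       n_used    = sum(!is.na(x)),   # values actually averaged
       n_missing = sum(is.na(x)))    # values dropped by na.rm
}

dirty_scores <- c(85, 90, NA, 92, 88, NA, 95, 82)
report_mean(dirty_scores)  # mean 88.67 from 6 of 8 values (2 missing)
```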
In Plain English: Descriptive statistics are like the “highlights reel” of your data. Instead of looking at hundreds or thousands of numbers, you get a few key summaries: What’s typical? (mean, median) How spread out? (SD, range) What are the extremes? (min, max). These are your data’s “vital signs.”
# Create sample data
scores <- c(85, 92, 78, 90, 88, 95, 82, 87)
# Measures of central tendency
mean(scores) # Average
## [1] 87.125
## [1] 87.5
## [1] 5.462535
## [1] 29.83929
## [1] 78 95
## [1] 78
## [1] 95
## [1] 697
## [1] 8
## 25% 50% 75%
## 84.25 87.50 90.50
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 78.00 84.25 87.50 87.12 90.50 95.00
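The remaining outputs above correspond to the standard descriptive functions (a hedged reconstruction of the hidden code):

``` r
scores <- c(85, 92, 78, 90, 88, 95, 82, 87)

median(scores)    # 87.5     -- middle value
sd(scores)        # 5.462535 -- standard deviation
var(scores)       # 29.83929 -- variance (sd squared)
range(scores)     # 78 95    -- min and max together
min(scores)       # 78
max(scores)       # 95
sum(scores)       # 697
length(scores)    # 8        -- sample size
quantile(scores, c(0.25, 0.5, 0.75))  # 84.25 87.50 90.50
summary(scores)   # five-number summary plus the mean
```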
Visualizing Descriptive Statistics:
par(mfrow = c(2, 2))
# 1. Histogram with mean and median
hist(scores,
main = "Distribution with Central Tendency",
xlab = "Score",
col = "lightblue",
breaks = 8,
ylim = c(0, 3))
abline(v = mean(scores), col = "red", lwd = 3, lty = 2)
abline(v = median(scores), col = "blue", lwd = 3, lty = 2)
legend("topright",
legend = c(paste("Mean =", round(mean(scores), 1)),
paste("Median =", round(median(scores), 1))),
col = c("red", "blue"), lty = 2, lwd = 3)
# 2. Boxplot showing quartiles and outliers
boxplot(scores,
main = "Boxplot: Quartiles & Range",
ylab = "Score",
col = "lightgreen",
horizontal = FALSE)
points(1, mean(scores), pch = 18, col = "red", cex = 2)
text(1.3, quantile(scores, 0.75), "Q3 (75%)", pos = 4)
text(1.3, median(scores), "Median (Q2)", pos = 4)
text(1.3, quantile(scores, 0.25), "Q1 (25%)", pos = 4)
text(1.3, mean(scores), "Mean", pos = 4, col = "red")
# 3. Barplot of all individual scores with mean line
barplot(scores,
main = "Individual Scores",
ylab = "Score",
xlab = "Student",
col = rainbow(length(scores)),
ylim = c(0, 100))
abline(h = mean(scores), col = "black", lwd = 2, lty = 2)
text(4, mean(scores) + 3, paste("Mean =", round(mean(scores), 1)))
# 4. Comparing key statistics
key_stats <- c(Min = min(scores),
Q1 = quantile(scores, 0.25),
Median = median(scores),
Mean = mean(scores),
Q3 = quantile(scores, 0.75),
Max = max(scores))
barplot(key_stats,
main = "Summary Statistics Overview",
ylab = "Value",
col = c("red", "orange", "yellow", "lightgreen", "lightblue", "purple"),
ylim = c(0, 100),
las = 2)
text(x = 1:length(key_stats) * 1.2 - 0.5,
y = key_stats + 3,
labels = round(key_stats, 1))
##
## === DESCRIPTIVE STATISTICS SUMMARY ===
## Sample size (N): 8
##
## Measures of Central Tendency:
## Mean: 87.12
## Median: 87.5
##
## Measures of Dispersion:
## Standard Deviation: 5.46
## Variance: 29.84
## Range: 78 to 95
## IQR (Q3-Q1): 6.25
##
## Extreme Values:
## Minimum: 78
## Maximum: 95
##
## Quartiles:
## Q1 (25th percentile): 84.25
## Q2 (50th percentile / Median): 87.5
## Q3 (75th percentile): 90.5
Field et al. (2012, ch. 2) on Descriptive Statistics: “Before you can analyze your data, you need to describe it. Central tendency tells you about the typical score, dispersion tells you about the variability, and together they give you a complete picture of your data’s distribution.”
In Plain English: Frequency tables answer the question “How many?” How many students got an A? How many cars have 4 cylinders? They’re especially useful for categorical data (like grades, colors, or yes/no responses). Think of it as counting how many times each unique value appears in your data.
# Create categorical data
grades <- c("A", "B", "A", "C", "B", "A", "B", "A", "C", "B")
# Frequency table
table(grades)
## grades
## A B C
## 4 4 2
## grades
## A B C
## 0.4 0.4 0.2
##
## 3 4 5
## 4 1 8 2
## 6 2 4 1
## 8 12 0 2
##
## 3 4 5 Sum
## 4 1 8 2 11
## 6 2 4 1 7
## 8 12 0 2 14
## Sum 15 12 5 32
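The proportion table and the cross-tabulation with margins above likely came from calls like these (a hedged reconstruction; the mtcars dataset ships with R):

``` r
grades <- c("A", "B", "A", "C", "B", "A", "B", "A", "C", "B")

table(grades)              # counts:      A=4 B=4 C=2
prop.table(table(grades))  # proportions: 0.4 0.4 0.2

# Cross-tabulation: cylinders (rows) by gears (columns) for 32 cars
data(mtcars)
cross_tab <- table(mtcars$cyl, mtcars$gear)
cross_tab
addmargins(cross_tab)      # appends row and column sums
```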
Visualizing Frequency Tables:
par(mfrow = c(2, 2))
# 1. Simple bar chart of grades
grade_freq <- table(grades)
barplot(grade_freq,
main = "Grade Distribution",
xlab = "Grade",
ylab = "Frequency (Count)",
col = c("gold", "lightgray", "coral"),
ylim = c(0, max(grade_freq) + 1))
text(x = 1:length(grade_freq) * 1.2 - 0.5,
y = grade_freq + 0.3,
labels = grade_freq)
# 2. Proportions (percentage) chart
grade_prop <- prop.table(grade_freq)
barplot(grade_prop,
main = "Grade Distribution (Proportions)",
xlab = "Grade",
ylab = "Proportion",
col = c("gold", "lightgray", "coral"),
ylim = c(0, max(grade_prop) + 0.1))
text(x = 1:length(grade_prop) * 1.2 - 0.5,
y = grade_prop + 0.02,
labels = paste0(round(grade_prop * 100, 1), "%"))
# 3. Cross-tabulation: Cylinders vs Gears (heatmap style)
cross_tab <- table(mtcars$cyl, mtcars$gear)
barplot(cross_tab,
beside = TRUE,
main = "Cars: Cylinders × Gears",
xlab = "Number of Gears",
ylab = "Count",
col = c("lightblue", "lightgreen", "salmon"),
legend.text = c("4 cyl", "6 cyl", "8 cyl"),
args.legend = list(x = "topright"))
# 4. Cross-tabulation as heatmap
image(1:ncol(cross_tab), 1:nrow(cross_tab),
t(as.matrix(cross_tab)),
col = heat.colors(max(cross_tab)),
main = "Heatmap: Cylinders × Gears",
xlab = "Number of Gears",
ylab = "Number of Cylinders",
axes = FALSE)
axis(1, at = 1:ncol(cross_tab), labels = colnames(cross_tab))
axis(2, at = 1:nrow(cross_tab), labels = rownames(cross_tab), las = 2)
# Add cell values
for(i in 1:nrow(cross_tab)) {
for(j in 1:ncol(cross_tab)) {
text(j, i, cross_tab[i,j], col = "black", cex = 1.5)
}
}
##
## === FREQUENCY TABLE SUMMARY ===
##
## Grade Frequencies:
## grades
## A B C
## 4 4 2
##
## Grade Proportions:
## grades
## A B C
## 0.4 0.4 0.2
##
## Percentages:
## [1] "A: 40%" "B: 40%" "C: 20%"
##
##
## Cross-Tabulation: Cylinders × Gears
##
## 3 4 5 Sum
## 4 1 8 2 11
## 6 2 4 1 7
## 8 12 0 2 14
## Sum 15 12 5 32
cat("\nInterpretation: Most common combination is",
paste0(rownames(cross_tab)[which(cross_tab == max(cross_tab), arr.ind = TRUE)[1,1]],
" cylinders with ",
colnames(cross_tab)[which(cross_tab == max(cross_tab), arr.ind = TRUE)[1,2]],
" gears (n=", max(cross_tab), ")"))##
## Interpretation: Most common combination is 8 cylinders with 3 gears (n=12)
Why This Matters: Frequency tables are the foundation of categorical data analysis. They help you spot patterns (“Most students got A or B”), identify rare events (“Only 2 customers complained”), and prepare data for chi-square tests and other statistical analyses.
## 7.4 Data Exploration Functions
``` r
data(iris)
# Structure
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
## [1] 150 5
## [1] 150
## [1] 5
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
## [1] setosa versicolor virginica
## Levels: setosa versicolor virginica
## [1] 3
```
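The exploration outputs that follow `str(iris)` above were likely produced by these standard calls (a hedged reconstruction; only `str()` appeared in the visible code):

``` r
data(iris)

summary(iris)          # per-column summaries; factor columns show counts
dim(iris)              # 150 5
nrow(iris)             # 150
ncol(iris)             # 5
names(iris)            # column names
colnames(iris)         # identical to names() for a data frame
levels(iris$Species)   # setosa versicolor virginica
nlevels(iris$Species)  # 3
```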
In Plain English: Correlation measures how two variables move together. Do taller students weigh more? Does studying more lead to higher grades? Correlations range from -1 (perfect negative: as one goes up, other goes down) to +1 (perfect positive: both move together). Zero means no relationship. Covariance is similar but harder to interpret (not standardized), so we usually prefer correlation.
## [1] -0.1175698
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
## [1] 0.6983603
## [1] -0.042434
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
## Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
## Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
## Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063
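The correlation and covariance outputs above match these base R calls (a hedged reconstruction; note the `use` argument when missing data is present):

``` r
data(iris); data(airquality)

cor(iris$Sepal.Length, iris$Sepal.Width)  # -0.1175698 (weak negative)
cor(iris[, 1:4])                          # full correlation matrix

# With NAs, tell cor() to use only complete pairs of observations
cor(airquality$Temp, airquality$Ozone, use = "complete.obs")  # 0.6983603

cov(iris$Sepal.Length, iris$Sepal.Width)  # -0.042434 (unstandardized)
cov(iris[, 1:4])                          # covariance matrix
```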
Visualizing Correlations:
par(mfrow = c(2, 2))
# 1. Positive correlation example
set.seed(123)
x_pos <- 1:50
y_pos <- 2 * x_pos + rnorm(50, 0, 10)
cor_pos <- cor(x_pos, y_pos)
plot(x_pos, y_pos,
main = paste0("Positive Correlation\nr = ", round(cor_pos, 2)),
xlab = "Study Hours",
ylab = "Grade",
pch = 19,
col = "blue")
abline(lm(y_pos ~ x_pos), col = "red", lwd = 2)
text(10, max(y_pos) - 10, "As X increases,\nY increases", col = "red", font = 2)
# 2. Negative correlation example
y_neg <- -2 * x_pos + 100 + rnorm(50, 0, 10)
cor_neg <- cor(x_pos, y_neg)
plot(x_pos, y_neg,
main = paste0("Negative Correlation\nr = ", round(cor_neg, 2)),
xlab = "Hours Watching TV",
ylab = "Grade",
pch = 19,
col = "red")
abline(lm(y_neg ~ x_pos), col = "blue", lwd = 2)
text(10, min(y_neg) + 10, "As X increases,\nY decreases", col = "blue", font = 2)
# 3. No correlation example
y_zero <- rnorm(50, 50, 15)
cor_zero <- cor(x_pos, y_zero)
plot(x_pos, y_zero,
main = paste0("No Correlation\nr = ", round(cor_zero, 2)),
xlab = "Shoe Size",
ylab = "Math Skill",
pch = 19,
col = "gray")
abline(lm(y_zero ~ x_pos), col = "darkgreen", lwd = 2)
text(25, 70, "No relationship!", col = "darkgreen", font = 2)
# 4. Real data: iris correlation matrix heatmap
cor_matrix <- cor(iris[, 1:4])
image(1:4, 1:4, cor_matrix,
col = colorRampPalette(c("blue", "white", "red"))(20),
main = "Iris Correlation Matrix Heatmap",
xlab = "", ylab = "",
axes = FALSE)
axis(1, at = 1:4, labels = colnames(iris)[1:4], las = 2, cex.axis = 0.8)
axis(2, at = 1:4, labels = colnames(iris)[1:4], las = 2, cex.axis = 0.8)
# Add correlation values
for(i in 1:4) {
for(j in 1:4) {
text(i, j, round(cor_matrix[j, i], 2),
col = ifelse(abs(cor_matrix[j, i]) > 0.5, "white", "black"),
cex = 1.2, font = 2)
}
}
##
## === CORRELATION ANALYSIS ===
##
## Correlation Interpretation:
## r = +1.0 : Perfect positive (both increase together)
## r = +0.7 : Strong positive
## r = +0.3 : Weak positive
## r = 0.0 : No relationship
## r = -0.3 : Weak negative
## r = -0.7 : Strong negative
## r = -1.0 : Perfect negative (one increases, other decreases)
##
## Iris Dataset Correlations:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.000 -0.118 0.872 0.818
## Sepal.Width -0.118 1.000 -0.428 -0.366
## Petal.Length 0.872 -0.428 1.000 0.963
## Petal.Width 0.818 -0.366 0.963 1.000
##
## Strongest Correlations in Iris:
# Find strongest correlations (excluding diagonal)
cor_matrix_off_diag <- cor_matrix
diag(cor_matrix_off_diag) <- NA
max_cor <- max(cor_matrix_off_diag, na.rm = TRUE)
max_pos <- which(cor_matrix_off_diag == max_cor, arr.ind = TRUE)[1,]
cat(" Strongest positive:",
colnames(iris)[max_pos[1]], "and", colnames(iris)[max_pos[2]],
"=", round(max_cor, 3), "\n")## Strongest positive: Petal.Width and Petal.Length = 0.963
##
## Airquality Temperature vs Ozone:
temp_ozone_cor <- cor(airquality$Temp, airquality$Ozone, use = "complete.obs")
cat(" Correlation:", round(temp_ozone_cor, 3), "\n")## Correlation: 0.698
cat(" Interpretation: ",
ifelse(temp_ozone_cor > 0.5, "Strong positive - hotter days have more ozone",
ifelse(temp_ozone_cor > 0.3, "Moderate positive",
ifelse(temp_ozone_cor < -0.3, "Negative relationship", "Weak relationship"))),
"\n")## Interpretation: Strong positive - hotter days have more ozone
Field et al. (2012, ch. 6) on Correlation: “Correlation does NOT imply causation! Just because two variables correlate doesn’t mean one causes the other. Ice cream sales and drowning deaths correlate (both peak in summer), but ice cream doesn’t cause drowning - temperature is the confounding variable.”
In Plain English: Functions are like recipes. You write the instructions once, then use them over and over. Instead of typing the same code repeatedly, you create a function that does the work for you. Think of it as teaching R a new trick - once taught, R can perform that trick whenever you ask!
# Basic function structure
square <- function(x) {
result <- x^2
return(result)
}
# Use the function
square(5)
## [1] 25
## [1] 100
# Simplified version (R returns last value automatically)
square_simple <- function(x) {
x^2
}
square_simple(7)
## [1] 49
Visualizing How Functions Work:
par(mfrow = c(2, 2))
# 1. Show input-output relationship
inputs <- 1:10
outputs <- sapply(inputs, square_simple)
plot(inputs, outputs,
type = "b",
pch = 19,
col = "blue",
main = "Function: square(x) = x²",
xlab = "Input (x)",
ylab = "Output (x²)",
cex = 1.5)
grid()
text(7, 40, "Output grows\nquadratically!", col = "red")
# 2. Compare multiple functions
cube <- function(x) x^3
sqrt_func <- function(x) sqrt(x)
x_vals <- seq(0, 5, 0.1)
plot(x_vals, x_vals, type = "l", lwd = 2, col = "black",
main = "Comparing Different Functions",
xlab = "Input (x)",
ylab = "Output",
ylim = c(0, 30))
lines(x_vals, square_simple(x_vals), col = "blue", lwd = 2)
lines(x_vals, sqrt_func(x_vals), col = "green", lwd = 2)
lines(x_vals, x_vals^3 / 10, col = "red", lwd = 2)
legend("topleft",
legend = c("x (linear)", "x² (square)", "√x (root)", "x³/10 (cube)"),
col = c("black", "blue", "green", "red"),
lwd = 2)
# 3. Function with multiple inputs demonstrated
test_values <- c(2, 4, 6, 8, 10)
results <- square_simple(test_values)
barplot(rbind(test_values, results),
beside = TRUE,
names.arg = paste0("Test", 1:5),
main = "Input vs Output",
ylab = "Value",
col = c("lightblue", "salmon"),
legend.text = c("Input", "Output"),
args.legend = list(x = "topleft"))
# 4. "Black box" concept
plot(0:1, 0:1, type = "n", axes = FALSE, xlab = "", ylab = "",
main = "Function as a 'Black Box'")
rect(0.3, 0.3, 0.7, 0.7, col = "gray", border = "black", lwd = 3)
text(0.5, 0.5, "square(x)\n\nComputes\nx²", cex = 1.2, font = 2)
arrows(0.1, 0.5, 0.28, 0.5, lwd = 3, col = "blue")
text(0.15, 0.58, "Input\nx = 5", col = "blue", font = 2)
arrows(0.72, 0.5, 0.9, 0.5, lwd = 3, col = "red")
text(0.85, 0.58, "Output\n25", col = "red", font = 2)
##
## === FUNCTION DEMONSTRATION ===
## Testing square() function with different inputs:
for(i in 1:5) {
input <- i * 2
output <- square_simple(input)
cat(" square(", input, ") = ", output, "\n", sep = "")
}
## square(2) = 4
## square(4) = 16
## square(6) = 36
## square(8) = 64
## square(10) = 100
Why Write Functions? (1) Don’t Repeat Yourself (DRY) - Write once, use many times. (2) Fewer Errors - Fix a bug once, and it’s fixed everywhere. (3) Easier to Read - calculate_bmi(weight, height) is clearer than scattered formula code.
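The calculate_bmi() call is only mentioned above, not defined; a minimal sketch of the idea (hypothetical helper, metric units assumed):

``` r
# Hypothetical helper: BMI = weight (kg) divided by height (m) squared
calculate_bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

round(calculate_bmi(70, 1.75), 1)  # 22.9
```

If the formula ever needs correcting, it is fixed in one place rather than in every script that used it.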
In Plain English: Just like a recipe needs multiple ingredients (flour, sugar, eggs), functions can take multiple inputs. You can even set default values - like having a default oven temperature that cooks use unless they specifically change it.
## [1] 8
## [1] 25
# Function with default values
greet <- function(name = "Friend") {
paste("Hello,", name)
}
greet("Alice")## [1] "Hello, Alice"
## [1] "Hello, Friend"
Visualizing Multiple Argument Functions:
# Create a practical multi-argument function
calculate_grade <- function(homework, midterm, final,
hw_weight = 0.3, mid_weight = 0.3, final_weight = 0.4) {
weighted_grade <- (homework * hw_weight) + (midterm * mid_weight) + (final * final_weight)
return(weighted_grade)
}
par(mfrow = c(2, 2))
# 1. Show how different inputs produce different outputs
students <- c("Alice", "Bob", "Carol", "David")
hw_scores <- c(85, 90, 78, 95)
mid_scores <- c(88, 85, 92, 90)
final_scores <- c(90, 88, 85, 92)
final_grades <- mapply(calculate_grade, hw_scores, mid_scores, final_scores)
barplot(rbind(hw_scores, mid_scores, final_scores, final_grades),
beside = TRUE,
names.arg = students,
main = "Multi-Input Function: Grade Calculation",
ylab = "Score",
col = c("lightblue", "lightgreen", "salmon", "gold"),
legend.text = c("Homework", "Midterm", "Final", "Weighted Grade"),
args.legend = list(x = "topleft", cex = 0.8))
# 2. Effect of changing default weights
alice_grades <- c(
calculate_grade(85, 88, 90), # Default weights
calculate_grade(85, 88, 90, hw_weight = 0.5, mid_weight = 0.2, final_weight = 0.3),
calculate_grade(85, 88, 90, hw_weight = 0.1, mid_weight = 0.1, final_weight = 0.8)
)
barplot(alice_grades,
names.arg = c("Default\n(0.3/0.3/0.4)", "Option1\n(0.5/0.2/0.3)", "Option2\n(0.1/0.1/0.8)"),
main = "Effect of Different Weights\n(Alice: HW=85, Mid=88, Final=90)",
ylab = "Weighted Grade",
col = rainbow(3),
ylim = c(0, 100))
text(x = 1:3 * 1.2 - 0.5, y = alice_grades + 2,
labels = round(alice_grades, 1), font = 2)
# 3. Temperature conversion function with multiple formulas
convert_temp <- function(temp, from = "F", to = "C") {
if(from == "F" && to == "C") {
return((temp - 32) * 5/9)
} else if(from == "C" && to == "F") {
return(temp * 9/5 + 32)
} else {
return(temp) # Same scale
}
}
temps_F <- seq(0, 100, 20)
temps_C <- sapply(temps_F, convert_temp, from = "F", to = "C")
plot(temps_F, temps_C,
type = "b",
pch = 19,
col = "red",
main = "Temperature Conversion Function",
xlab = "Fahrenheit",
ylab = "Celsius",
cex = 1.5)
grid()
abline(h = 0, col = "blue", lty = 2, lwd = 2)
abline(v = 32, col = "blue", lty = 2, lwd = 2)
text(32, -10, "Freezing Point", pos = 4, col = "blue")
# 4. Default vs. custom arguments comparison
bmi_calculator <- function(weight_kg, height_m, show_category = TRUE) {
bmi <- weight_kg / (height_m^2)
if(show_category) {
category <- ifelse(bmi < 18.5, "Underweight",
ifelse(bmi < 25, "Normal",
ifelse(bmi < 30, "Overweight", "Obese")))
return(list(BMI = round(bmi, 1), Category = category))
} else {
return(round(bmi, 1))
}
}
# Test with different people
people <- c("Person1", "Person2", "Person3", "Person4")
weights <- c(70, 85, 60, 95)
heights <- c(1.75, 1.80, 1.65, 1.70)
bmis <- mapply(function(w, h) bmi_calculator(w, h, show_category = FALSE),
weights, heights)
barplot(bmis,
names.arg = people,
main = "BMI Calculator Function\n(weight_kg, height_m, show_category)",
ylab = "BMI",
col = ifelse(bmis < 18.5, "lightblue",
ifelse(bmis < 25, "lightgreen",
ifelse(bmis < 30, "yellow", "red"))),
ylim = c(0, 35))
abline(h = c(18.5, 25, 30), lty = 2, col = "gray")
text(2.5, 18.5, "Underweight|Normal", pos = 3, cex = 0.8)
text(2.5, 25, "Normal|Overweight", pos = 3, cex = 0.8)
text(2.5, 30, "Overweight|Obese", pos = 3, cex = 0.8)##
## === MULTI-ARGUMENT FUNCTION DEMONSTRATION ===
##
## Grade Calculation for 4 Students:
for(i in 1:length(students)) {
cat(" ", students[i], ": HW=", hw_scores[i], ", Mid=", mid_scores[i],
", Final=", final_scores[i], " → Weighted=", round(final_grades[i], 1), "\n", sep="")
}
## Alice: HW=85, Mid=88, Final=90 → Weighted=87.9
## Bob: HW=90, Mid=85, Final=88 → Weighted=87.7
## Carol: HW=78, Mid=92, Final=85 → Weighted=85
## David: HW=95, Mid=90, Final=92 → Weighted=92.3
##
## Temperature Conversions:
## 0°F = -17.8 °C
## 32°F = 0 °C
## 100°F = 37.8 °C
##
## BMI Calculations:
for(i in 1:length(people)) {
result <- bmi_calculator(weights[i], heights[i])
cat(" ", people[i], ": BMI=", result$BMI, " (", result$Category, ")\n", sep="")
}
## Person1: BMI=22.9 (Normal)
## Person2: BMI=26.2 (Overweight)
## Person3: BMI=22 (Normal)
## Person4: BMI=32.9 (Obese)
Pro Tip on Default Arguments: Default values make functions easier to use. Users can call greet() without arguments for standard behavior, but can customize by calling greet("Bob") when needed. This is why many R functions have sensible defaults!
In Plain English: Now let’s see functions “in the wild” - real statistical calculations you’ll use throughout this course. Once you write these functions, you can reuse them in every assignment!
# Calculate z-score
z_score <- function(x, mean_val, sd_val) {
(x - mean_val) / sd_val
}
scores <- c(85, 92, 78, 90, 88)
z_score(85, mean(scores), sd(scores))
## [1] -0.2930973
# Convert Fahrenheit to Celsius
f_to_c <- function(temp_f) {
temp_c <- (temp_f - 32) * 5/9
return(temp_c)
}
f_to_c(98.6)
## [1] 37
## [1] 0
## [1] 100
# Function that returns multiple values (using list)
summary_stats <- function(x) {
list(
mean = mean(x, na.rm = TRUE),
median = median(x, na.rm = TRUE),
sd = sd(x, na.rm = TRUE),
min = min(x, na.rm = TRUE),
max = max(x, na.rm = TRUE)
)
}
summary_stats(scores)
## $mean
## [1] 86.6
##
## $median
## [1] 88
##
## $sd
## [1] 5.458938
##
## $min
## [1] 78
##
## $max
## [1] 92
Visualizing Practical Functions:
par(mfrow = c(2, 2))
# 1. Z-score transformation
test_scores <- c(65, 75, 80, 85, 90, 95, 100)
z_scores <- (test_scores - mean(test_scores)) / sd(test_scores)
plot(test_scores, z_scores,
type = "b",
pch = 19,
col = "blue",
main = "Z-Score Transformation",
xlab = "Original Score",
ylab = "Z-Score (standard deviations)",
cex = 1.5)
abline(h = 0, col = "red", lty = 2, lwd = 2)
abline(h = c(-1, 1), col = "orange", lty = 3)
text(70, 0.2, "Mean (z=0)", col = "red")
text(70, 1.2, "+1 SD", col = "orange")
text(70, -1.2, "-1 SD", col = "orange")
grid()
# 2. Temperature conversion function
temps_F <- seq(0, 100, 10)
temps_C <- sapply(temps_F, f_to_c)
plot(temps_F, temps_C,
type = "b",
pch = 19,
col = "red",
main = "Fahrenheit to Celsius Conversion",
xlab = "Temperature (°F)",
ylab = "Temperature (°C)",
cex = 1.5,
lwd = 2)
abline(h = seq(-20, 40, 10), col = "gray", lty = 3)
abline(v = seq(0, 100, 20), col = "gray", lty = 3)
points(32, 0, pch = 19, col = "blue", cex = 2)
text(32, 0, " Freezing Point", pos = 4, col = "blue")
points(98.6, f_to_c(98.6), pch = 19, col = "darkgreen", cex = 2)
text(98.6, f_to_c(98.6), " Body Temp", pos = 4, col = "darkgreen")
# 3. Summary statistics function visualization
example_data <- rnorm(100, mean = 75, sd = 10)
stats <- summary_stats(example_data)
hist(example_data,
main = "Data Distribution with Summary Stats",
xlab = "Value",
col = "lightblue",
breaks = 20)
abline(v = stats$mean, col = "red", lwd = 3, lty = 1)
abline(v = stats$median, col = "blue", lwd = 3, lty = 2)
abline(v = stats$min, col = "gray", lwd = 2, lty = 3)
abline(v = stats$max, col = "gray", lwd = 2, lty = 3)
legend("topright",
legend = c(paste("Mean =", round(stats$mean, 1)),
paste("Median =", round(stats$median, 1)),
"Min/Max"),
col = c("red", "blue", "gray"),
lty = c(1, 2, 3),
lwd = c(3, 3, 2))
# 4. Custom grading function
grade_letter <- function(score) {
if(score >= 90) return("A")
else if(score >= 80) return("B")
else if(score >= 70) return("C")
else if(score >= 60) return("D")
else return("F")
}
student_scores <- c(95, 88, 76, 65, 92, 58, 82, 71, 85, 79)
letter_grades <- sapply(student_scores, grade_letter)
grade_table <- table(letter_grades)
barplot(grade_table,
main = "Letter Grade Distribution",
xlab = "Grade",
ylab = "Number of Students",
col = c("gold", "lightgreen", "lightblue", "orange", "red"),
ylim = c(0, max(grade_table) + 1))
text(x = 1:length(grade_table) * 1.2 - 0.5,
y = grade_table + 0.3,
labels = grade_table,
font = 2)
##
## === PRACTICAL FUNCTION EXAMPLES ===
##
## 1. Z-Score Transformation:
## Original scores: 85 92 78 90 88
## Mean: 86.6 , SD: 5.46
## Z-scores:
for(i in 1:length(scores)) {
z <- z_score(scores[i], mean(scores), sd(scores))
cat(" Score", scores[i], "→ z =", round(z, 2), "\n")
}
## Score 85 → z = -0.29
## Score 92 → z = 0.99
## Score 78 → z = -1.58
## Score 90 → z = 0.62
## Score 88 → z = 0.26
##
## 2. Temperature Conversions:
important_temps <- c(32, 98.6, 212)
for(temp in important_temps) {
cat(" ", temp, "°F = ", round(f_to_c(temp), 1), "°C\n", sep="")
}
## 32°F = 0°C
## 98.6°F = 37°C
## 212°F = 100°C
##
## 3. Summary Statistics Function:
test_data <- c(85, 92, 78, 90, 88, 95, 82, 87)
stats_result <- summary_stats(test_data)
cat(" Data:", test_data, "\n")## Data: 85 92 78 90 88 95 82 87
## Mean: 87.12
## Median: 87.5
## SD: 5.46
## Range: 78 to 95
##
## 4. Letter Grade Conversion:
## Scores: 95 88 76 65 92 58 82 71 85 79
## Grades: A B C D A F B C B C
## Distribution: A = 2, B = 3, C = 3, D = 1, F = 1
Real-World Application: These aren’t toy examples! The z_score() function is used to standardize test scores, the summary_stats() function appears in every data analysis, and custom grading functions save instructors hours of work. Functions turn repetitive tasks into one-line commands!
---
# Part 9: Combining Data
**In Plain English:** Real data often comes in pieces - one file has customer names, another has their purchases, a third has demographics. You need to combine these pieces like assembling a puzzle. R provides tools to stack data (add more rows), widen data (add more columns), and merge data (match by common IDs).
## 9.1 Combining Vectors and Data Frames
``` r
# Combine vectors as columns
x <- c(1, 2, 3)
y <- c(4, 5, 6)
cbind(x, y)
## x y
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [,1] [,2] [,3]
## x 1 2 3
## y 4 5 6
# Combine data frames by columns
df1 <- data.frame(ID = 1:3, Name = c("A", "B", "C"))
df2 <- data.frame(Score = c(85, 90, 78))
combined_cols <- cbind(df1, df2)
print(combined_cols)
## ID Name Score
## 1 1 A 85
## 2 2 B 90
## 3 3 C 78
# Combine data frames by rows (must have same columns)
df3 <- data.frame(ID = 4:5, Name = c("D", "E"), Score = c(92, 88))
combined_rows <- rbind(combined_cols, df3)
print(combined_rows)
## ID Name Score
## 1 1 A 85
## 2 2 B 90
## 3 3 C 78
## 4 4 D 92
## 5 5 E 88
```
Visualizing Data Combination:
par(mfrow = c(2, 2))
# 1. cbind: Side-by-side combination
vec1 <- c(10, 20, 30)
vec2 <- c(15, 25, 35)
combined_cols_demo <- cbind(vec1, vec2)
barplot(t(combined_cols_demo),
beside = FALSE,
main = "cbind(): Combines as Columns\n(Side-by-Side)",
ylab = "Value",
names.arg = c("Row1", "Row2", "Row3"),
col = c("lightblue", "salmon"),
legend.text = c("vec1", "vec2"),
args.legend = list(x = "topleft"))
# 2. rbind: Stacking combination
barplot(rbind(vec1, vec2),
beside = TRUE,
main = "rbind(): Combines as Rows\n(Stacked)",
ylab = "Value",
names.arg = c("Col1", "Col2", "Col3"),
col = c("lightblue", "salmon"),
legend.text = c("vec1", "vec2"),
args.legend = list(x = "topleft"))
# 3. Combining data frames by columns
df_demo1 <- data.frame(
Student = c("Alice", "Bob", "Carol"),
Age = c(20, 22, 21)
)
df_demo2 <- data.frame(
Grade = c(85, 92, 78),
PassFail = c("Pass", "Pass", "Fail")
)
# Show before
plot(1:2, 1:2, type = "n", xlim = c(0, 10), ylim = c(0, 10),
axes = FALSE, xlab = "", ylab = "",
main = "cbind(): Merging Data Frames by Columns")
rect(1, 4, 4, 9, col = "lightblue", border = "black", lwd = 2)
text(2.5, 7.5, "df1\nStudent\nAge", cex = 1.2, font = 2)
text(5, 6.5, "+", cex = 3, font = 2, col = "red")
rect(6, 4, 9, 9, col = "salmon", border = "black", lwd = 2)
text(7.5, 7.5, "df2\nGrade\nPassFail", cex = 1.2, font = 2)
arrows(4.5, 2.5, 4.5, 3.8, lwd = 3, col = "darkgreen")
rect(2, 0.5, 8, 2, col = "lightgreen", border = "black", lwd = 2)
text(5, 1.25, "Result: df1 + df2\n(All Columns)", cex = 1, font = 2)
# 4. Combining data frames by rows
plot(1:2, 1:2, type = "n", xlim = c(0, 10), ylim = c(0, 10),
axes = FALSE, xlab = "", ylab = "",
main = "rbind(): Stacking Data Frames by Rows")
rect(2, 7, 8, 9, col = "lightblue", border = "black", lwd = 2)
text(5, 8, "df1 (3 rows)", cex = 1.2, font = 2)
text(5, 6, "+", cex = 3, font = 2, col = "red")
rect(2, 4, 8, 5.5, col = "salmon", border = "black", lwd = 2)
text(5, 4.75, "df2 (2 more rows)", cex = 1.2, font = 2)
arrows(5, 3.2, 5, 3.8, lwd = 3, col = "darkgreen")
rect(2, 0.5, 8, 2.5, col = "lightgreen", border = "black", lwd = 2)
text(5, 1.5, "Result: df1 + df2\n(Total: 5 rows)", cex = 1, font = 2)
##
## === DATA COMBINATION DEMONSTRATION ===
##
## 1. cbind() Example (Column-wise):
## vec1: 10 20 30
## vec2: 15 25 35
## Combined:
## vec1 vec2
## [1,] 10 15
## [2,] 20 25
## [3,] 30 35
##
## 2. rbind() Example (Row-wise):
## vec1: 10 20 30
## vec2: 15 25 35
## Combined:
## [,1] [,2] [,3]
## vec1 10 20 30
## vec2 15 25 35
##
## 3. Data Frame Column Combination:
## df1 dimensions: 3 rows × 2 columns
## df2 dimensions: 3 rows × 2 columns
combined_demo <- cbind(df_demo1, df_demo2)
cat(" Combined dimensions:", nrow(combined_demo), "rows ×", ncol(combined_demo), "columns\n")## Combined dimensions: 3 rows × 4 columns
## Student Age Grade PassFail
## 1 Alice 20 85 Pass
## 2 Bob 22 92 Pass
## 3 Carol 21 78 Fail
Important Rule: For cbind(), data frames must have the same number of rows. For rbind(), data frames must have the same columns (same names and types). If these don’t match, R will throw an error!
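The rules above cover stacking with cbind()/rbind(); when you instead need to match rows by a shared ID (as in the puzzle analogy earlier), base R's merge() performs the join. The sample data here is made up for illustration:

``` r
# Toy tables sharing an ID column
customers <- data.frame(ID = c(1, 2, 3),
                        Name = c("Alice", "Bob", "Carol"))
purchases <- data.frame(ID = c(1, 3, 3),
                        Amount = c(25, 40, 15))

# Inner join: only IDs present in BOTH tables (Bob drops out)
merge(customers, purchases, by = "ID")

# Left join: keep every customer; Amount is NA where no purchase exists
merge(customers, purchases, by = "ID", all.x = TRUE)
```

Note that a customer with several purchases (Carol, ID 3) appears once per matching row, so the result can have more rows than either input.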
In Plain English: Sometimes you don’t want to combine two separate data frames - you just want to add ONE new column or ONE new row to your existing data. It’s like adding a new student to your class roster, or adding a “Final Grade” column to an existing gradebook.
# Add column to data frame
students <- data.frame(
Name = c("Alice", "Bob", "Carol"),
Age = c(20, 22, 21)
)
students$Grade <- c(85, 92, 78)
print(students)
## Name Age Grade
## 1 Alice 20 85
## 2 Bob 22 92
## 3 Carol 21 78
# Add row
new_student <- data.frame(Name = "David", Age = 23, Grade = 88)
students <- rbind(students, new_student)
print(students)
## Name Age Grade
## 1 Alice 20 85
## 2 Bob 22 92
## 3 Carol 21 78
## 4 David 23 88
Visualizing Adding Data:
par(mfrow = c(2, 2))
# 1. Show step-by-step column addition
# Start with basic data
class_start <- data.frame(
Name = c("Alice", "Bob", "Carol", "David"),
Age = c(20, 22, 21, 23)
)
# Add Homework column
class_with_hw <- class_start
class_with_hw$Homework <- c(85, 90, 78, 88)
# Add Exam column
class_complete <- class_with_hw
class_complete$Exam <- c(88, 85, 92, 90)
barplot(ncol(class_start):ncol(class_complete),
names.arg = c("Start\n(2 cols)", "Add HW\n(3 cols)", "Add Exam\n(4 cols)"),
main = "Adding Columns Sequentially",
ylab = "Number of Columns",
col = c("lightblue", "lightgreen", "salmon"),
ylim = c(0, 5))
text(x = 1:3 * 1.2 - 0.5,
y = ncol(class_start):ncol(class_complete) + 0.2,
labels = ncol(class_start):ncol(class_complete),
font = 2)
# 2. Show step-by-step row addition
class_3students <- data.frame(
Name = c("Alice", "Bob", "Carol"),
Grade = c(85, 90, 78)
)
class_4students <- class_3students
class_4students <- rbind(class_4students,
data.frame(Name = "David", Grade = 88))
class_5students <- class_4students
class_5students <- rbind(class_5students,
data.frame(Name = "Eve", Grade = 95))
barplot(c(nrow(class_3students), nrow(class_4students), nrow(class_5students)),
names.arg = c("Original\n(3 students)", "Add 1\n(4 students)", "Add Another\n(5 students)"),
main = "Adding Rows Sequentially",
ylab = "Number of Rows (Students)",
col = c("lightblue", "lightgreen", "salmon"),
ylim = c(0, 6))
text(x = 1:3 * 1.2 - 0.5,
y = c(nrow(class_3students), nrow(class_4students), nrow(class_5students)) + 0.2,
labels = c(nrow(class_3students), nrow(class_4students), nrow(class_5students)),
font = 2)
# 3. Visualize grade data before and after adding column
scores_before <- class_start$Age # Only have Age
scores_after <- class_complete$Homework # Now have Homework too
barplot(rbind(scores_before, scores_after),
beside = TRUE,
names.arg = class_complete$Name,
main = "Before & After Adding Homework Column",
ylab = "Value",
col = c("lightblue", "salmon"),
legend.text = c("Age (Original)", "Homework (Added)"),
args.legend = list(x = "topleft"))
# 4. Visualize student data before and after adding rows
original_grades <- c(85, 90, 78)
after_additions <- c(85, 90, 78, 88, 95)
plot(1:length(original_grades), original_grades,
type = "b",
pch = 19,
col = "blue",
main = "Dataset Growth: Adding Rows",
xlab = "Student Number",
ylab = "Grade",
xlim = c(1, 5),
ylim = c(70, 100),
cex = 2)
points((length(original_grades)+1):length(after_additions),
after_additions[(length(original_grades)+1):length(after_additions)],
pch = 19,
col = "red",
cex = 2)
lines(1:length(after_additions), after_additions, col = "gray", lty = 2)
legend("bottomright",
legend = c("Original Students", "Added Students"),
col = c("blue", "red"),
pch = 19,
cex = 1.2)
##
## === ADDING DATA DEMONSTRATION ===
##
## 1. Adding Columns:
## Original data (Name, Age):
## Name Age
## 1 Alice 20
## 2 Bob 22
## 3 Carol 21
## 4 David 23
##
## After adding Homework and Exam columns:
## Name Age Homework Exam
## 1 Alice 20 85 88
## 2 Bob 22 90 85
## 3 Carol 21 78 92
## 4 David 23 88 90
##
## 2. Adding Rows:
## Original data (3 students):
## Name Grade
## 1 Alice 85
## 2 Bob 90
## 3 Carol 78
##
## After adding 2 more students:
## Name Grade
## 1 Alice 85
## 2 Bob 90
## 3 Carol 78
## 4 David 88
## 5 Eve 95
##
## 3. Tracking Growth:
## Started with: 3 rows × 2 columns
## Ended with: 5 rows × 4 columns
cat(" Added:", nrow(class_5students) - nrow(class_3students), "rows and",
ncol(class_complete) - ncol(class_start), "columns\n")
## Added: 2 rows and 2 columns
Pro Tip on Growing Data: In R, repeatedly adding rows with rbind() in a loop is SLOW for large datasets (R creates a new copy each time). If you’re adding many rows, it’s better to create a list of data frames and combine them all at once using do.call(rbind, my_list). But for small datasets in Week 1, don’t worry about this optimization!
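The pattern from the tip can be sketched like this (pieces and the toy id/value columns are made up for illustration):

```r
# Build each chunk as a small data frame and store it in a list...
pieces <- lapply(1:3, function(i) {
  data.frame(id = i, value = i * 10)
})
# ...then combine everything with ONE rbind call instead of many
all_rows <- do.call(rbind, pieces)
nrow(all_rows)  # 3
```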
In Plain English: Now let’s apply EVERYTHING you’ve learned to a real dataset! The Palmer Penguins dataset contains measurements of 344 penguins from 3 species in Antarctica. This is REAL scientific data - not made-up numbers. You’ll see how all the concepts (vectors, data frames, filtering, summarizing, visualizing) work together in actual data analysis.
Let’s put everything together with a real dataset:
# Install and load palmerpenguins package
# install.packages("palmerpenguins")
library(palmerpenguins)
# Load the penguins dataset
data(penguins)
# Explore the data
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
summary(penguins)
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
table(penguins$species)
##
## Adelie Chinstrap Gentoo
## 152 68 124
# Average bill length by species
aggregate(bill_length_mm ~ species, data = penguins, FUN = mean, na.rm = TRUE)
# Penguins with bill length > 50mm
long_bills <- subset(penguins, bill_length_mm > 50)
nrow(long_bills)
## [1] 52
# Create a new variable: bill aspect ratio
penguins$bill_ratio <- penguins$bill_length_mm / penguins$bill_depth_mm
head(penguins$bill_ratio)
## [1] 2.090909 2.270115 2.238889 NA 1.901554 1.907767
# Summary of body mass by sex
aggregate(body_mass_g ~ sex, data = penguins, FUN = mean, na.rm = TRUE)
Comprehensive Penguin Data Visualization:
# Remove NAs for cleaner visualizations
penguins_clean <- na.omit(penguins)
par(mfrow = c(3, 3))
# 1. Species distribution
species_counts <- table(penguins$species)
barplot(species_counts,
main = "Penguin Species Distribution",
ylab = "Count",
col = c("darkorange", "purple", "cyan4"),
ylim = c(0, max(species_counts) + 20))
text(x = 1:3 * 1.2 - 0.5,
y = species_counts + 5,
labels = species_counts,
font = 2)
# 2. Bill length by species
boxplot(bill_length_mm ~ species, data = penguins_clean,
main = "Bill Length by Species",
xlab = "Species",
ylab = "Bill Length (mm)",
col = c("darkorange", "purple", "cyan4"))
# 3. Bill depth by species
boxplot(bill_depth_mm ~ species, data = penguins_clean,
main = "Bill Depth by Species",
xlab = "Species",
ylab = "Bill Depth (mm)",
col = c("darkorange", "purple", "cyan4"))
# 4. Body mass distribution
hist(penguins_clean$body_mass_g,
main = "Body Mass Distribution",
xlab = "Body Mass (g)",
col = "lightblue",
breaks = 20)
abline(v = mean(penguins_clean$body_mass_g), col = "red", lwd = 2, lty = 2)
# 5. Flipper length vs Body mass (scatter)
plot(penguins_clean$body_mass_g, penguins_clean$flipper_length_mm,
main = "Body Mass vs Flipper Length",
xlab = "Body Mass (g)",
ylab = "Flipper Length (mm)",
pch = 19,
col = as.numeric(penguins_clean$species))
legend("bottomright",
legend = levels(penguins_clean$species),
col = 1:3,
pch = 19,
cex = 0.8)
# 6. Bill dimensions scatter plot
plot(penguins_clean$bill_length_mm, penguins_clean$bill_depth_mm,
main = "Bill Length vs Depth",
xlab = "Bill Length (mm)",
ylab = "Bill Depth (mm)",
pch = 19,
col = c("darkorange", "purple", "cyan4")[as.numeric(penguins_clean$species)])
legend("topright",
legend = levels(penguins_clean$species),
col = c("darkorange", "purple", "cyan4"),
pch = 19,
cex = 0.8)
# 7. Body mass by species and sex
aggregate_mass <- aggregate(body_mass_g ~ species + sex,
data = penguins_clean,
FUN = mean)
# aggregate() returns rows ordered species-within-sex, so fill the matrix by row
mass_matrix <- matrix(aggregate_mass$body_mass_g, nrow = 2, ncol = 3, byrow = TRUE)
colnames(mass_matrix) <- levels(penguins_clean$species)
rownames(mass_matrix) <- c("Female", "Male")
barplot(mass_matrix,
beside = TRUE,
main = "Body Mass by Species & Sex",
ylab = "Body Mass (g)",
col = c("pink", "lightblue"),
legend.text = TRUE,
args.legend = list(x = "topleft"))
# 8. Island distribution
island_counts <- table(penguins$island)
barplot(island_counts,
main = "Penguins by Island",
ylab = "Count",
col = terrain.colors(3),
las = 2)
# 9. Flipper length by species
boxplot(flipper_length_mm ~ species, data = penguins_clean,
main = "Flipper Length by Species",
xlab = "Species",
ylab = "Flipper Length (mm)",
col = c("darkorange", "purple", "cyan4"))
par(mfrow = c(1, 1))
# Create comprehensive summary table
cat("\n=== PALMER PENGUINS DATA ANALYSIS ===\n")
##
## === PALMER PENGUINS DATA ANALYSIS ===
##
## Dataset Overview:
## Total penguins: 344
## Complete cases: 333
## Number of species: 3
## Number of islands: 3
## Years covered: 2007 to 2009
##
## Species Counts:
## Adelie : 152
## Chinstrap : 68
## Gentoo : 124
##
## Average Measurements by Species:
for(sp in levels(penguins_clean$species)) {
subset_sp <- subset(penguins_clean, species == sp)
cat("\n ", sp, "Penguins:\n")
cat(" Bill Length:", round(mean(subset_sp$bill_length_mm), 1), "mm\n")
cat(" Bill Depth:", round(mean(subset_sp$bill_depth_mm), 1), "mm\n")
cat(" Flipper Length:", round(mean(subset_sp$flipper_length_mm), 1), "mm\n")
cat(" Body Mass:", round(mean(subset_sp$body_mass_g), 0), "g\n")
}
##
## Adelie Penguins:
## Bill Length: 38.8 mm
## Bill Depth: 18.3 mm
## Flipper Length: 190.1 mm
## Body Mass: 3706 g
##
## Chinstrap Penguins:
## Bill Length: 48.8 mm
## Bill Depth: 18.4 mm
## Flipper Length: 195.8 mm
## Body Mass: 3733 g
##
## Gentoo Penguins:
## Bill Length: 47.6 mm
## Bill Depth: 15 mm
## Flipper Length: 217.2 mm
## Body Mass: 5092 g
##
## Sex Differences (Overall):
sex_comparison <- aggregate(body_mass_g ~ sex, data = penguins_clean, FUN = mean)
for(i in 1:nrow(sex_comparison)) {
cat(" ", as.character(sex_comparison$sex[i]), ":", round(sex_comparison$body_mass_g[i], 0), "g\n")
}
## female : 3862 g
## male : 4546 g
##
## Interesting Facts:
## Heaviest penguin: 6300 g
## Lightest penguin: 2700 g
## Longest bill: 59.6 mm
## Longest flippers: 231 mm
What This Demonstrates: This analysis shows you the POWER of R! With just a few commands, you’ve:
1. Loaded real scientific data (344 penguins!)
2. Explored its structure (str(), head(), summary())
3. Filtered subsets (Adelie penguins, long bills)
4. Created new variables (bill ratio)
5. Calculated group statistics (aggregate())
6. Made NINE different visualizations showing patterns
This is exactly what data scientists do every day - and you just did it in Week 1!
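If you want to practice this workflow again, here is a sketch of the same steps on R's built-in iris dataset (sepal_ratio is a made-up variable name, mirroring the bill ratio above):

```r
# Same analysis pattern on a different built-in dataset
data(iris)
str(iris)                                                   # explore the structure
long_sepals <- subset(iris, Sepal.Length > 7)               # filter a subset
iris$sepal_ratio <- iris$Sepal.Length / iris$Sepal.Width    # create a new variable
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)  # group statistics
```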
Key tools from this week:
- <- : assignment operator
- read.csv() or rio::import() : import data
- head(), str(), summary() : inspect data
- [ ], $, or subset() : select and filter data
- is.na() : detect missing values
- na.omit() or na.rm = TRUE : handle missing values
- ?function_name : get help
Week 2 will build on these fundamentals:
Lab: 01_lab_R_learning.Rmd (30 exercises)
Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. SAGE Publications Ltd., London.
R Core Team (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis (2nd ed.). Springer.
Horst, A. M., Hill, A. P., & Gorman, K. B. (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/
# Data manipulation
install.packages("dplyr")
install.packages("tidyr")
# Visualization
install.packages("ggplot2")
# Data import
install.packages("rio")
install.packages("readr")
install.packages("haven") # For SPSS, Stata, SAS
# Reporting
install.packages("rmarkdown")
install.packages("knitr")
# Practice datasets
install.packages("palmerpenguins")
# datasets is built into base R - no installation needed
Quick reference:
- <- : assignment operator
- ?function_name or help(function_name) : get help
- # : add comments to your code
Remember: R has a steep learning curve at first, but it gets easier with practice. You’re building a valuable skill that will serve you throughout your analytics career!
Document generated on March 11, 2026