Machine: MELAW - Windows 10 x64

Introduction

1. Welcome to R!

Welcome to your first week learning R, a powerful programming language for data analysis and statistical computing. This course is designed to help you become comfortable with R even if you’ve never programmed before. We’ll start with the fundamentals and build your skills step by step.

Key term – R: R is both a programming language and an environment for statistical computing and graphics. Originally developed by Ross Ihaka and Robert Gentleman at the University of Auckland, R has become one of the most widely used tools in data analytics (R Core Team, 2024).

2. Learning Objectives

By the end of this week, you will be able to:

  1. Understand what R is and why it’s valuable for data analysis
  2. Navigate the RStudio interface confidently
  3. Create and work with different types of data (numbers, text, categories)
  4. Store and organize data using vectors, matrices, data frames, and lists
  5. Perform calculations and comparisons
  6. Load data from CSV files
  7. Access and subset specific parts of your data
  8. Handle missing data appropriately
  9. Use built-in R functions for analysis
  10. Write your own simple functions

3. What is R and Why Use It?

R offers several advantages for data analytics:

  • Free and open-source: No licensing costs, ever
  • Comprehensive: Over 19,000 packages for specialized analyses
  • Reproducible: Your analysis can be easily shared and repeated
  • Visualization: Publication-quality graphics with ggplot2
  • Community: Extensive online support and resources
  • Integration: Works with databases, web APIs, and other languages

Key concept – Reproducible Research: Field, Miles, and Field (2012, ch. 3) emphasize that reproducible research allows others to verify your findings and build upon your work. Using R scripts instead of point-and-click software creates an auditable trail of your analytical decisions.


Part 1: Getting Started with R

1.1 Installing R and RStudio

Installation Steps

  1. Download R from: https://cran.r-project.org/
    • Choose your operating system (Windows, macOS, or Linux)
    • Install R first
  2. Download RStudio from: https://posit.co/downloads/
    • RStudio is an Integrated Development Environment (IDE)
    • Makes working with R much easier
    • Install RStudio after R

The RStudio Interface

RStudio has four main panes:

  1. Source/Script Editor (top-left): Where you write and save your code
  2. Console (bottom-left): Where code executes and results appear
  3. Environment/History (top-right): Shows objects you’ve created
  4. Files/Plots/Help (bottom-right): Navigation and visualization
# You can type directly in the Console:
2 + 2

# Or write code in the Script Editor and run it (Ctrl+Enter or Cmd+Return):
x <- 10
print(x)

1.2 R as a Calculator

In Plain English: Think of R as a super-powered calculator. Before we analyze data, let’s see how R handles basic math - the same operations you’d use on a regular calculator, just typed as text.

R can perform all basic arithmetic operations:

# Addition
5 + 3
## [1] 8
# Subtraction
10 - 4
## [1] 6
# Multiplication
6 * 7
## [1] 42
# Division
20 / 4
## [1] 5
# Exponentiation (power)
2^8
## [1] 256
# Modulo (remainder after division)
17 %% 5
## [1] 2
# Order of operations (PEMDAS applies)
(5 + 3) * 2^2
## [1] 32

Why This Matters: Understanding these basic operations is essential because all statistical calculations (means, standard deviations, correlations) use these same mathematical operations behind the scenes.

Key concept – Order of Operations: Like in mathematics, R follows PEMDAS (Parentheses, Exponents, Multiplication/Division, Addition/Subtraction). Use parentheses to control calculation order (Field et al., 2012, ch. 3).

1.3 Creating and Managing Objects

In Plain English: Instead of getting results that disappear, we can save them with names (called “objects” or “variables”). Think of it like putting your calculator result in a labeled box so you can use it later. The arrow <- means “put this value into that box.”

In R, we store values in objects using the assignment operator <-:

# Create objects
A <- 5
B <- 10
name <- "R Programming"

# Use objects in calculations
A + B
## [1] 15
A * B
## [1] 50
# Check what's in your workspace
ls()
## [1] "A"    "B"    "name"
# View specific object
print(A)
## [1] 5
print(name)
## [1] "R Programming"

Visual Example: Let’s see how objects work with a simple visual:

# Create some objects
apples <- 5
oranges <- 3

# Combine them
total_fruit <- apples + oranges

cat("Apples:", apples, "\n")
## Apples: 5
cat("Oranges:", oranges, "\n")
## Oranges: 3
cat("Total fruit:", total_fruit, "\n")
## Total fruit: 8
# Create a simple bar chart to visualize
barplot(c(apples, oranges, total_fruit),
        names.arg = c("Apples", "Oranges", "Total"),
        col = c("red", "orange", "purple"),
        main = "Fruit Counts (Stored in Objects)",
        ylab = "Number of Items",
        ylim = c(0, 10))

Naming Conventions

Follow these rules for naming objects:

  • ✅ Use letters, numbers, dots (.), and underscores (_)
  • ✅ Start with a letter (not a number)
  • ✅ Be descriptive: student_age is better than x
  • ✅ Use consistent style: snake_case or camelCase
  • ❌ Don’t use spaces
  • ❌ Don’t use special characters (@, #, $, etc.)
  • ⚠️ R is case-sensitive: data and Data are different!
# Good names
student_age <- 20
total_score <- 95
mean_temperature <- 72.5

# Avoid these
x <- 20              # Not descriptive
my.variable <- 10    # Dots are discouraged in modern R
2ndScore <- 85       # Can't start with number (ERROR!)

Removing Objects

# Remove a specific object
temp_value <- 100
rm(temp_value)

# Remove multiple objects
x <- 1
y <- 2
z <- 3
rm(x, y, z)

# Remove all objects (use with caution!)
# rm(list = ls())

Part 2: Data Types in R

In Plain English: Just like we categorize things in real life (counting numbers,names, yes/no questions), R recognizes different types of data. The type tells R how to handle and display the data. Getting the type right is crucial - you can’t calculate an average of names or sort numbers alphabetically (well, you can, but it won’t make sense!).

2.1 Basic Data Types

R recognizes several fundamental data types:

# Numeric (numbers)
age <- 25
height <- 175.5
class(age)          # Check data type
## [1] "numeric"
class(height)
## [1] "numeric"
# Character (text)
name <- "John Doe"
species <- "Adelie"
class(name)
## [1] "character"
# Logical (TRUE/FALSE)
is_student <- TRUE
has_passed <- FALSE
class(is_student)
## [1] "logical"
# Integer (whole numbers)
count <- 42L        # The 'L' specifies integer
class(count)
## [1] "integer"

Visual Comparison of Data Types:

# Create examples of each type
my_number <- 42.5
my_text <- "Hello"
my_logical <- TRUE
my_integer <- 100L

# Display them in a comparison table
comparison_data <- data.frame(
  Type = c("Numeric", "Character", "Logical", "Integer"),
  Example = c("42.5", '"Hello"', "TRUE", "100L"),
  UsedFor = c("Measurements", "Names/Labels", "Yes/No", "Counts"),
  CanDoMath = c("Yes", "No", "Sort of", "Yes")
)

knitr::kable(comparison_data, 
             caption = "Understanding R's Basic Data Types",
             align = c('l', 'l', 'l', 'c'))
Understanding R’s Basic Data Types
Type Example UsedFor CanDoMath
Numeric 42.5 Measurements Yes
Character “Hello” Names/Labels No
Logical TRUE Yes/No Sort of
Integer 100L Counts Yes

Why Data Types Matter:

# This works (both numeric)
5 + 3
## [1] 8
# This doesn't work as expected (mixing types)
tryCatch({
  "5" + 3  # Text "5" plus number 3
}, error = function(e) {
  cat("ERROR:", conditionMessage(e), "\n\n")
})
## ERROR: non-numeric argument to binary operator
# You must convert first
as.numeric("5") + 3  # Convert text to number, then add
## [1] 8
# Visual: Show what happens with wrong types
par(mfrow = c(1, 2))

# Correct: Numeric data in histogram
correct_data <- c(5, 10, 15, 20, 25)
hist(correct_data, 
     main = "Correct: Numeric Data",
     xlab = "Values",
     col = "lightgreen",
     border = "white")

# Wrong type creates problems
mixed_data <- c("5", "10", "15", "20", "25")
cat("Trying to plot text data doesn't work well:\n")
## Trying to plot text data doesn't work well:
cat("Data type:", class(mixed_data), "\n")
## Data type: character
par(mfrow = c(1, 1))

Type Conversion

You can convert between types using as.*() functions:

# Numeric to character
num <- 42
char_num <- as.character(num)
class(char_num)
## [1] "character"
# Character to numeric
text_num <- "123"
converted <- as.numeric(text_num)
class(converted)
## [1] "numeric"
# Numeric to logical
as.logical(1)       # TRUE
## [1] TRUE
as.logical(0)       # FALSE
## [1] FALSE

2.2 Special Values

R has special values for unusual situations:

# NA - Missing/Not Available
missing_value <- NA
is.na(missing_value)
## [1] TRUE
# NULL - Empty/Undefined
empty_value <- NULL
is.null(empty_value)
## [1] TRUE
# Inf - Infinity
positive_inf <- 1/0
negative_inf <- -1/0
print(positive_inf)
## [1] Inf
# NaN - Not a Number (undefined mathematical operations)
undefined <- 0/0
is.nan(undefined)
## [1] TRUE

2.3 Factors

Factors represent categorical data:

# Create a factor
species <- factor(c("Adelie", "Gentoo", "Chinstrap", "Adelie", "Gentoo"))
print(species)
## [1] Adelie    Gentoo    Chinstrap Adelie    Gentoo   
## Levels: Adelie Chinstrap Gentoo
# Check levels
levels(species)
## [1] "Adelie"    "Chinstrap" "Gentoo"
# Specify level order (useful for ordinal data)
satisfaction <- factor(
  c("Low", "High", "Medium", "Low", "High"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)
print(satisfaction)
## [1] Low    High   Medium Low    High  
## Levels: Low < Medium < High

Key concept – Factors: Field, Miles, and Field (2012, ch. 3) explain that factors are essential for categorical variables in R. Unlike character strings, factors store levels, which is memory-efficient and necessary for many statistical models.


Part 3: Data Structures

In Plain English: Data structures are different ways to organize information, like different types of containers. You wouldn’t store soup in a paper bag or carry rocks in a colander - similarly, we choose the right data structure for our data. Vectors are like a row of boxes, matrices are like spreadsheets, data frames are like tables where columns can hold different kinds of things, and lists are like filing cabinets that can hold anything.

3.1 Vectors

What is a Vector? A vector is the simplest data structure - think of it as a single row or column of values, where all values are the same type (all numbers, or all text, etc.).

Vectors are collections of elements of the same type:

# Numeric vector
ages <- c(23, 25, 31, 28, 26)
print(ages)
## [1] 23 25 31 28 26
# Character vector
names <- c("Alice", "Bob", "Carol", "David", "Eve")
print(names)
## [1] "Alice" "Bob"   "Carol" "David" "Eve"
# Logical vector
passed <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
print(passed)
## [1]  TRUE  TRUE FALSE  TRUE FALSE
# Sequences
seq1 <- 1:10                    # 1 to 10
seq2 <- seq(0, 100, by = 10)    # 0 to 100 by 10
rep1 <- rep(5, times = 3)       # Repeat 5 three times

print(seq1)
##  [1]  1  2  3  4  5  6  7  8  9 10
print(seq2)
##  [1]   0  10  20  30  40  50  60  70  80  90 100
print(rep1)
## [1] 5 5 5

Visualizing Vectors:

# Create a numeric vector
test_scores <- c(85, 92, 78, 90, 88, 95, 82, 87, 91, 86)

# Show multiple views
par(mfrow = c(2, 2))

# 1. Simple plot
plot(test_scores, 
     type = "b",  # both points and lines
     main = "Test Scores (as a Sequence)",
     xlab = "Student Number",
     ylab = "Score",
     col = "blue",
     pch = 19,
     ylim = c(75, 100))

# 2. Histogram
hist(test_scores,
     main = "Distribution of Scores",
     xlab = "Score",
     col = "lightblue",
     border = "white",
     breaks = 5)

# 3. Boxplot
boxplot(test_scores,
        main = "Score Summary (Boxplot)",
        ylab = "Score",
        col = "lightgreen",
        horizontal = FALSE)

# 4. Barplot
barplot(test_scores,
        names.arg = 1:length(test_scores),
        main = "Individual Student Scores",
        xlab = "Student",
        ylab = "Score",
        col = rainbow(length(test_scores)))

par(mfrow = c(1, 1))

cat("\nSummary Statistics:\n")
## 
## Summary Statistics:
cat("Mean:", mean(test_scores), "\n")
## Mean: 87.4
cat("Median:", median(test_scores), "\n")
## Median: 87.5
cat("Minimum:", min(test_scores), "\n")
## Minimum: 78
cat("Maximum:", max(test_scores), "\n")
## Maximum: 95

Vector Operations

# Element-wise operations
x <- c(1, 2, 3, 4, 5)
y <- c(10, 20, 30, 40, 50)

x + y         # Addition
## [1] 11 22 33 44 55
x * y         # Multiplication
## [1]  10  40  90 160 250
x^2           # Square each element
## [1]  1  4  9 16 25
# Summary statistics
mean(x)
## [1] 3
sum(x)
## [1] 15
sd(x)         # Standard deviation
## [1] 1.581139
length(x)     # Number of elements
## [1] 5

Vector Subsetting

In Plain English: Subsetting means “picking out specific pieces.” Imagine you have a row of 10 boxes numbered 1-10. Subsetting lets you say “give me box 3” or “give me all boxes with values greater than 50.” This is one of the most powerful features in R!

# Create a vector
scores <- c(85, 92, 78, 90, 88)

# Access by position
scores[1]           # First element
## [1] 85
scores[3]           # Third element
## [1] 78
scores[c(1, 3, 5)]  # Multiple elements
## [1] 85 78 88
# Access by condition
scores[scores > 85]          # Scores greater than 85
## [1] 92 90 88
scores[scores >= 85 & scores <= 90]  # Between 85 and 90
## [1] 85 90 88
# Negative indexing (exclude elements)
scores[-2]          # All except second element
## [1] 85 78 90 88

Visualizing Subsetting:

# Create example data
all_scores <- c(72, 85, 92, 78, 90, 88, 95, 82)
student_names <- c("Amy", "Ben", "Chris", "Dana", "Eve", "Frank", "Grace", "Henry")

# Show all scores
par(mfrow = c(2, 2))

# 1. All data
barplot(all_scores,
        names.arg = student_names,
        main = "All Students",
        ylab = "Score",
        col = "lightgray",
        las = 2)  # Rotate labels
abline(h = 85, col = "red", lty = 2, lwd = 2)
text(4, 87, "Pass threshold = 85", col = "red")

# 2. Subset: Scores > 85
high_scores <- all_scores[all_scores > 85]
high_names <- student_names[all_scores > 85]
barplot(high_scores,
        names.arg = high_names,
        main = "Only High Scorers (> 85)",
        ylab = "Score",
        col = "lightgreen",
        las = 2)

# 3. Subset: First 3 students
barplot(all_scores[1:3],
        names.arg = student_names[1:3],
        main = "First 3 Students",
        ylab = "Score",
        col = "lightblue",
        las = 2)

# 4. Subset: Specific positions
selected <- c(2, 5, 7)
barplot(all_scores[selected],
        names.arg = student_names[selected],
        main = "Selected Students (positions 2, 5, 7)",
        ylab = "Score",
        col = "lightyellow",
        las = 2)

par(mfrow = c(1, 1))

3.2 Matrices

In Plain English: A matrix is like a spreadsheet where EVERY cell contains the same type of data (all numbers, for example). Think of it as multiple vectors stacked together in rows and columns. Matrices are great for mathematical operations but limited because you can’t mix types.

Matrices are 2-dimensional arrays with rows and columns:

# Create a matrix
my_matrix <- matrix(
  data = 1:12,
  nrow = 4,
  ncol = 3,
  byrow = FALSE    # Fill by column (default)
)
print(my_matrix)
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
# Matrix by row
matrix_byrow <- matrix(
  data = 1:12,
  nrow = 4,
  ncol = 3,
  byrow = TRUE
)
print(matrix_byrow)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
## [4,]   10   11   12
# Matrix dimensions
dim(my_matrix)
## [1] 4 3
nrow(my_matrix)
## [1] 4
ncol(my_matrix)
## [1] 3

Visualizing Matrix Structure:

# Create a matrix with meaningful data
grades_matrix <- matrix(
  c(85, 90, 78,    # Row 1: Test scores for Student 1
    92, 88, 95,    # Row 2: Test scores for Student 2
    78, 85, 82,    # Row 3: Test scores for Student 3
    90, 87, 91),   # Row 4: Test scores for Student 4
  nrow = 4,
  ncol = 3,
  byrow = TRUE
)

# Add labels
rownames(grades_matrix) <- c("Student1", "Student2", "Student3", "Student4")
colnames(grades_matrix) <- c("Test1", "Test2", "Test3")

print(grades_matrix)
##          Test1 Test2 Test3
## Student1    85    90    78
## Student2    92    88    95
## Student3    78    85    82
## Student4    90    87    91
# Visualize the matrix
par(mfrow = c(1, 2))

# 1. Heatmap view
image(t(grades_matrix),
      main = "Grade Matrix (Heatmap)",
      xlab = "Tests",
      ylab = "Students",
      col = heat.colors(20),
      axes = FALSE)
axis(1, at = seq(0, 1, length.out = 3), labels = colnames(grades_matrix))
axis(2, at = seq(0, 1, length.out = 4), labels = rownames(grades_matrix), las = 2)

# 2. Grouped barplot
barplot(t(grades_matrix),
        beside = TRUE,
        main = "Grades by Student and Test",
        xlab = "Student",
        ylab = "Score",
        col = c("skyblue", "lightgreen", "salmon"),
        legend.text = colnames(grades_matrix),
        args.legend = list(x = "topright", cex = 0.8))

par(mfrow = c(1, 1))

Matrix Operations

# Subsetting matrices
my_matrix[2, 3]      # Row 2, Column 3
## [1] 10
my_matrix[1, ]       # First row, all columns
## [1] 1 5 9
my_matrix[, 2]       # All rows, second column
## [1] 5 6 7 8
my_matrix[1:2, 2:3]  # Rows 1-2, Columns 2-3
##      [,1] [,2]
## [1,]    5    9
## [2,]    6   10
# Matrix arithmetic
mat1 <- matrix(1:4, nrow = 2)
mat2 <- matrix(5:8, nrow = 2)

mat1 + mat2          # Element-wise addition
##      [,1] [,2]
## [1,]    6   10
## [2,]    8   12
mat1 * mat2          # Element-wise multiplication
##      [,1] [,2]
## [1,]    5   21
## [2,]   12   32
mat1 %*% mat2        # Matrix multiplication
##      [,1] [,2]
## [1,]   23   31
## [2,]   34   46
# Transpose
t(mat1)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

3.3 Data Frames

In Plain English: Data frames are the MOST IMPORTANT data structure you’ll use! Think of them like Excel spreadsheets - they have rows and columns, BUT unlike matrices, each column can be a different type (one column can be names, another can be ages, another can be yes/no values). Most real-world data comes in data frames.

Data frames are the most common structure for datasets:

# Create a data frame
students <- data.frame(
  Name = c("Alice", "Bob", "Carol", "David"),
  Age = c(20, 22, 21, 23),
  Grade = c(85, 92, 78, 88),
  Passed = c(TRUE, TRUE, TRUE, TRUE)
)

print(students)
##    Name Age Grade Passed
## 1 Alice  20    85   TRUE
## 2   Bob  22    92   TRUE
## 3 Carol  21    78   TRUE
## 4 David  23    88   TRUE
# View structure
str(students)
## 'data.frame':    4 obs. of  4 variables:
##  $ Name  : chr  "Alice" "Bob" "Carol" "David"
##  $ Age   : num  20 22 21 23
##  $ Grade : num  85 92 78 88
##  $ Passed: logi  TRUE TRUE TRUE TRUE
# View dimensions
dim(students)
## [1] 4 4
nrow(students)
## [1] 4
ncol(students)
## [1] 4
# Column names
names(students)
## [1] "Name"   "Age"    "Grade"  "Passed"
colnames(students)
## [1] "Name"   "Age"    "Grade"  "Passed"

Visualizing Data Frame Concepts:

# Create a more detailed data frame
class_data <- data.frame(
  Student = c("Alice", "Bob", "Carol", "David", "Eve", "Frank"),
  Age = c(20, 22, 21, 23, 20, 22),
  Grade = c(85, 92, 78, 88, 95, 82),
  StudyHours = c(15, 20, 10, 18, 25, 12),
  PassFail = c("Pass", "Pass", "Fail", "Pass", "Pass", "Pass")
)

# Show it as a nice table
knitr::kable(class_data, 
             caption = "Example Data Frame: Student Information",
             align = 'c')
Example Data Frame: Student Information
Student Age Grade StudyHours PassFail
Alice 20 85 15 Pass
Bob 22 92 20 Pass
Carol 21 78 10 Fail
David 23 88 18 Pass
Eve 20 95 25 Pass
Frank 22 82 12 Pass
# Multiple visualizations from ONE data frame
par(mfrow = c(2, 2))

# 1. Grades distribution
hist(class_data$Grade,
     main = "Grade Distribution",
     xlab = "Grade",
     col = "lightblue",
     border = "white",
     breaks = 5)
abline(v = mean(class_data$Grade), col = "red", lwd = 2, lty = 2)
legend("topright", "Mean", col = "red", lty = 2, lwd = 2)

# 2. Study hours vs Grade
plot(class_data$StudyHours, class_data$Grade,
     main = "Study Hours vs Grade",
     xlab = "Study Hours per Week", 
     ylab = "Grade",
     pch = 19,
     col = "darkgreen",
     cex = 1.5)
abline(lm(Grade ~ StudyHours, data = class_data), col = "red", lwd = 2)

# 3. Age distribution
barplot(table(class_data$Age),
        main = "Age Distribution",
        xlab = "Age",
        ylab = "Number of Students",
        col = "salmon")

# 4. Pass/Fail summary
pass_fail_count <- table(class_data$PassFail)
barplot(pass_fail_count,
        main = "Pass/Fail Summary",
        ylab = "Count",
        col = c("red", "green"),
        ylim = c(0, max(pass_fail_count) + 1))

par(mfrow = c(1, 1))

cat("\n=== DATA FRAME SUMMARY ===\n")
## 
## === DATA FRAME SUMMARY ===
cat("Number of students:", nrow(class_data), "\n")
## Number of students: 6
cat("Number of variables:", ncol(class_data), "\n")
## Number of variables: 5
cat("Average grade:", round(mean(class_data$Grade), 1), "\n")
## Average grade: 86.7
cat("Average study hours:", round(mean(class_data$StudyHours), 1), "\n")
## Average study hours: 16.7

Accessing Data Frame Elements

# Access columns
students$Name              # Using $
## [1] "Alice" "Bob"   "Carol" "David"
students[, "Grade"]        # Using column name
## [1] 85 92 78 88
students[, 3]              # Using position
## [1] 85 92 78 88
# Access rows
students[1, ]              # First row
students[c(1, 3), ]        # Rows 1 and 3
# Access specific cells
students[2, 3]             # Row 2, Column 3
## [1] 92
students$Grade[2]          # Second value in Grade column
## [1] 92

Filtering Data Frames

In Plain English: Filtering means “show me only the rows that meet certain conditions” - like using a filter in Excel. This is incredibly powerful for analyzing specific subsets of your data!

# Filter by condition
students[students$Age > 21, ]
students[students$Grade >= 85, ]
# Multiple conditions (AND)
students[students$Age > 20 & students$Grade > 80, ]
# Multiple conditions (OR)
students[students$Age < 21 | students$Grade > 90, ]
# Using subset() function
subset(students, Age > 21)
subset(students, Grade >= 85 & Passed == TRUE)

Visualizing Filtering Effects:

# Create sample data
all_students <- data.frame(
  Name = c("Amy", "Ben", "Chris", "Dana", "Eve", "Frank", "Grace", "Henry"),
  Score = c(72, 85, 92, 78, 90, 88, 95, 82),
  Attendance = c(75, 90, 95, 80, 92, 88, 98, 85)
)

# Define filtering criteria
high_performers <- all_students[all_students$Score >= 85 & all_students$Attendance >= 88, ]

# Show the filtering visually
par(mfrow = c(1, 2))

# Before filtering
plot(all_students$Attendance, all_students$Score,
     main = "All Students",
     xlab = "Attendance %",
     ylab = "Score",
     pch = 19,
     cex = 2,
     col = "lightgray",
     xlim = c(70, 100),
     ylim = c(70, 100))
abline(h = 85, col = "red", lty = 2, lwd = 2)
abline(v = 88, col = "blue", lty = 2, lwd = 2)
text(75, 97, "Score ≥ 85", col = "red")
text(95, 73, "Attendance ≥ 88", col = "blue", srt = 90)
legend("bottomright", 
       legend = c("All students", "Criteria lines"),
       col = c("lightgray", "red"),
       pch = c(19, NA),
       lty = c(NA, 2))

# After filtering
plot(high_performers$Attendance, high_performers$Score,
     main = "High Performers Only\n(Score ≥ 85 AND Attendance ≥ 88)",
     xlab = "Attendance %",
     ylab = "Score",
     pch = 19,
     cex = 2,
     col = "darkgreen",
     xlim = c(70, 100),
     ylim = c(70, 100))
text(high_performers$Attendance, high_performers$Score, 
     labels = high_performers$Name,
     pos = 3,
     cex = 0.8)

par(mfrow = c(1, 1))

cat("\n=== FILTERING RESULTS ===\n")
## 
## === FILTERING RESULTS ===
cat("Original:", nrow(all_students), "students\n")
## Original: 8 students
cat("After filtering:", nrow(high_performers), "students\n")
## After filtering: 5 students
cat("Filtered out:", nrow(all_students) - nrow(high_performers), "students\n")
## Filtered out: 3 students

Important: The subset() function automatically removes rows with NA values. Use bracket notation [ ] if you want to keep NAs.

3.4 Lists

Lists can contain different types and structures:

# Create a list
my_list <- list(
  numbers = 1:5,
  text = c("a", "b", "c"),
  logical = c(TRUE, FALSE),
  matrix = matrix(1:4, nrow = 2),
  dataframe = students
)

# View structure
str(my_list)
## List of 5
##  $ numbers  : int [1:5] 1 2 3 4 5
##  $ text     : chr [1:3] "a" "b" "c"
##  $ logical  : logi [1:2] TRUE FALSE
##  $ matrix   : int [1:2, 1:2] 1 2 3 4
##  $ dataframe:'data.frame':   4 obs. of  4 variables:
##   ..$ Name  : chr [1:4] "Alice" "Bob" "Carol" "David"
##   ..$ Age   : num [1:4] 20 22 21 23
##   ..$ Grade : num [1:4] 85 92 78 88
##   ..$ Passed: logi [1:4] TRUE TRUE TRUE TRUE
# Access list elements
my_list$numbers              # Using name
## [1] 1 2 3 4 5
my_list[[1]]                 # Using position (double brackets)
## [1] 1 2 3 4 5
my_list[["text"]]            # Using name (double brackets)
## [1] "a" "b" "c"
# Access nested elements
my_list$dataframe$Name       # Names from dataframe in list
## [1] "Alice" "Bob"   "Carol" "David"

Part 4: Working with Data Files

4.1 Working Directory

Before importing data, know where R is looking for files:

# Check current working directory
getwd()
## [1] "D:/Github/data_sciences/ANLY500-Analytics-I/Week01"
# Set working directory (use forward slashes or double backslashes)
# setwd("C:/Users/YourName/Documents/R_Projects")
# setwd("C:\\Users\\YourName\\Documents\\R_Projects")

Best Practices for File Paths

  1. Use R Projects (recommended): Automatically sets working directory
  2. Use R Markdown: Automatically uses the file’s location
  3. Use relative paths: "data/myfile.csv" instead of full paths
  4. Avoid setwd(): Breaks when you share code

4.2 Importing Data

Using Base R

# Read CSV files
data_csv <- read.csv("data/mydata.csv")

# Read tab-delimited files
data_tab <- read.table("data/mydata.txt", header = TRUE, sep = "\t")

# Read from clipboard (Windows)
data_clipboard <- read.table("clipboard", header = TRUE)

4.3 Exploring Your Data

Once data is loaded, explore it:

# Use built-in airquality dataset
data(airquality)

# View first few rows
head(airquality)
# View last few rows
tail(airquality)
# Structure
str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
# Summary statistics
summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 
# Dimensions
dim(airquality)
## [1] 153   6
nrow(airquality)
## [1] 153
ncol(airquality)
## [1] 6
# Column names
names(airquality)
## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"

Interactive Viewing

# Open data viewer (in RStudio)
View(airquality)

# Print specific summaries
summary(airquality$Temp)
table(airquality$Month)

Part 5: Basic Operations

In Plain English: Before we get to statistics, you need to understand how R compares values and makes logical decisions. These operations are the foundation of filtering data (“show me all students who scored > 85”) and conditional calculations (“if age < 18, then minor”).

5.1 Arithmetic Operations

# Basic operations
10 + 5
## [1] 15
10 - 5
## [1] 5
10 * 5
## [1] 50
10 / 5
## [1] 2
10^2
## [1] 100
# More complex
(10 + 5) * 2
## [1] 30
10 + 5 * 2        # Multiplication before addition
## [1] 20
# Functions
sqrt(16)          # Square root
## [1] 4
abs(-10)          # Absolute value
## [1] 10
log(10)           # Natural log
## [1] 2.302585
log10(100)        # Base-10 log
## [1] 2
exp(1)            # e^1
## [1] 2.718282

5.2 Comparison Operators

In Plain English: Comparison operators ask questions and return TRUE or FALSE. Think of them like a quality control inspector: “Is this product weight > 100 grams?” YES (TRUE) or NO (FALSE). These TRUE/FALSE answers then let you filter and select data.

Comparisons return TRUE or FALSE:

# Equality and inequality
5 == 5            # Equal to (two equals signs!)
## [1] TRUE
5 != 3            # Not equal to
## [1] TRUE
# Greater and less than
10 > 5
## [1] TRUE
10 < 5
## [1] FALSE
10 >= 10
## [1] TRUE
10 <= 9
## [1] FALSE
# Compare vectors element-wise
x <- c(1, 2, 3, 4, 5)
x > 3
## [1] FALSE FALSE FALSE  TRUE  TRUE
x == 3
## [1] FALSE FALSE  TRUE FALSE FALSE

Visualizing Comparison Operations:

test_scores <- c(72, 85, 91, 68, 95, 78, 88, 82)
threshold <- 80

par(mfrow = c(2, 2))

# 1. Visual representation of scores vs threshold
barplot(test_scores,
        names.arg = paste0("S", 1:length(test_scores)),
        main = "Scores vs Threshold (80)",
        ylab = "Score",
        col = ifelse(test_scores >= threshold, "green", "red"),
        ylim = c(0, 100))
abline(h = threshold, col = "blue", lwd = 3, lty = 2)
text(4, threshold + 3, "Threshold = 80", col = "blue", font = 2)
legend("bottomright",
       legend = c("Passed (≥80)", "Failed (<80)"),
       fill = c("green", "red"))

# 2. Truth table for ==, >, <
comparison_results <- data.frame(
  Value = test_scores,
  "Equal_80" = test_scores == 80,
  "Greater_80" = test_scores > 80,
  "Less_80" = test_scores < 80,
  "GreaterEqual_80" = test_scores >= 80
)

# Show counts
counts <- c(sum(test_scores == 80),
           sum(test_scores > 80),
           sum(test_scores < 80),
           sum(test_scores >= 80))

barplot(counts,
        names.arg = c("== 80", "> 80", "< 80", "≥ 80"),
        main = "Comparison Operations Count",
        ylab = "Number of Students",
        col = rainbow(4),
        ylim = c(0, max(counts) + 1))
text(x = 1:4 * 1.2 - 0.5, y = counts + 0.3, labels = counts, font = 2)

# 3. Element-wise comparison visualization
x_demo <- c(1, 2, 3, 4, 5)
compare_3 <- x_demo > 3

barplot(rbind(x_demo, compare_3 * 5),  # Scale TRUE/FALSE to 5 for visibility
        beside = TRUE,
        names.arg = paste0("x[", 1:5, "]"),
        main = "Element-wise: x > 3",
        ylab = "Value",
        col = c("lightblue", "salmon"),
        legend.text = c("Original Value", "x > 3 (TRUE=5, FALSE=0)"),
        args.legend = list(x = "topleft", cex = 0.8))

# 4. Multiple comparisons on same data
age_data <- c(15, 18, 22, 17, 25, 30, 16, 21)
comparisons_matrix <- rbind(
  "< 18" = sum(age_data < 18),
  "18-21" = sum(age_data >= 18 & age_data <= 21),
  "21-25" = sum(age_data > 21 & age_data <= 25),
  "> 25" = sum(age_data > 25)
)

barplot(comparisons_matrix,
        main = "Age Groups Using Comparisons",
        xlab = "Age Range",
        ylab = "Count",
        col = c("lightblue", "lightgreen", "yellow", "salmon"),
        ylim = c(0, max(comparisons_matrix) + 1))
text(x = 0.7, y = comparisons_matrix + 0.2, labels = comparisons_matrix, font = 2)

par(mfrow = c(1, 1))

cat("\n=== COMPARISON OPERATIONS DEMONSTRATION ===\n")
## 
## === COMPARISON OPERATIONS DEMONSTRATION ===
cat("\nTest Scores:", test_scores, "\n")
## 
## Test Scores: 72 85 91 68 95 78 88 82
cat("Threshold:", threshold, "\n\n")
## Threshold: 80
cat("Comparison Results:\n")
## Comparison Results:
cat("  Scores == 80:", sum(test_scores == 80), "students\n")
##   Scores == 80: 0 students
cat("  Scores > 80:", sum(test_scores > 80), "students\n")
##   Scores > 80: 5 students
cat("  Scores < 80:", sum(test_scores < 80), "students\n")
##   Scores < 80: 3 students
cat("  Scores >= 80:", sum(test_scores >= 80), "students (PASSED)\n")
##   Scores >= 80: 5 students (PASSED)
cat("  Scores <= 79:", sum(test_scores <= 79), "students (FAILED)\n")
##   Scores <= 79: 3 students (FAILED)
cat("\nWhich students passed (>=80)?\n")
## 
## Which students passed (>=80)?
passed <- which(test_scores >= 80)
cat("  Students:", paste0("S", passed, collapse=", "), "\n")
##   Students: S2, S3, S5, S7, S8
cat("  Their scores:", test_scores[passed], "\n")
##   Their scores: 85 91 95 88 82

5.3 Logical Operators

In Plain English: Logical operators combine multiple TRUE/FALSE questions. Think of airport security: “Do you have a ticket AND a passport?” Both must be TRUE. Or think of discounts: “Students OR seniors get 20% off” - either one being TRUE gets the discount. These operators let you create complex filters like “students who scored > 85 AND attended > 90% of classes.”

Combine conditions:

# AND operator (&)
TRUE & TRUE       # TRUE
## [1] TRUE
TRUE & FALSE      # FALSE
## [1] FALSE
# OR operator (|)
TRUE | FALSE      # TRUE
## [1] TRUE
FALSE | FALSE     # FALSE
## [1] FALSE
# NOT operator (!)
!TRUE             # FALSE
## [1] FALSE
!FALSE            # TRUE
## [1] TRUE
# Combining conditions
age <- 25
age > 18 & age < 30              # Between 18 and 30
## [1] TRUE
age < 18 | age > 65              # Less than 18 OR greater than 65
## [1] FALSE

Visualizing Logical Operators:

par(mfrow = c(2, 2))

# 1. AND operator truth table visualization
and_results <- c(
  "T & T" = 1,   # TRUE
  "T & F" = 0,   # FALSE
  "F & T" = 0,   # FALSE
  "F & F" = 0    # FALSE
)

barplot(and_results,
        main = "AND Operator (&)\nBoth Must Be TRUE",
        ylab = "Result (1=TRUE, 0=FALSE)",
        col = ifelse(and_results == 1, "green", "red"),
        ylim = c(0, 1.2))
text(x = 1:4 * 1.2 - 0.5, y = and_results + 0.1,
     labels = ifelse(and_results == 1, "TRUE", "FALSE"),
     font = 2)

# 2. OR operator truth table visualization
or_results <- c(
  "T | T" = 1,   # TRUE
  "T | F" = 1,   # TRUE
  "F | T" = 1,   # TRUE
  "F | F" = 0    # FALSE
)

barplot(or_results,
        main = "OR Operator (|)\nAt Least One Must Be TRUE",
        ylab = "Result (1=TRUE, 0=FALSE)",
        col = ifelse(or_results == 1, "green", "red"),
        ylim = c(0, 1.2))
text(x = 1:4 * 1.2 - 0.5, y = or_results + 0.1,
     labels = ifelse(or_results == 1, "TRUE", "FALSE"),
     font = 2)

# 3. Real-world example: filtering students
student_data <- data.frame(
  Name = c("Alice", "Bob", "Carol", "David", "Eve", "Frank", "Grace", "Henry"),
  Grade = c(92, 78, 85, 68, 95, 72, 88, 82),
  Attendance = c(95, 85, 92, 75, 98, 70, 90, 88)
)

# Different filtering criteria
high_grade <- student_data$Grade >= 85
high_attendance <- student_data$Attendance >= 90

filter_counts <- c(
  "Only High Grade" = sum(high_grade),
  "Only High Attend" = sum(high_attendance),
  "BOTH (AND)" = sum(high_grade & high_attendance),
  "EITHER (OR)" = sum(high_grade | high_attendance)
)

barplot(filter_counts,
        main = "Filtering Students\n(Grade≥85, Attendance≥90)",
        ylab = "Number of Students",
        col = c("lightblue", "lightgreen", "purple", "orange"),
        ylim = c(0, max(filter_counts) + 1),
        las = 2)
text(x = 1:4 * 1.2 - 0.5, y = filter_counts + 0.3,
     labels = filter_counts, font = 2)

# 4. Venn diagram-style visualization
plot(0:10, 0:10, type = "n", axes = FALSE, xlab = "", ylab = "",
     main = "Logical Operators Visualized")

# Draw circles
symbols(3, 5, circles = 2.5, inches = FALSE, add = TRUE, 
        fg = "blue", lwd = 3)
symbols(7, 5, circles = 2.5, inches = FALSE, add = TRUE, 
        fg = "red", lwd = 3)

# Label regions
text(2, 5, "Grade≥85\nOnly", col = "blue", font = 2)
text(8, 5, "Attend≥90\nOnly", col = "red", font = 2)
text(5, 5, "BOTH\n(AND)", col = "purple", font = 2, cex = 1.2)
text(5, 1, "EITHER (OR) = Everything inside either circle", 
     col = "orange", font = 2, cex = 0.9)

par(mfrow = c(1, 1))

cat("\n=== LOGICAL OPERATORS DEMONSTRATION ===\n")
## 
## === LOGICAL OPERATORS DEMONSTRATION ===
cat("\nStudent Data:\n")
## 
## Student Data:
print(student_data)
##    Name Grade Attendance
## 1 Alice    92         95
## 2   Bob    78         85
## 3 Carol    85         92
## 4 David    68         75
## 5   Eve    95         98
## 6 Frank    72         70
## 7 Grace    88         90
## 8 Henry    82         88
cat("\n1. AND Operator (&) - BOTH conditions must be TRUE:\n")
## 
## 1. AND Operator (&) - BOTH conditions must be TRUE:
excellence <- student_data[high_grade & high_attendance, ]
cat("  Students with Grade≥85 AND Attendance≥90:\n")
##   Students with Grade≥85 AND Attendance≥90:
if(nrow(excellence) > 0) {
  print(excellence)
} else {
  cat("  (None)\n")
}
##    Name Grade Attendance
## 1 Alice    92         95
## 3 Carol    85         92
## 5   Eve    95         98
## 7 Grace    88         90
cat("\n2. OR Operator (|) - AT LEAST ONE condition must be TRUE:\n")
## 
## 2. OR Operator (|) - AT LEAST ONE condition must be TRUE:
doing_well <- student_data[high_grade | high_attendance, ]
cat("  Students with Grade≥85 OR Attendance≥90:\n")
##   Students with Grade≥85 OR Attendance≥90:
print(doing_well)
##    Name Grade Attendance
## 1 Alice    92         95
## 3 Carol    85         92
## 5   Eve    95         98
## 7 Grace    88         90
cat("\n3. NOT Operator (!) - NEGATES the condition:\n")
## 
## 3. NOT Operator (!) - NEGATES the condition:
struggling <- student_data[!high_grade & !high_attendance, ]
cat("  Students with Grade<85 AND Attendance<90:\n")
##   Students with Grade<85 AND Attendance<90:
if(nrow(struggling) > 0) {
  print(struggling)
} else {
  cat("  (None)\n")
}
##    Name Grade Attendance
## 2   Bob    78         85
## 4 David    68         75
## 6 Frank    72         70
## 8 Henry    82         88
cat("\n4. Complex Combination:\n")
## 
## 4. Complex Combination:
cat("  Students with (Grade≥85 AND Attendance≥90) OR Grade≥95:\n")
##   Students with (Grade≥85 AND Attendance≥90) OR Grade≥95:
special <- student_data[(high_grade & high_attendance) | (student_data$Grade >= 95), ]
print(special)
##    Name Grade Attendance
## 1 Alice    92         95
## 3 Carol    85         92
## 5   Eve    95         98
## 7 Grace    88         90
cat("\nSummary Counts:\n")
## 
## Summary Counts:
cat("  Total students:", nrow(student_data), "\n")
##   Total students: 8
cat("  High grade (≥85):", sum(high_grade), "\n")
##   High grade (≥85): 4
cat("  High attendance (≥90):", sum(high_attendance), "\n")
##   High attendance (≥90): 4
cat("  Both (AND):", sum(high_grade & high_attendance), "\n")
##   Both (AND): 4
cat("  Either (OR):", sum(high_grade | high_attendance), "\n")
##   Either (OR): 4
cat("  Neither:", sum(!high_grade & !high_attendance), "\n")
##   Neither: 4

Common Beginner Mistake: Using = instead of == for comparison! Remember: x = 5 ASSIGNS 5 to x, while x == 5 CHECKS if x equals 5. Also, remember that & and | work element-wise on vectors, while && and || expect a single TRUE/FALSE on each side (older versions of R silently used only the first element; since R 4.3, giving them a longer vector is an error). The double forms are typically used in if-statements, which you’ll learn later.
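A quick sketch of that difference (the variable names here are just illustrative):

``` r
# & is element-wise: it returns one TRUE/FALSE per element
scores <- c(72, 85, 91)
scores > 80 & scores < 90      # FALSE TRUE FALSE

# && expects single values on each side (fine inside if-statements)
x <- 85
x > 80 && x < 90               # TRUE
```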

Using Logical Operators for Filtering

# Filter airquality data
hot_days <- airquality[airquality$Temp > 85, ]
head(hot_days)
# Multiple conditions
hot_windy <- airquality[airquality$Temp > 85 & airquality$Wind < 10, ]
head(hot_windy)
# Count matching rows
sum(airquality$Temp > 85, na.rm = TRUE)
## [1] 34
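The bracket-based filters above can also be written with base R’s subset() function. One practical difference worth knowing: with brackets, rows where the condition evaluates to NA come through as NA rows, while subset() treats a missing condition as FALSE and drops those rows. A minimal sketch using the same conditions:

``` r
# subset() keeps only rows where the condition is TRUE;
# rows where the condition is NA are dropped automatically
hot_windy2 <- subset(airquality, Temp > 85 & Wind < 10)
head(hot_windy2)
```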

Part 6: Handling Missing Data

In Plain English: Real-world data is messy! Sometimes values are missing - maybe someone didn’t answer a survey question, or a sensor failed to record a measurement. R represents missing data as NA (Not Available). Learning to identify and handle missing data is CRUCIAL because ignoring it can lead to wrong conclusions.

6.1 Identifying Missing Values

# Create data with missing values
x <- c(1, 2, NA, 4, 5, NA)

# Check for missing values
is.na(x)
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
# Count missing values
sum(is.na(x))
## [1] 2
# Which positions are missing?
which(is.na(x))
## [1] 3 6
# Check airquality for missing values
sum(is.na(airquality))                    # Total NAs
## [1] 44
colSums(is.na(airquality))                # NAs per column
##   Ozone Solar.R    Wind    Temp   Month     Day 
##      37       7       0       0       0       0

Visualizing Missing Data:

# Create example dataset with missing values
set.seed(123)
example_data <- data.frame(
  Student = paste0("S", 1:20),
  Test1 = c(85, 90, NA, 78, 92, 88, NA, 76, 95, 82,
            87, NA, 91, 84, 89, 93, 81, NA, 86, 88),
  Test2 = c(82, NA, 88, 76, 90, NA, 85, 79, NA, 84,
            86, 89, 92, NA, 87, 91, 83, 85, NA, 87),
  Test3 = c(NA, 92, 86, NA, 94, 87, 83, NA, 96, 85,
            88, 90, NA, 86, NA, 94, 84, 86, 88, NA)
)

# Visualize the pattern of missing data
par(mfrow = c(2, 2))

# 1. Missing data pattern
missing_matrix <- is.na(example_data[, -1])  # Exclude Student column
image(1:ncol(missing_matrix), 1:nrow(missing_matrix), 
      t(missing_matrix),
      col = c("lightblue", "red"),
      main = "Missing Data Pattern\n(Red = Missing)",
      xlab = "Test",
      ylab = "Student",
      axes = FALSE)
axis(1, at = 1:3, labels = c("Test1", "Test2", "Test3"))
axis(2, at = seq(1, 20, 5), labels = seq(1, 20, 5), las = 2)

# 2. Count of missing per test
missing_counts <- colSums(is.na(example_data[, -1]))
barplot(missing_counts,
        main = "Missing Values per Test",
        ylab = "Number Missing",
        col = "salmon",
        ylim = c(0, max(missing_counts) + 2))
text(x = 1:length(missing_counts) * 1.2 - 0.5, 
     y = missing_counts + 0.5, 
     labels = missing_counts)

# 3. Complete vs incomplete cases
complete_status <- ifelse(complete.cases(example_data), "Complete", "Has Missing")
status_table <- table(complete_status)
barplot(status_table,
        main = "Students: Complete vs Incomplete Data",
        ylab = "Count",
        col = c("green", "red"))

# 4. Comparison: with and without NAs
test1_with_na <- example_data$Test1
test1_without_na <- test1_with_na[!is.na(test1_with_na)]

boxplot(test1_with_na, test1_without_na,
        names = c("With NAs\n(silently dropped)", "NAs Removed\n(explicitly)"),
        main = "Effect of NA on Analysis",
        ylab = "Test1 Score",
        col = c("pink", "lightgreen"))

par(mfrow = c(1, 1))

cat("\n=== MISSING DATA SUMMARY ===\n")
## 
## === MISSING DATA SUMMARY ===
cat("Total students:", nrow(example_data), "\n")
## Total students: 20
cat("Students with complete data:", sum(complete.cases(example_data)), "\n")
## Students with complete data: 5
cat("Students with missing data:", sum(!complete.cases(example_data)), "\n")
## Students with missing data: 15
cat("Total missing values:", sum(is.na(example_data)), "\n")
## Total missing values: 15

Key concept – Missing Data: Field, Miles, and Field (2012, ch. 5) discuss three types of missing data mechanisms: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Understanding why data is missing determines appropriate handling strategies.

6.2 Handling Missing Values

In Plain English: Sometimes the best solution is to simply remove rows with missing data. Think of it like throwing away a damaged product on an assembly line. BUT be careful - if you throw away too much data, you might not have enough left for reliable analysis!

# Remove NAs from vector
x_complete <- x[!is.na(x)]
x_complete
## [1] 1 2 4 5
# Or use na.omit()
x_clean <- na.omit(x)
x_clean
## [1] 1 2 4 5
## attr(,"na.action")
## [1] 3 6
## attr(,"class")
## [1] "omit"
# Remove rows with any NA
airquality_complete <- na.omit(airquality)
nrow(airquality)           # Original
## [1] 153
nrow(airquality_complete)  # After removing NAs
## [1] 111
# Use complete.cases()
complete_rows <- complete.cases(airquality)
airquality_filtered <- airquality[complete_rows, ]

Visualizing the Impact of Removing NAs:

par(mfrow = c(2, 2))

# 1. Show data loss
data_summary <- c(Original = nrow(airquality), 
                  Complete = nrow(airquality_complete))
barplot(data_summary,
        main = "Data Loss from NA Removal",
        ylab = "Number of Rows",
        col = c("lightblue", "lightgreen"),
        ylim = c(0, max(data_summary) + 20))
text(x = 1:2 * 1.2 - 0.5, 
     y = data_summary + 5, 
     labels = paste0(data_summary, " rows\n", 
                     round(data_summary/nrow(airquality)*100, 1), "%"))

# 2. Compare distributions: Original Ozone
hist(airquality$Ozone,
     main = "Original Ozone Data\n(with NAs)",
     xlab = "Ozone Level",
     col = "pink",
     breaks = 15)
abline(v = mean(airquality$Ozone, na.rm = TRUE), 
       col = "red", lwd = 2, lty = 2)
text(x = mean(airquality$Ozone, na.rm = TRUE), y = 10,
     labels = paste0("Mean = ", round(mean(airquality$Ozone, na.rm = TRUE), 1)),
     pos = 4, col = "red")

# 3. Compare distributions: After NA removal
hist(airquality_complete$Ozone,
     main = "Complete Cases Only\n(NAs removed)",
     xlab = "Ozone Level",
     col = "lightgreen",
     breaks = 15)
abline(v = mean(airquality_complete$Ozone), 
       col = "darkgreen", lwd = 2, lty = 2)
text(x = mean(airquality_complete$Ozone), y = 10,
     labels = paste0("Mean = ", round(mean(airquality_complete$Ozone), 1)),
     pos = 4, col = "darkgreen")

# 4. Side-by-side boxplot comparison
boxplot(airquality$Ozone, airquality_complete$Ozone,
        names = c("With NAs\n(n=153)", 
                  "Complete Cases\n(n=111)"),
        main = "Distribution Comparison",
        ylab = "Ozone Level",
        col = c("pink", "lightgreen"))

par(mfrow = c(1, 1))

cat("\n=== IMPACT OF NA REMOVAL ===\n")
## 
## === IMPACT OF NA REMOVAL ===
cat("Original rows:", nrow(airquality), "\n")
## Original rows: 153
cat("Rows with complete data:", nrow(airquality_complete), "\n")
## Rows with complete data: 111
cat("Rows removed:", nrow(airquality) - nrow(airquality_complete), "\n")
## Rows removed: 42
cat("Percentage of data retained:", 
    round(nrow(airquality_complete)/nrow(airquality)*100, 1), "%\n")
## Percentage of data retained: 72.5 %
cat("\nOzone Mean (with na.rm=TRUE):", 
    round(mean(airquality$Ozone, na.rm = TRUE), 2), "\n")
## 
## Ozone Mean (with na.rm=TRUE): 42.13
cat("Ozone Mean (complete cases only):", 
    round(mean(airquality_complete$Ozone), 2), "\n")
## Ozone Mean (complete cases only): 42.1
cat("Difference:", 
    round(mean(airquality$Ozone, na.rm = TRUE) - mean(airquality_complete$Ozone), 2), "\n")
## Difference: 0.03

Important: Notice how removing NAs can change your dataset! In this example, we lost 42 rows (27% of data). Always report how much data was removed due to missing values, as this affects the generalizability of your results.

6.3 Functions and Missing Data

In Plain English: Here’s a trap beginners often fall into - if your data has even ONE missing value, functions like mean() will return NA instead of a number! It’s like asking “What’s the average height?” when one person didn’t report their height - R says “I can’t tell you an ‘average’ because I don’t have all the data.” The solution? Use na.rm = TRUE (NA remove = TRUE) to tell R “ignore the missing values and calculate anyway.”

# This returns NA
mean(airquality$Ozone)
## [1] NA
# Remove NAs before calculating
mean(airquality$Ozone, na.rm = TRUE)
## [1] 42.12931
median(airquality$Ozone, na.rm = TRUE)
## [1] 31.5
sd(airquality$Ozone, na.rm = TRUE)
## [1] 32.98788
# Summary handles NAs automatically
summary(airquality$Ozone)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   18.00   31.50   42.13   63.25  168.00      37

Visualizing Why na.rm = TRUE Matters:

# Create example with and without NAs
clean_scores <- c(85, 90, 78, 92, 88, 76, 95, 82)
dirty_scores <- c(85, 90, NA, 92, 88, NA, 95, 82)

par(mfrow = c(2, 2))

# 1. Comparison bar chart
methods <- c("Without na.rm", "With na.rm=TRUE")
results <- c(NA, mean(dirty_scores, na.rm = TRUE))
barplot(results,
        names.arg = methods,
        main = "Effect of na.rm Parameter",
        ylab = "Mean Score",
        col = c("red", "green"),
        ylim = c(0, 100))
text(x = c(0.7, 1.9), y = c(5, results[2] + 3),
     labels = c("Returns NA!", round(results[2], 1)))

# 2. Show clean vs dirty data distribution
boxplot(clean_scores, dirty_scores,
        names = c("Clean Data", "Data with NAs"),
        main = "Comparing Clean vs. Dirty Data",
        ylab = "Score",
        col = c("lightgreen", "pink"))
abline(h = mean(clean_scores), col = "darkgreen", lty = 2, lwd = 2)
abline(h = mean(dirty_scores, na.rm = TRUE), col = "red", lty = 2, lwd = 2)
legend("bottomright", 
       legend = c("Mean (clean)", "Mean (dirty, na.rm=T)"),
       col = c("darkgreen", "red"), lty = 2, lwd = 2)

# 3. Multiple statistics comparison
stats_clean <- c(Mean = mean(clean_scores),
                Median = median(clean_scores),
                SD = sd(clean_scores))
stats_dirty <- c(Mean = mean(dirty_scores, na.rm = TRUE),
                Median = median(dirty_scores, na.rm = TRUE),
                SD = sd(dirty_scores, na.rm = TRUE))

barplot(rbind(stats_clean, stats_dirty),
        beside = TRUE,
        main = "Clean vs Dirty: Multiple Statistics",
        ylab = "Value",
        col = c("lightgreen", "pink"),
        legend.text = c("Clean Data", "With NAs (na.rm=T)"),
        args.legend = list(x = "topright"))

# 4. Show how many observations were used
n_clean <- length(clean_scores)
n_dirty_total <- length(dirty_scores)
n_dirty_complete <- sum(!is.na(dirty_scores))

counts <- rbind(c(n_clean, n_clean),
               c(n_dirty_total, n_dirty_complete))
barplot(counts,
        beside = TRUE,
        main = "Sample Size: Total vs. Used",
        names.arg = c("Total N", "Used in Calculation"),
        ylab = "Count",
        col = c("lightgreen", "pink"),
        legend.text = c("Clean Data", "With NAs"),
        args.legend = list(x = "topleft"))

par(mfrow = c(1, 1))

cat("\n=== COMPARISON: CLEAN VS. DIRTY DATA ===\n")
## 
## === COMPARISON: CLEAN VS. DIRTY DATA ===
cat("\nCLEAN DATA (no missing):\n")
## 
## CLEAN DATA (no missing):
cat("  N:", length(clean_scores), "\n")
##   N: 8
cat("  Mean:", round(mean(clean_scores), 2), "\n")
##   Mean: 85.75
cat("  Median:", round(median(clean_scores), 2), "\n")
##   Median: 86.5
cat("  SD:", round(sd(clean_scores), 2), "\n")
##   SD: 6.73
cat("\nDIRTY DATA (with NAs):\n")
## 
## DIRTY DATA (with NAs):
cat("  Total N:", length(dirty_scores), "\n")
##   Total N: 8
cat("  Missing:", sum(is.na(dirty_scores)), "\n")
##   Missing: 2
cat("  Complete:", sum(!is.na(dirty_scores)), "\n")
##   Complete: 6
cat("  Mean (without na.rm):", mean(dirty_scores), "\n")
##   Mean (without na.rm): NA
cat("  Mean (with na.rm=TRUE):", round(mean(dirty_scores, na.rm = TRUE), 2), "\n")
##   Mean (with na.rm=TRUE): 88.67
cat("  Median (with na.rm=TRUE):", round(median(dirty_scores, na.rm = TRUE), 2), "\n")
##   Median (with na.rm=TRUE): 89
cat("  SD (with na.rm=TRUE):", round(sd(dirty_scores, na.rm = TRUE), 2), "\n")
##   SD (with na.rm=TRUE): 4.72

Pro Tip: Always use na.rm = TRUE with statistical functions when you have missing data. But also REPORT how many values were missing! Saying “mean = 87” is misleading if based on 6 observations when you started with 8.
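One way to follow that advice is a small reporting wrapper. This is a sketch, not a standard function; the name mean_with_n and the output format are illustrative:

``` r
# Report the mean together with how many values it was actually based on
mean_with_n <- function(x) {
  n_missing <- sum(is.na(x))
  m <- mean(x, na.rm = TRUE)
  cat("Mean =", round(m, 2),
      "(based on", length(x) - n_missing, "of", length(x),
      "values;", n_missing, "missing)\n")
  invisible(m)
}

mean_with_n(c(85, 90, NA, 92, 88, NA, 95, 82))
# Mean = 88.67 (based on 6 of 8 values; 2 missing)
```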


Part 7: Essential R Functions

7.1 Getting Help

# View help documentation
?mean
help(mean)

# Search help
??correlation

# View function arguments
args(mean)

# Run examples
example(mean)

7.2 Descriptive Statistics

In Plain English: Descriptive statistics are like the “highlights reel” of your data. Instead of looking at hundreds or thousands of numbers, you get a few key summaries: What’s typical? (mean, median) How spread out? (SD, range) What are the extremes? (min, max). These are your data’s “vital signs.”

# Create sample data
scores <- c(85, 92, 78, 90, 88, 95, 82, 87)

# Measures of central tendency
mean(scores)        # Average
## [1] 87.125
median(scores)      # Middle value
## [1] 87.5
# Note: mode() exists but returns the storage type, not the statistical mode

# Measures of dispersion
sd(scores)          # Standard deviation
## [1] 5.462535
var(scores)         # Variance
## [1] 29.83929
range(scores)       # Min and max
## [1] 78 95
min(scores)
## [1] 78
max(scores)
## [1] 95
# Other useful functions
sum(scores)
## [1] 697
length(scores)
## [1] 8
quantile(scores, probs = c(0.25, 0.5, 0.75))  # Quartiles
##   25%   50%   75% 
## 84.25 87.50 90.50
# Comprehensive summary
summary(scores)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   78.00   84.25   87.50   87.12   90.50   95.00
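Because base R has no built-in statistical mode, a common workaround builds one from table(). The function name stat_mode below is our own choice, not a base R function:

``` r
# Most frequent value(s) in a numeric vector; ties return all tied values
stat_mode <- function(x) {
  tab <- table(x)
  as.numeric(names(tab)[tab == max(tab)])
}

stat_mode(c(85, 92, 78, 90, 88, 95, 82, 87, 88))  # 88 appears twice
## [1] 88
```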

Visualizing Descriptive Statistics:

par(mfrow = c(2, 2))

# 1. Histogram with mean and median
hist(scores,
     main = "Distribution with Central Tendency",
     xlab = "Score",
     col = "lightblue",
     breaks = 8,
     ylim = c(0, 3))
abline(v = mean(scores), col = "red", lwd = 3, lty = 2)
abline(v = median(scores), col = "blue", lwd = 3, lty = 2)
legend("topright", 
       legend = c(paste("Mean =", round(mean(scores), 1)),
                 paste("Median =", round(median(scores), 1))),
       col = c("red", "blue"), lty = 2, lwd = 3)

# 2. Boxplot showing quartiles and outliers
boxplot(scores,
        main = "Boxplot: Quartiles & Range",
        ylab = "Score",
        col = "lightgreen",
        horizontal = FALSE)
points(1, mean(scores), pch = 18, col = "red", cex = 2)
text(1.3, quantile(scores, 0.75), "Q3 (75%)", pos = 4)
text(1.3, median(scores), "Median (Q2)", pos = 4)
text(1.3, quantile(scores, 0.25), "Q1 (25%)", pos = 4)
text(1.3, mean(scores), "Mean", pos = 4, col = "red")

# 3. Barplot of all individual scores with mean line
barplot(scores,
        main = "Individual Scores",
        ylab = "Score",
        xlab = "Student",
        col = rainbow(length(scores)),
        ylim = c(0, 100))
abline(h = mean(scores), col = "black", lwd = 2, lty = 2)
text(4, mean(scores) + 3, paste("Mean =", round(mean(scores), 1)))

# 4. Comparing key statistics
key_stats <- c(Min = min(scores),
              Q1 = quantile(scores, 0.25),
              Median = median(scores),
              Mean = mean(scores),
              Q3 = quantile(scores, 0.75),
              Max = max(scores))

barplot(key_stats,
        main = "Summary Statistics Overview",
        ylab = "Value",
        col = c("red", "orange", "yellow", "lightgreen", "lightblue", "purple"),
        ylim = c(0, 100),
        las = 2)
text(x = 1:length(key_stats) * 1.2 - 0.5,
     y = key_stats + 3,
     labels = round(key_stats, 1))

par(mfrow = c(1, 1))

cat("\n=== DESCRIPTIVE STATISTICS SUMMARY ===\n")
## 
## === DESCRIPTIVE STATISTICS SUMMARY ===
cat("Sample size (N):", length(scores), "\n")
## Sample size (N): 8
cat("\nMeasures of Central Tendency:\n")
## 
## Measures of Central Tendency:
cat("  Mean:", round(mean(scores), 2), "\n")
##   Mean: 87.12
cat("  Median:", median(scores), "\n")
##   Median: 87.5
cat("\nMeasures of Dispersion:\n")
## 
## Measures of Dispersion:
cat("  Standard Deviation:", round(sd(scores), 2), "\n")
##   Standard Deviation: 5.46
cat("  Variance:", round(var(scores), 2), "\n")
##   Variance: 29.84
cat("  Range:", range(scores)[1], "to", range(scores)[2], "\n")
##   Range: 78 to 95
cat("  IQR (Q3-Q1):", round(IQR(scores), 2), "\n")
##   IQR (Q3-Q1): 6.25
cat("\nExtreme Values:\n")
## 
## Extreme Values:
cat("  Minimum:", min(scores), "\n")
##   Minimum: 78
cat("  Maximum:", max(scores), "\n")
##   Maximum: 95
cat("\nQuartiles:\n")
## 
## Quartiles:
cat("  Q1 (25th percentile):", quantile(scores, 0.25), "\n")
##   Q1 (25th percentile): 84.25
cat("  Q2 (50th percentile / Median):", quantile(scores, 0.50), "\n")
##   Q2 (50th percentile / Median): 87.5
cat("  Q3 (75th percentile):", quantile(scores, 0.75), "\n")
##   Q3 (75th percentile): 90.5

Field et al. (2012, ch. 2) on Descriptive Statistics: “Before you can analyze your data, you need to describe it. Central tendency tells you about the typical score, dispersion tells you about the variability, and together they give you a complete picture of your data’s distribution.”

7.3 Frequency Tables

In Plain English: Frequency tables answer the question “How many?” How many students got an A? How many cars have 4 cylinders? They’re especially useful for categorical data (like grades, colors, or yes/no responses). Think of it as counting how many times each unique value appears in your data.

# Create categorical data
grades <- c("A", "B", "A", "C", "B", "A", "B", "A", "C", "B")

# Frequency table
table(grades)
## grades
## A B C 
## 4 4 2
# Proportions
prop.table(table(grades))
## grades
##   A   B   C 
## 0.4 0.4 0.2
# Cross-tabulation (two variables)
data(mtcars)
table(mtcars$cyl, mtcars$gear)
##    
##      3  4  5
##   4  1  8  2
##   6  2  4  1
##   8 12  0  2
# Add margins (totals)
addmargins(table(mtcars$cyl, mtcars$gear))
##      
##        3  4  5 Sum
##   4    1  8  2  11
##   6    2  4  1   7
##   8   12  0  2  14
##   Sum 15 12  5  32

Visualizing Frequency Tables:

par(mfrow = c(2, 2))

# 1. Simple bar chart of grades
grade_freq <- table(grades)
barplot(grade_freq,
        main = "Grade Distribution",
        xlab = "Grade",
        ylab = "Frequency (Count)",
        col = c("gold", "lightgray", "coral"),
        ylim = c(0, max(grade_freq) + 1))
text(x = 1:length(grade_freq) * 1.2 - 0.5,
     y = grade_freq + 0.3,
     labels = grade_freq)

# 2. Proportions (percentage) chart
grade_prop <- prop.table(grade_freq)
barplot(grade_prop,
        main = "Grade Distribution (Proportions)",
        xlab = "Grade",
        ylab = "Proportion",
        col = c("gold", "lightgray", "coral"),
        ylim = c(0, max(grade_prop) + 0.1))
text(x = 1:length(grade_prop) * 1.2 - 0.5,
     y = grade_prop + 0.02,
     labels = paste0(round(grade_prop * 100, 1), "%"))

# 3. Cross-tabulation: Cylinders vs Gears (heatmap style)
cross_tab <- table(mtcars$cyl, mtcars$gear)
barplot(cross_tab,
        beside = TRUE,
        main = "Cars: Cylinders × Gears",
        xlab = "Number of Gears",
        ylab = "Count",
        col = c("lightblue", "lightgreen", "salmon"),
        legend.text = c("4 cyl", "6 cyl", "8 cyl"),
        args.legend = list(x = "topright"))

# 4. Cross-tabulation as heatmap
image(1:ncol(cross_tab), 1:nrow(cross_tab), 
      t(as.matrix(cross_tab)),
      col = heat.colors(max(cross_tab)),
      main = "Heatmap: Cylinders × Gears",
      xlab = "Number of Gears",
      ylab = "Number of Cylinders",
      axes = FALSE)
axis(1, at = 1:ncol(cross_tab), labels = colnames(cross_tab))
axis(2, at = 1:nrow(cross_tab), labels = rownames(cross_tab), las = 2)

# Add cell values
for(i in 1:nrow(cross_tab)) {
  for(j in 1:ncol(cross_tab)) {
    text(j, i, cross_tab[i,j], col = "black", cex = 1.5)
  }
}

par(mfrow = c(1, 1))

cat("\n=== FREQUENCY TABLE SUMMARY ===\n")
## 
## === FREQUENCY TABLE SUMMARY ===
cat("\nGrade Frequencies:\n")
## 
## Grade Frequencies:
print(grade_freq)
## grades
## A B C 
## 4 4 2
cat("\nGrade Proportions:\n")
## 
## Grade Proportions:
print(round(grade_prop, 3))
## grades
##   A   B   C 
## 0.4 0.4 0.2
cat("\nPercentages:\n")
## 
## Percentages:
print(paste0(names(grade_prop), ": ", round(grade_prop * 100, 1), "%"))
## [1] "A: 40%" "B: 40%" "C: 20%"
cat("\n\nCross-Tabulation: Cylinders × Gears\n")
## 
## 
## Cross-Tabulation: Cylinders × Gears
print(addmargins(cross_tab))
##      
##        3  4  5 Sum
##   4    1  8  2  11
##   6    2  4  1   7
##   8   12  0  2  14
##   Sum 15 12  5  32
cat("\nInterpretation: Most common combination is", 
    paste0(rownames(cross_tab)[which(cross_tab == max(cross_tab), arr.ind = TRUE)[1,1]],
           " cylinders with ",
           colnames(cross_tab)[which(cross_tab == max(cross_tab), arr.ind = TRUE)[1,2]],
           " gears (n=", max(cross_tab), ")"))
## 
## Interpretation: Most common combination is 8 cylinders with 3 gears (n=12)

Why This Matters: Frequency tables are the foundation of categorical data analysis. They help you spot patterns (“Most students got A or B”), identify rare events (“Only 2 customers complained”), and prepare data for chi-square tests and other statistical analyses.
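As a preview of that chi-square connection, the cross-tabulation built above feeds directly into chisq.test(). (Expect a warning here, since several cells have small counts; we suppress it for display.)

``` r
# Test whether cylinder count and gear count are independent
cross_tab <- table(mtcars$cyl, mtcars$gear)
result <- suppressWarnings(chisq.test(cross_tab))
result$p.value   # small p-value suggests the variables are associated
```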


7.4 Data Exploration Functions

data(iris)

# Structure
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Summary statistics
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
# Dimensions
dim(iris)
## [1] 150   5
nrow(iris)
## [1] 150
ncol(iris)
## [1] 5
# Column names
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
# First and last rows
head(iris, n = 3)
tail(iris, n = 3)
# Unique values
unique(iris$Species)
## [1] setosa     versicolor virginica 
## Levels: setosa versicolor virginica
length(unique(iris$Species))
## [1] 3

7.5 Correlation and Covariance

In Plain English: Correlation measures how two variables move together. Do taller students weigh more? Does studying more lead to higher grades? Correlations range from -1 (perfect negative: as one goes up, the other goes down) to +1 (perfect positive: both move together). Zero means no linear relationship. Covariance is similar but harder to interpret because it isn’t standardized, so we usually prefer correlation.

# Correlation between two variables
cor(iris$Sepal.Length, iris$Sepal.Width)
## [1] -0.1175698
# Correlation matrix (numeric columns only)
cor(iris[, 1:4])
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
# Handle missing data
cor(airquality$Ozone, airquality$Temp, use = "complete.obs")
## [1] 0.6983603
# Covariance
cov(iris$Sepal.Length, iris$Sepal.Width)
## [1] -0.042434
cov(iris[, 1:4])
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
## Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
## Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
## Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063
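cor() gives you only the estimate. When you also want a p-value and a confidence interval for a correlation, base R’s cor.test() provides both:

``` r
# Correlation with a significance test and 95% confidence interval
result <- cor.test(iris$Sepal.Length, iris$Petal.Length)
result$estimate    # r = 0.87, matching the matrix above
result$p.value     # essentially zero: strong evidence of an association
result$conf.int    # plausible range for the true correlation
```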

Visualizing Correlations:

par(mfrow = c(2, 2))

# 1. Positive correlation example
set.seed(123)
x_pos <- 1:50
y_pos <- 2 * x_pos + rnorm(50, 0, 10)
cor_pos <- cor(x_pos, y_pos)

plot(x_pos, y_pos,
     main = paste0("Positive Correlation\nr = ", round(cor_pos, 2)),
     xlab = "Study Hours",
     ylab = "Grade",
     pch = 19,
     col = "blue")
abline(lm(y_pos ~ x_pos), col = "red", lwd = 2)
text(10, max(y_pos) - 10, "As X increases,\nY increases", col = "red", font = 2)

# 2. Negative correlation example
y_neg <- -2 * x_pos + 100 + rnorm(50, 0, 10)
cor_neg <- cor(x_pos, y_neg)

plot(x_pos, y_neg,
     main = paste0("Negative Correlation\nr = ", round(cor_neg, 2)),
     xlab = "Hours Watching TV",
     ylab = "Grade",
     pch = 19,
     col = "red")
abline(lm(y_neg ~ x_pos), col = "blue", lwd = 2)
text(10, min(y_neg) + 10, "As X increases,\nY decreases", col = "blue", font = 2)

# 3. No correlation example
y_zero <- rnorm(50, 50, 15)
cor_zero <- cor(x_pos, y_zero)

plot(x_pos, y_zero,
     main = paste0("No Correlation\nr = ", round(cor_zero, 2)),
     xlab = "Shoe Size",
     ylab = "Math Skill",
     pch = 19,
     col = "gray")
abline(lm(y_zero ~ x_pos), col = "darkgreen", lwd = 2)
text(25, 70, "No relationship!", col = "darkgreen", font = 2)

# 4. Real data: iris correlation matrix heatmap
cor_matrix <- cor(iris[, 1:4])
image(1:4, 1:4, cor_matrix,
      col = colorRampPalette(c("blue", "white", "red"))(20),
      main = "Iris Correlation Matrix Heatmap",
      xlab = "", ylab = "",
      axes = FALSE)
axis(1, at = 1:4, labels = colnames(iris)[1:4], las = 2, cex.axis = 0.8)
axis(2, at = 1:4, labels = colnames(iris)[1:4], las = 2, cex.axis = 0.8)

# Add correlation values
for(i in 1:4) {
  for(j in 1:4) {
    text(i, j, round(cor_matrix[j, i], 2), 
         col = ifelse(abs(cor_matrix[j, i]) > 0.5, "white", "black"),
         cex = 1.2, font = 2)
  }
}

par(mfrow = c(1, 1))

cat("\n=== CORRELATION ANALYSIS ===\n")
## 
## === CORRELATION ANALYSIS ===
cat("\nCorrelation Interpretation:\n")
## 
## Correlation Interpretation:
cat("  r = +1.0  : Perfect positive (both increase together)\n")
##   r = +1.0  : Perfect positive (both increase together)
cat("  r = +0.7  : Strong positive\n")
##   r = +0.7  : Strong positive
cat("  r = +0.3  : Weak positive\n")
##   r = +0.3  : Weak positive
cat("  r =  0.0  : No relationship\n")
##   r =  0.0  : No relationship
cat("  r = -0.3  : Weak negative\n")
##   r = -0.3  : Weak negative
cat("  r = -0.7  : Strong negative\n")
##   r = -0.7  : Strong negative
cat("  r = -1.0  : Perfect negative (one increases, other decreases)\n")
##   r = -1.0  : Perfect negative (one increases, other decreases)
cat("\nIris Dataset Correlations:\n")
## 
## Iris Dataset Correlations:
print(round(cor_matrix, 3))
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length        1.000      -0.118        0.872       0.818
## Sepal.Width        -0.118       1.000       -0.428      -0.366
## Petal.Length        0.872      -0.428        1.000       0.963
## Petal.Width         0.818      -0.366        0.963       1.000
cat("\nStrongest Correlations in Iris:\n")
## 
## Strongest Correlations in Iris:
# Find strongest correlations (excluding diagonal)
cor_matrix_off_diag <- cor_matrix
diag(cor_matrix_off_diag) <- NA
max_cor <- max(cor_matrix_off_diag, na.rm = TRUE)
max_pos <- which(cor_matrix_off_diag == max_cor, arr.ind = TRUE)[1,]
cat("  Strongest positive:", 
    colnames(iris)[max_pos[1]], "and", colnames(iris)[max_pos[2]],
    "=", round(max_cor, 3), "\n")
##   Strongest positive: Petal.Width and Petal.Length = 0.963
cat("\nAirquality Temperature vs Ozone:\n")
## 
## Airquality Temperature vs Ozone:
temp_ozone_cor <- cor(airquality$Temp, airquality$Ozone, use = "complete.obs")
cat("  Correlation:", round(temp_ozone_cor, 3), "\n")
##   Correlation: 0.698
cat("  Interpretation: ",
    ifelse(temp_ozone_cor > 0.5, "Strong positive - hotter days have more ozone",
          ifelse(temp_ozone_cor > 0.3, "Moderate positive",
                ifelse(temp_ozone_cor < -0.3, "Negative relationship", "Weak relationship"))),
    "\n")
##   Interpretation:  Strong positive - hotter days have more ozone

Field et al. (2012, ch. 6) on Correlation: “Correlation does NOT imply causation! Just because two variables correlate doesn’t mean one causes the other. Ice cream sales and drowning deaths correlate (both peak in summer), but ice cream doesn’t cause drowning - temperature is the confounding variable.”
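A quick simulation makes the ice cream example concrete. This is a sketch with made-up numbers: the confounder (temperature) drives both outcomes, which never influence each other, yet they correlate strongly - and the correlation largely vanishes once temperature is accounted for.

```r
# Sketch with simulated data: a confounder creates correlation
# without causation. 'temperature' drives both outcomes; the
# outcomes never influence each other.
set.seed(42)
temperature <- runif(100, 10, 35)                    # the confounder
ice_cream   <- 5 * temperature + rnorm(100, 0, 10)   # simulated sales
incidents   <- 2 * temperature + rnorm(100, 0, 5)    # simulated incidents

cor(ice_cream, incidents)   # strong positive correlation

# Remove temperature's influence (regression residuals) and
# the leftover correlation is close to zero
res_ice <- residuals(lm(ice_cream ~ temperature))
res_inc <- residuals(lm(incidents ~ temperature))
cor(res_ice, res_inc)
```

The residual trick here is a simple form of "controlling for" a third variable - the same idea behind partial correlation.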


Part 8: Writing Your Own Functions

In Plain English: Functions are like recipes. You write the instructions once, then use them over and over. Instead of typing the same code repeatedly, you create a function that does the work for you. Think of it as teaching R a new trick - once taught, R can perform that trick whenever you ask!

8.1 Function Basics

# Basic function structure
square <- function(x) {
  result <- x^2
  return(result)
}

# Use the function
square(5)
## [1] 25
square(10)
## [1] 100
# Simplified version (R returns last value automatically)
square_simple <- function(x) {
  x^2
}

square_simple(7)
## [1] 49

Visualizing How Functions Work:

par(mfrow = c(2, 2))

# 1. Show input-output relationship
inputs <- 1:10
outputs <- sapply(inputs, square_simple)

plot(inputs, outputs,
     type = "b",
     pch = 19,
     col = "blue",
     main = "Function: square(x) = x²",
     xlab = "Input (x)",
     ylab = "Output (x²)",
     cex = 1.5)
grid()
text(7, 40, "Output grows\nquadratically!", col = "red")

# 2. Compare multiple functions
cube <- function(x) x^3
sqrt_func <- function(x) sqrt(x)

x_vals <- seq(0, 5, 0.1)
plot(x_vals, x_vals, type = "l", lwd = 2, col = "black",
     main = "Comparing Different Functions",
     xlab = "Input (x)",
     ylab = "Output",
     ylim = c(0, 30))
lines(x_vals, square_simple(x_vals), col = "blue", lwd = 2)
lines(x_vals, sqrt_func(x_vals), col = "green", lwd = 2)
lines(x_vals, cube(x_vals) / 10, col = "red", lwd = 2)
legend("topleft",
       legend = c("x (linear)", "x² (square)", "√x (root)", "x³/10 (cube)"),
       col = c("black", "blue", "green", "red"),
       lwd = 2)

# 3. Function with multiple inputs demonstrated
test_values <- c(2, 4, 6, 8, 10)
results <- square_simple(test_values)

barplot(rbind(test_values, results),
        beside = TRUE,
        names.arg = paste0("Test", 1:5),
        main = "Input vs Output",
        ylab = "Value",
        col = c("lightblue", "salmon"),
        legend.text = c("Input", "Output"),
        args.legend = list(x = "topleft"))

# 4. "Black box" concept
plot(0:1, 0:1, type = "n", axes = FALSE, xlab = "", ylab = "",
     main = "Function as a 'Black Box'")
rect(0.3, 0.3, 0.7, 0.7, col = "gray", border = "black", lwd = 3)
text(0.5, 0.5, "square(x)\n\nComputes\nx²", cex = 1.2, font = 2)
arrows(0.1, 0.5, 0.28, 0.5, lwd = 3, col = "blue")
text(0.15, 0.58, "Input\nx = 5", col = "blue", font = 2)
arrows(0.72, 0.5, 0.9, 0.5, lwd = 3, col = "red")
text(0.85, 0.58, "Output\n25", col = "red", font = 2)

par(mfrow = c(1, 1))

cat("\n=== FUNCTION DEMONSTRATION ===\n")
## 
## === FUNCTION DEMONSTRATION ===
cat("Testing square() function with different inputs:\n")
## Testing square() function with different inputs:
for(i in 1:5) {
  input <- i * 2
  output <- square_simple(input)
  cat("  square(", input, ") = ", output, "\n", sep = "")
}
##   square(2) = 4
##   square(4) = 16
##   square(6) = 36
##   square(8) = 64
##   square(10) = 100

Why Write Functions? (1) Don’t Repeat Yourself (DRY) - Write once, use many times. (2) Fewer Errors - Fix a bug once, fixed everywhere. (3) Easier to Read - calculate_bmi(weight, height) is clearer than scattered formula code.
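As a minimal sketch of that `calculate_bmi()` idea (one possible definition, written here just for illustration):

```r
# One minimal version of the calculate_bmi() helper mentioned above
calculate_bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

calculate_bmi(70, 1.75)                    # single person
calculate_bmi(c(70, 85), c(1.75, 1.80))    # vectorized: several people at once
```

Because the body uses only arithmetic, the function is automatically vectorized - one definition works for a single value or a whole column.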

8.2 Functions with Multiple Arguments

In Plain English: Just like a recipe needs multiple ingredients (flour, sugar, eggs), functions can take multiple inputs. You can even set default values - like having a default oven temperature that cooks use unless they specifically change it.

# Function with two arguments
add_numbers <- function(a, b) {
  a + b
}

add_numbers(5, 3)
## [1] 8
add_numbers(10, 15)
## [1] 25
# Function with default values
greet <- function(name = "Friend") {
  paste("Hello,", name)
}

greet("Alice")
## [1] "Hello, Alice"
greet()           # Uses default "Friend"
## [1] "Hello, Friend"

Visualizing Multiple Argument Functions:

# Create a practical multi-argument function
calculate_grade <- function(homework, midterm, final, 
                           hw_weight = 0.3, mid_weight = 0.3, final_weight = 0.4) {
  weighted_grade <- (homework * hw_weight) + (midterm * mid_weight) + (final * final_weight)
  return(weighted_grade)
}

par(mfrow = c(2, 2))

# 1. Show how different inputs produce different outputs
students <- c("Alice", "Bob", "Carol", "David")
hw_scores <- c(85, 90, 78, 95)
mid_scores <- c(88, 85, 92, 90)
final_scores <- c(90, 88, 85, 92)

final_grades <- mapply(calculate_grade, hw_scores, mid_scores, final_scores)

barplot(rbind(hw_scores, mid_scores, final_scores, final_grades),
        beside = TRUE,
        names.arg = students,
        main = "Multi-Input Function: Grade Calculation",
        ylab = "Score",
        col = c("lightblue", "lightgreen", "salmon", "gold"),
        legend.text = c("Homework", "Midterm", "Final", "Weighted Grade"),
        args.legend = list(x = "topleft", cex = 0.8))

# 2. Effect of changing default weights
alice_grades <- c(
  calculate_grade(85, 88, 90),                              # Default weights
  calculate_grade(85, 88, 90, hw_weight = 0.5, mid_weight = 0.2, final_weight = 0.3),
  calculate_grade(85, 88, 90, hw_weight = 0.1, mid_weight = 0.1, final_weight = 0.8)
)

barplot(alice_grades,
        names.arg = c("Default\n(0.3/0.3/0.4)", "Option1\n(0.5/0.2/0.3)", "Option2\n(0.1/0.1/0.8)"),
        main = "Effect of Different Weights\n(Alice: HW=85, Mid=88, Final=90)",
        ylab = "Weighted Grade",
        col = rainbow(3),
        ylim = c(0, 100))
text(x = 1:3 * 1.2 - 0.5, y = alice_grades + 2,
     labels = round(alice_grades, 1), font = 2)

# 3. Temperature conversion function with multiple formulas
convert_temp <- function(temp, from = "F", to = "C") {
  if(from == "F" && to == "C") {
    return((temp - 32) * 5/9)
  } else if(from == "C" && to == "F") {
    return(temp * 9/5 + 32)
  } else {
    return(temp)  # Same scale
  }
}

temps_F <- seq(0, 100, 20)
temps_C <- sapply(temps_F, convert_temp, from = "F", to = "C")

plot(temps_F, temps_C,
     type = "b",
     pch = 19,
     col = "red",
     main = "Temperature Conversion Function",
     xlab = "Fahrenheit",
     ylab = "Celsius",
     cex = 1.5)
grid()
abline(h = 0, col = "blue", lty = 2, lwd = 2)
abline(v = 32, col = "blue", lty = 2, lwd = 2)
text(32, -10, "Freezing Point", pos = 4, col = "blue")

# 4. Default vs. custom arguments comparison
bmi_calculator <- function(weight_kg, height_m, show_category = TRUE) {
  bmi <- weight_kg / (height_m^2)
  if(show_category) {
    category <- ifelse(bmi < 18.5, "Underweight",
                      ifelse(bmi < 25, "Normal",
                            ifelse(bmi < 30, "Overweight", "Obese")))
    return(list(BMI = round(bmi, 1), Category = category))
  } else {
    return(round(bmi, 1))
  }
}

# Test with different people
people <- c("Person1", "Person2", "Person3", "Person4")
weights <- c(70, 85, 60, 95)
heights <- c(1.75, 1.80, 1.65, 1.70)

bmis <- mapply(function(w, h) bmi_calculator(w, h, show_category = FALSE), 
               weights, heights)

barplot(bmis,
        names.arg = people,
        main = "BMI Calculator Function\n(weight_kg, height_m, show_category)",
        ylab = "BMI",
        col = ifelse(bmis < 18.5, "lightblue",
                    ifelse(bmis < 25, "lightgreen",
                          ifelse(bmis < 30, "yellow", "red"))),
        ylim = c(0, 35))
abline(h = c(18.5, 25, 30), lty = 2, col = "gray")
text(2.5, 18.5, "Underweight|Normal", pos = 3, cex = 0.8)
text(2.5, 25, "Normal|Overweight", pos = 3, cex = 0.8)
text(2.5, 30, "Overweight|Obese", pos = 3, cex = 0.8)

par(mfrow = c(1, 1))

cat("\n=== MULTI-ARGUMENT FUNCTION DEMONSTRATION ===\n")
## 
## === MULTI-ARGUMENT FUNCTION DEMONSTRATION ===
cat("\nGrade Calculation for 4 Students:\n")
## 
## Grade Calculation for 4 Students:
for(i in 1:length(students)) {
  cat(" ", students[i], ": HW=", hw_scores[i], ", Mid=", mid_scores[i], 
      ", Final=", final_scores[i], " → Weighted=", round(final_grades[i], 1), "\n", sep="")
}
##  Alice: HW=85, Mid=88, Final=90 → Weighted=87.9
##  Bob: HW=90, Mid=85, Final=88 → Weighted=87.7
##  Carol: HW=78, Mid=92, Final=85 → Weighted=85
##  David: HW=95, Mid=90, Final=92 → Weighted=92.3
cat("\nTemperature Conversions:\n")
## 
## Temperature Conversions:
cat("  0°F =", round(convert_temp(0, "F", "C"), 1), "°C\n")
##   0°F = -17.8 °C
cat("  32°F =", round(convert_temp(32, "F", "C"), 1), "°C\n")
##   32°F = 0 °C
cat("  100°F =", round(convert_temp(100, "F", "C"), 1), "°C\n")
##   100°F = 37.8 °C
cat("\nBMI Calculations:\n")
## 
## BMI Calculations:
for(i in 1:length(people)) {
  result <- bmi_calculator(weights[i], heights[i])
  cat(" ", people[i], ": BMI=", result$BMI, " (", result$Category, ")\n", sep="")
}
##  Person1: BMI=22.9 (Normal)
##  Person2: BMI=26.2 (Overweight)
##  Person3: BMI=22 (Normal)
##  Person4: BMI=32.9 (Obese)

Pro Tip on Default Arguments: Default values make functions easier to use. Users can call greet() without arguments for standard behavior, but can customize by providing greet("Bob") when needed. This is why many R functions have sensible defaults!
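A small illustration of how callers mix defaults with named overrides (`describe()` is a made-up helper, used only for this demonstration):

```r
# Made-up helper: both arguments after x have defaults
describe <- function(x, digits = 2, prefix = "Value:") {
  paste(prefix, round(x, digits))
}

describe(pi)                     # "Value: 3.14"  - all defaults used
describe(pi, digits = 4)         # "Value: 3.1416" - override one by name
describe(pi, prefix = "Pi is")   # "Pi is 3.14"  - skip 'digits' entirely
```

Naming the argument (`digits = 4`) lets you override any default without supplying the ones before it.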

8.3 Practical Examples

In Plain English: Now let’s see functions “in the wild” - real statistical calculations you’ll use throughout this course. Once you write these functions, you can reuse them in every assignment!

# Calculate z-score
z_score <- function(x, mean_val, sd_val) {
  (x - mean_val) / sd_val
}

scores <- c(85, 92, 78, 90, 88)
z_score(85, mean(scores), sd(scores))
## [1] -0.2930973
# Convert Fahrenheit to Celsius
f_to_c <- function(temp_f) {
  temp_c <- (temp_f - 32) * 5/9
  return(temp_c)
}

f_to_c(98.6)
## [1] 37
f_to_c(32)
## [1] 0
f_to_c(212)
## [1] 100
# Function that returns multiple values (using list)
summary_stats <- function(x) {
  list(
    mean = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    min = min(x, na.rm = TRUE),
    max = max(x, na.rm = TRUE)
  )
}

summary_stats(scores)
## $mean
## [1] 86.6
## 
## $median
## [1] 88
## 
## $sd
## [1] 5.458938
## 
## $min
## [1] 78
## 
## $max
## [1] 92

Visualizing Practical Functions:

par(mfrow = c(2, 2))

# 1. Z-score transformation
test_scores <- c(65, 75, 80, 85, 90, 95, 100)
z_scores <- (test_scores - mean(test_scores)) / sd(test_scores)

plot(test_scores, z_scores,
     type = "b",
     pch = 19,
     col = "blue",
     main = "Z-Score Transformation",
     xlab = "Original Score",
     ylab = "Z-Score (standard deviations)",
     cex = 1.5)
abline(h = 0, col = "red", lty = 2, lwd = 2)
abline(h = c(-1, 1), col = "orange", lty = 3)
text(70, 0.2, "Mean (z=0)", col = "red")
text(70, 1.2, "+1 SD", col = "orange")
text(70, -1.2, "-1 SD", col = "orange")
grid()

# 2. Temperature conversion function
temps_F <- seq(0, 100, 10)
temps_C <- sapply(temps_F, f_to_c)

plot(temps_F, temps_C,
     type = "b",
     pch = 19,
     col = "red",
     main = "Fahrenheit to Celsius Conversion",
     xlab = "Temperature (°F)",
     ylab = "Temperature (°C)",
     cex = 1.5,
     lwd = 2)
abline(h = seq(-20, 40, 10), col = "gray", lty = 3)
abline(v = seq(0, 100, 20), col = "gray", lty = 3)
points(32, 0, pch = 19, col = "blue", cex = 2)
text(32, 0, "  Freezing Point", pos = 4, col = "blue")
points(98.6, f_to_c(98.6), pch = 19, col = "darkgreen", cex = 2)
text(98.6, f_to_c(98.6), "  Body Temp", pos = 4, col = "darkgreen")

# 3. Summary statistics function visualization
example_data <- rnorm(100, mean = 75, sd = 10)
stats <- summary_stats(example_data)

hist(example_data,
     main = "Data Distribution with Summary Stats",
     xlab = "Value",
     col = "lightblue",
     breaks = 20)
abline(v = stats$mean, col = "red", lwd = 3, lty = 1)
abline(v = stats$median, col = "blue", lwd = 3, lty = 2)
abline(v = stats$min, col = "gray", lwd = 2, lty = 3)
abline(v = stats$max, col = "gray", lwd = 2, lty = 3)
legend("topright",
       legend = c(paste("Mean =", round(stats$mean, 1)),
                 paste("Median =", round(stats$median, 1)),
                 "Min/Max"),
       col = c("red", "blue", "gray"),
       lty = c(1, 2, 3),
       lwd = c(3, 3, 2))

# 4. Custom grading function
grade_letter <- function(score) {
  if(score >= 90) return("A")
  else if(score >= 80) return("B")
  else if(score >= 70) return("C")
  else if(score >= 60) return("D")
  else return("F")
}

student_scores <- c(95, 88, 76, 65, 92, 58, 82, 71, 85, 79)
letter_grades <- sapply(student_scores, grade_letter)
grade_table <- table(letter_grades)

barplot(grade_table,
        main = "Letter Grade Distribution",
        xlab = "Grade",
        ylab = "Number of Students",
        col = c("gold", "lightgreen", "lightblue", "orange", "red"),
        ylim = c(0, max(grade_table) + 1))
text(x = 1:length(grade_table) * 1.2 - 0.5,
     y = grade_table + 0.3,
     labels = grade_table,
     font = 2)

par(mfrow = c(1, 1))

cat("\n=== PRACTICAL FUNCTION EXAMPLES ===\n")
## 
## === PRACTICAL FUNCTION EXAMPLES ===
cat("\n1. Z-Score Transformation:\n")
## 
## 1. Z-Score Transformation:
cat("  Original scores:", scores, "\n")
##   Original scores: 85 92 78 90 88
cat("  Mean:", mean(scores), ", SD:", round(sd(scores), 2), "\n")
##   Mean: 86.6 , SD: 5.46
cat("  Z-scores:\n")
##   Z-scores:
for(i in 1:length(scores)) {
  z <- z_score(scores[i], mean(scores), sd(scores))
  cat("    Score", scores[i], "→ z =", round(z, 2), "\n")
}
##     Score 85 → z = -0.29 
##     Score 92 → z = 0.99 
##     Score 78 → z = -1.58 
##     Score 90 → z = 0.62 
##     Score 88 → z = 0.26
cat("\n2. Temperature Conversions:\n")
## 
## 2. Temperature Conversions:
important_temps <- c(32, 98.6, 212)
for(temp in important_temps) {
  cat("  ", temp, "°F = ", round(f_to_c(temp), 1), "°C\n", sep="")
}
##   32°F = 0°C
##   98.6°F = 37°C
##   212°F = 100°C
cat("\n3. Summary Statistics Function:\n")
## 
## 3. Summary Statistics Function:
test_data <- c(85, 92, 78, 90, 88, 95, 82, 87)
stats_result <- summary_stats(test_data)
cat("  Data:", test_data, "\n")
##   Data: 85 92 78 90 88 95 82 87
cat("  Mean:", round(stats_result$mean, 2), "\n")
##   Mean: 87.12
cat("  Median:", stats_result$median, "\n")
##   Median: 87.5
cat("  SD:", round(stats_result$sd, 2), "\n")
##   SD: 5.46
cat("  Range:", stats_result$min, "to", stats_result$max, "\n")
##   Range: 78 to 95
cat("\n4. Letter Grade Conversion:\n")
## 
## 4. Letter Grade Conversion:
cat("  Scores:", student_scores, "\n")
##   Scores: 95 88 76 65 92 58 82 71 85 79
cat("  Grades:", letter_grades, "\n")
##   Grades: A B C D A F B C B C
cat("  Distribution:", paste(names(grade_table), "=", grade_table, collapse=", "), "\n")
##   Distribution: A = 2, B = 3, C = 3, D = 1, F = 1

Real-World Application: These aren’t toy examples! The z_score() function is used in standardizing test scores, the summary_stats() function appears in every data analysis, and custom grading functions save instructors hours of work. Functions turn repetitive tasks into one-line commands!
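A useful sanity check on hand-written helpers like z_score(): base R's built-in scale() performs the same standardization, so the two can be compared directly.

```r
# The hand-written z-score calculation agrees with base R's scale()
scores <- c(85, 92, 78, 90, 88)
z_manual  <- (scores - mean(scores)) / sd(scores)
z_builtin <- as.numeric(scale(scores))   # scale() returns a matrix; flatten it

all.equal(z_manual, z_builtin)   # TRUE: both give identical z-scores
```

Writing your own version first (then checking against the built-in) is a good way to make sure you understand what a function actually computes.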


Part 9: Combining Data

In Plain English: Real data often comes in pieces - one file has customer names, another has their purchases, a third has demographics. You need to combine these pieces like assembling a puzzle. R provides tools to stack data (add more rows), widen data (add more columns), and merge data (match by common IDs).
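The "match by common IDs" case is handled by base R's merge(). A brief preview with made-up tables (the stacking and widening tools follow below):

```r
# Matching rows by a shared ID column with merge()
customers <- data.frame(ID = 1:3, Name = c("Alice", "Bob", "Carol"))
purchases <- data.frame(ID = c(2, 3, 3), Amount = c(25, 40, 15))

merge(customers, purchases, by = "ID")                # only IDs present in both
merge(customers, purchases, by = "ID", all.x = TRUE)  # keep every customer (NA if no purchase)
```

Database users will recognize these as an inner join and a left join, respectively.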

9.1 Combining Vectors and Data Frames

# Combine vectors as columns
x <- c(1, 2, 3)
y <- c(4, 5, 6)
cbind(x, y)
##      x y
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
# Combine as rows
rbind(x, y)
##   [,1] [,2] [,3]
## x    1    2    3
## y    4    5    6
# Combine data frames by columns
df1 <- data.frame(ID = 1:3, Name = c("A", "B", "C"))
df2 <- data.frame(Score = c(85, 90, 78))
combined_cols <- cbind(df1, df2)
print(combined_cols)
##   ID Name Score
## 1  1    A    85
## 2  2    B    90
## 3  3    C    78
# Combine data frames by rows (must have same columns)
df3 <- data.frame(ID = 4:5, Name = c("D", "E"), Score = c(92, 88))
combined_rows <- rbind(combined_cols, df3)
print(combined_rows)
##   ID Name Score
## 1  1    A    85
## 2  2    B    90
## 3  3    C    78
## 4  4    D    92
## 5  5    E    88

Visualizing Data Combination:

par(mfrow = c(2, 2))

# 1. cbind: Side-by-side combination
vec1 <- c(10, 20, 30)
vec2 <- c(15, 25, 35)
combined_cols_demo <- cbind(vec1, vec2)

barplot(t(combined_cols_demo),
        beside = FALSE,
        main = "cbind(): Combines as Columns\n(Side-by-Side)",
        ylab = "Value",
        names.arg = c("Row1", "Row2", "Row3"),
        col = c("lightblue", "salmon"),
        legend.text = c("vec1", "vec2"),
        args.legend = list(x = "topleft"))

# 2. rbind: Stacking combination
barplot(rbind(vec1, vec2),
        beside = TRUE,
        main = "rbind(): Combines as Rows\n(Stacked)",
        ylab = "Value",
        names.arg = c("Col1", "Col2", "Col3"),
        col = c("lightblue", "salmon"),
        legend.text = c("vec1", "vec2"),
        args.legend = list(x = "topleft"))

# 3. Combining data frames by columns
df_demo1 <- data.frame(
  Student = c("Alice", "Bob", "Carol"),
  Age = c(20, 22, 21)
)
df_demo2 <- data.frame(
  Grade = c(85, 92, 78),
  PassFail = c("Pass", "Pass", "Fail")
)

# Show before
plot(1:2, 1:2, type = "n", xlim = c(0, 10), ylim = c(0, 10),
     axes = FALSE, xlab = "", ylab = "",
     main = "cbind(): Merging Data Frames by Columns")
rect(1, 4, 4, 9, col = "lightblue", border = "black", lwd = 2)
text(2.5, 7.5, "df1\nStudent\nAge", cex = 1.2, font = 2)
text(5, 6.5, "+", cex = 3, font = 2, col = "red")
rect(6, 4, 9, 9, col = "salmon", border = "black", lwd = 2)
text(7.5, 7.5, "df2\nGrade\nPassFail", cex = 1.2, font = 2)
arrows(4.5, 2.5, 4.5, 3.8, lwd = 3, col = "darkgreen")
rect(2, 0.5, 8, 2, col = "lightgreen", border = "black", lwd = 2)
text(5, 1.25, "Result: df1 + df2\n(All Columns)", cex = 1, font = 2)

# 4. Combining data frames by rows
plot(1:2, 1:2, type = "n", xlim = c(0, 10), ylim = c(0, 10),
     axes = FALSE, xlab = "", ylab = "",
     main = "rbind(): Stacking Data Frames by Rows")
rect(2, 7, 8, 9, col = "lightblue", border = "black", lwd = 2)
text(5, 8, "df1 (3 rows)", cex = 1.2, font = 2)
text(5, 6, "+", cex = 3, font = 2, col = "red")
rect(2, 4, 8, 5.5, col = "salmon", border = "black", lwd = 2)
text(5, 4.75, "df2 (2 more rows)", cex = 1.2, font = 2)
arrows(5, 3.2, 5, 3.8, lwd = 3, col = "darkgreen")
rect(2, 0.5, 8, 2.5, col = "lightgreen", border = "black", lwd = 2)
text(5, 1.5, "Result: df1 + df2\n(Total: 5 rows)", cex = 1, font = 2)

par(mfrow = c(1, 1))

cat("\n=== DATA COMBINATION DEMONSTRATION ===\n")
## 
## === DATA COMBINATION DEMONSTRATION ===
cat("\n1. cbind() Example (Column-wise):\n")
## 
## 1. cbind() Example (Column-wise):
cat("  vec1:", vec1, "\n")
##   vec1: 10 20 30
cat("  vec2:", vec2, "\n")
##   vec2: 15 25 35
cat("  Combined:\n")
##   Combined:
print(cbind(vec1, vec2))
##      vec1 vec2
## [1,]   10   15
## [2,]   20   25
## [3,]   30   35
cat("\n2. rbind() Example (Row-wise):\n")
## 
## 2. rbind() Example (Row-wise):
cat("  vec1:", vec1, "\n")
##   vec1: 10 20 30
cat("  vec2:", vec2, "\n")
##   vec2: 15 25 35
cat("  Combined:\n")
##   Combined:
print(rbind(vec1, vec2))
##      [,1] [,2] [,3]
## vec1   10   20   30
## vec2   15   25   35
cat("\n3. Data Frame Column Combination:\n")
## 
## 3. Data Frame Column Combination:
cat("  df1 dimensions:", nrow(df_demo1), "rows ×", ncol(df_demo1), "columns\n")
##   df1 dimensions: 3 rows × 2 columns
cat("  df2 dimensions:", nrow(df_demo2), "rows ×", ncol(df_demo2), "columns\n")
##   df2 dimensions: 3 rows × 2 columns
combined_demo <- cbind(df_demo1, df_demo2)
cat("  Combined dimensions:", nrow(combined_demo), "rows ×", ncol(combined_demo), "columns\n")
##   Combined dimensions: 3 rows × 4 columns
print(combined_demo)
##   Student Age Grade PassFail
## 1   Alice  20    85     Pass
## 2     Bob  22    92     Pass
## 3   Carol  21    78     Fail

Important Rule: For cbind(), data frames must have the same number of rows. For rbind(), data frames must have the same columns (same names and types). If these don’t match, R will throw an error!
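A defensive pattern worth adopting: check compatibility before combining. The safe_rbind() below is an illustrative wrapper written for this note, not a base R function.

```r
df_a <- data.frame(ID = 1:3, Score = c(85, 90, 78))
df_b <- data.frame(ID = 4:5, Score = c(92, 88))
df_c <- data.frame(ID = 6, Grade = 70)    # column name differs!

identical(names(df_a), names(df_b))   # TRUE  -> rbind() will work
identical(names(df_a), names(df_c))   # FALSE -> rbind() would throw an error

# Illustrative wrapper that fails fast with a clear check
safe_rbind <- function(x, y) {
  stopifnot(identical(names(x), names(y)))
  rbind(x, y)
}
nrow(safe_rbind(df_a, df_b))   # 5
```

The same idea applies to cbind(): compare nrow() of the two data frames before combining.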

9.2 Adding Columns and Rows

In Plain English: Sometimes you don’t want to combine two separate data frames - you just want to add ONE new column or ONE new row to your existing data. It’s like adding a new student to your class roster, or adding a “Final Grade” column to an existing gradebook.

# Add column to data frame
students <- data.frame(
  Name = c("Alice", "Bob", "Carol"),
  Age = c(20, 22, 21)
)

students$Grade <- c(85, 92, 78)
print(students)
##    Name Age Grade
## 1 Alice  20    85
## 2   Bob  22    92
## 3 Carol  21    78
# Add row
new_student <- data.frame(Name = "David", Age = 23, Grade = 88)
students <- rbind(students, new_student)
print(students)
##    Name Age Grade
## 1 Alice  20    85
## 2   Bob  22    92
## 3 Carol  21    78
## 4 David  23    88

Visualizing Adding Data:

par(mfrow = c(2, 2))

# 1. Show step-by-step column addition
# Start with basic data
class_start <- data.frame(
  Name = c("Alice", "Bob", "Carol", "David"),
  Age = c(20, 22, 21, 23)
)

# Add Homework column
class_with_hw <- class_start
class_with_hw$Homework <- c(85, 90, 78, 88)

# Add Exam column
class_complete <- class_with_hw
class_complete$Exam <- c(88, 85, 92, 90)

barplot(ncol(class_start):ncol(class_complete),
        names.arg = c("Start\n(2 cols)", "Add HW\n(3 cols)", "Add Exam\n(4 cols)"),
        main = "Adding Columns Sequentially",
        ylab = "Number of Columns",
        col = c("lightblue", "lightgreen", "salmon"),
        ylim = c(0, 5))
text(x = 1:3 * 1.2 - 0.5,
     y = ncol(class_start):ncol(class_complete) + 0.2,
     labels = ncol(class_start):ncol(class_complete),
     font = 2)

# 2. Show step-by-step row addition
class_3students <- data.frame(
  Name = c("Alice", "Bob", "Carol"),
  Grade = c(85, 90, 78)
)

class_4students <- class_3students
class_4students <- rbind(class_4students, 
                         data.frame(Name = "David", Grade = 88))

class_5students <- class_4students
class_5students <- rbind(class_5students,
                         data.frame(Name = "Eve", Grade = 95))

barplot(c(nrow(class_3students), nrow(class_4students), nrow(class_5students)),
        names.arg = c("Original\n(3 students)", "Add 1\n(4 students)", "Add Another\n(5 students)"),
        main = "Adding Rows Sequentially",
        ylab = "Number of Rows (Students)",
        col = c("lightblue", "lightgreen", "salmon"),
        ylim = c(0, 6))
text(x = 1:3 * 1.2 - 0.5,
     y = c(nrow(class_3students), nrow(class_4students), nrow(class_5students)) + 0.2,
     labels = c(nrow(class_3students), nrow(class_4students), nrow(class_5students)),
     font = 2)

# 3. Visualize data before and after adding the Homework column
scores_before <- class_start$Age          # originally only Age existed
scores_after <- class_complete$Homework   # Homework column was added later

barplot(rbind(scores_before, scores_after),
        beside = TRUE,
        names.arg = class_complete$Name,
        main = "Before & After Adding Homework Column",
        ylab = "Value",
        col = c("lightblue", "salmon"),
        legend.text = c("Age (Original)", "Homework (Added)"),
        args.legend = list(x = "topleft"))

# 4. Visualize student data before and after adding rows
original_grades <- c(85, 90, 78)
after_additions <- c(85, 90, 78, 88, 95)

plot(1:length(original_grades), original_grades,
     type = "b",
     pch = 19,
     col = "blue",
     main = "Dataset Growth: Adding Rows",
     xlab = "Student Number",
     ylab = "Grade",
     xlim = c(1, 5),
     ylim = c(70, 100),
     cex = 2)
points((length(original_grades)+1):length(after_additions), 
       after_additions[(length(original_grades)+1):length(after_additions)],
       pch = 19,
       col = "red",
       cex = 2)
lines(1:length(after_additions), after_additions, col = "gray", lty = 2)
legend("bottomright",
       legend = c("Original Students", "Added Students"),
       col = c("blue", "red"),
       pch = 19,
       cex = 1.2)

par(mfrow = c(1, 1))

cat("\n=== ADDING DATA DEMONSTRATION ===\n")
## 
## === ADDING DATA DEMONSTRATION ===
cat("\n1. Adding Columns:\n")
## 
## 1. Adding Columns:
cat("  Original data (Name, Age):\n")
##   Original data (Name, Age):
print(class_start)
##    Name Age
## 1 Alice  20
## 2   Bob  22
## 3 Carol  21
## 4 David  23
cat("\n  After adding Homework and Exam columns:\n")
## 
##   After adding Homework and Exam columns:
print(class_complete)
##    Name Age Homework Exam
## 1 Alice  20       85   88
## 2   Bob  22       90   85
## 3 Carol  21       78   92
## 4 David  23       88   90
cat("\n2. Adding Rows:\n")
## 
## 2. Adding Rows:
cat("  Original data (3 students):\n")
##   Original data (3 students):
print(class_3students)
##    Name Grade
## 1 Alice    85
## 2   Bob    90
## 3 Carol    78
cat("\n  After adding 2 more students:\n")
## 
##   After adding 2 more students:
print(class_5students)
##    Name Grade
## 1 Alice    85
## 2   Bob    90
## 3 Carol    78
## 4 David    88
## 5   Eve    95
cat("\n3. Tracking Growth:\n")
## 
## 3. Tracking Growth:
cat("  Started with:", nrow(class_3students), "rows ×", ncol(class_start), "columns\n")
##   Started with: 3 rows × 2 columns
cat("  Ended with:", nrow(class_5students), "rows ×", ncol(class_complete), "columns\n")
##   Ended with: 5 rows × 4 columns
cat("  Added:", nrow(class_5students) - nrow(class_3students), "rows and",
    ncol(class_complete) - ncol(class_start), "columns\n")
##   Added: 2 rows and 2 columns

Pro Tip on Growing Data: In R, repeatedly adding rows with rbind() in a loop is SLOW for large datasets (R creates a new copy each time). If you’re adding many rows, it’s better to create a list of data frames and combine them all at once using do.call(rbind, my_list). But for small datasets in Week 1, don’t worry about this optimization!
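A sketch of that list-then-combine pattern:

```r
# Slow pattern (copies the growing data frame on every iteration):
#   result <- rbind(result, new_row)
# Faster pattern: fill a pre-allocated list, then combine once
pieces <- vector("list", 5)
for (i in 1:5) {
  pieces[[i]] <- data.frame(ID = i, Value = i^2)
}
combined <- do.call(rbind, pieces)
nrow(combined)   # 5 rows, built with a single rbind()
```

do.call(rbind, pieces) calls rbind() once with all five data frames as arguments, so only one copy is made.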


Part 10: Practical Example - Palmer Penguins

In Plain English: Now let’s apply EVERYTHING you’ve learned to a real dataset! The Palmer Penguins dataset contains measurements of 344 penguins from 3 species in Antarctica. This is REAL scientific data - not made-up numbers. You’ll see how all the concepts (vectors, data frames, filtering, summarizing, visualizing) work together in actual data analysis.

Let’s put everything together with a real dataset:

# Install and load palmerpenguins package
# install.packages("palmerpenguins")
library(palmerpenguins)

# Load the penguins dataset
data(penguins)

# Explore the data
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
head(penguins)
summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2
# How many penguins of each species?
table(penguins$species)
## 
##    Adelie Chinstrap    Gentoo 
##       152        68       124
# Average bill length by species
aggregate(bill_length_mm ~ species, data = penguins, FUN = mean, na.rm = TRUE)
# Filter for Adelie penguins only
adelie <- subset(penguins, species == "Adelie")
nrow(adelie)
## [1] 152
# Penguins with bill length > 50mm
long_bills <- subset(penguins, bill_length_mm > 50)
nrow(long_bills)
## [1] 52
# Create a new variable: bill aspect ratio
penguins$bill_ratio <- penguins$bill_length_mm / penguins$bill_depth_mm
head(penguins$bill_ratio)
## [1] 2.090909 2.270115 2.238889       NA 1.901554 1.907767
# Summary of body mass by sex
aggregate(body_mass_g ~ sex, data = penguins, FUN = mean, na.rm = TRUE)
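As an aside, tapply() is another base-R way to compute the same kind of group means. This minimal sketch uses a small hand-made data frame (toy values, not the real penguin measurements) so it runs on its own:

```r
# Toy data frame mimicking two penguins columns (values are made up)
toy <- data.frame(
  species     = factor(c("Adelie", "Adelie", "Gentoo", "Gentoo")),
  body_mass_g = c(3700, 3800, 5000, 5200)
)

# tapply() splits a vector by a factor and applies a function to each group
tapply(toy$body_mass_g, toy$species, mean, na.rm = TRUE)
## Adelie Gentoo 
##   3750   5100
```

The formula interface of aggregate() returns a data frame, which is handy for further processing; tapply() returns a named vector, which is handy for quick lookups and plotting.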

Comprehensive Penguin Data Visualization:

# Remove NAs for cleaner visualizations
penguins_clean <- na.omit(penguins)

par(mfrow = c(3, 3))

# 1. Species distribution
species_counts <- table(penguins$species)
bp <- barplot(species_counts,
        main = "Penguin Species Distribution",
        ylab = "Count",
        col = c("darkorange", "purple", "cyan4"),
        ylim = c(0, max(species_counts) + 20))
text(x = bp,  # barplot() returns the bar midpoints, so labels stay centered
     y = species_counts + 5,
     labels = species_counts,
     font = 2)

# 2. Bill length by species
boxplot(bill_length_mm ~ species, data = penguins_clean,
        main = "Bill Length by Species",
        xlab = "Species",
        ylab = "Bill Length (mm)",
        col = c("darkorange", "purple", "cyan4"))

# 3. Bill depth by species
boxplot(bill_depth_mm ~ species, data = penguins_clean,
        main = "Bill Depth by Species",
        xlab = "Species",
        ylab = "Bill Depth (mm)",
        col = c("darkorange", "purple", "cyan4"))

# 4. Body mass distribution
hist(penguins_clean$body_mass_g,
     main = "Body Mass Distribution",
     xlab = "Body Mass (g)",
     col = "lightblue",
     breaks = 20)
abline(v = mean(penguins_clean$body_mass_g), col = "red", lwd = 2, lty = 2)

# 5. Flipper length vs Body mass (scatter)
plot(penguins_clean$body_mass_g, penguins_clean$flipper_length_mm,
     main = "Body Mass vs Flipper Length",
     xlab = "Body Mass (g)",
     ylab = "Flipper Length (mm)",
     pch = 19,
     col = as.numeric(penguins_clean$species))
legend("bottomright",
       legend = levels(penguins_clean$species),
       col = 1:3,
       pch = 19,
       cex = 0.8)

# 6. Bill dimensions scatter plot
plot(penguins_clean$bill_length_mm, penguins_clean$bill_depth_mm,
     main = "Bill Length vs Depth",
     xlab = "Bill Length (mm)",
     ylab = "Bill Depth (mm)",
     pch = 19,
     col = c("darkorange", "purple", "cyan4")[as.numeric(penguins_clean$species)])
legend("topright",
       legend = levels(penguins_clean$species),
       col = c("darkorange", "purple", "cyan4"),
       pch = 19,
       cex = 0.8)

# 7. Body mass by species and sex
aggregate_mass <- aggregate(body_mass_g ~ species + sex, 
                           data = penguins_clean, 
                           FUN = mean)
mass_matrix <- matrix(aggregate_mass$body_mass_g, nrow = 2, ncol = 3,
                      byrow = TRUE)  # aggregate() varies species fastest, so fill row by row
colnames(mass_matrix) <- levels(penguins_clean$species)
rownames(mass_matrix) <- c("Female", "Male")

barplot(mass_matrix,
        beside = TRUE,
        main = "Body Mass by Species & Sex",
        ylab = "Body Mass (g)",
        col = c("pink", "lightblue"),
        legend.text = TRUE,
        args.legend = list(x = "topleft"))

# 8. Island distribution
island_counts <- table(penguins$island)
barplot(island_counts,
        main = "Penguins by Island",
        ylab = "Count",
        col = terrain.colors(3),
        las = 2)

# 9. Flipper length by species
boxplot(flipper_length_mm ~ species, data = penguins_clean,
        main = "Flipper Length by Species",
        xlab = "Species",
        ylab = "Flipper Length (mm)",
        col = c("darkorange", "purple", "cyan4"))

par(mfrow = c(1, 1))

# Create comprehensive summary table
cat("\n=== PALMER PENGUINS DATA ANALYSIS ===\n")
## 
## === PALMER PENGUINS DATA ANALYSIS ===
cat("\nDataset Overview:\n")
## 
## Dataset Overview:
cat("  Total penguins:", nrow(penguins), "\n")
##   Total penguins: 344
cat("  Complete cases:", nrow(penguins_clean), "\n")
##   Complete cases: 333
cat("  Number of species:", length(unique(penguins$species)), "\n")
##   Number of species: 3
cat("  Number of islands:", length(unique(penguins$island)), "\n")
##   Number of islands: 3
cat("  Years covered:", min(penguins$year), "to", max(penguins$year), "\n")
##   Years covered: 2007 to 2009
cat("\nSpecies Counts:\n")
## 
## Species Counts:
for(sp in names(species_counts)) {
  cat("  ", sp, ":", species_counts[sp], "\n")
}
##    Adelie : 152 
##    Chinstrap : 68 
##    Gentoo : 124
cat("\nAverage Measurements by Species:\n")
## 
## Average Measurements by Species:
for(sp in levels(penguins_clean$species)) {
  subset_sp <- subset(penguins_clean, species == sp)
  cat("\n  ", sp, "Penguins:\n")
  cat("    Bill Length:", round(mean(subset_sp$bill_length_mm), 1), "mm\n")
  cat("    Bill Depth:", round(mean(subset_sp$bill_depth_mm), 1), "mm\n")
  cat("    Flipper Length:", round(mean(subset_sp$flipper_length_mm), 1), "mm\n")
  cat("    Body Mass:", round(mean(subset_sp$body_mass_g), 0), "g\n")
}
## 
##    Adelie Penguins:
##     Bill Length: 38.8 mm
##     Bill Depth: 18.3 mm
##     Flipper Length: 190.1 mm
##     Body Mass: 3706 g
## 
##    Chinstrap Penguins:
##     Bill Length: 48.8 mm
##     Bill Depth: 18.4 mm
##     Flipper Length: 195.8 mm
##     Body Mass: 3733 g
## 
##    Gentoo Penguins:
##     Bill Length: 47.6 mm
##     Bill Depth: 15 mm
##     Flipper Length: 217.2 mm
##     Body Mass: 5092 g
cat("\nSex Differences (Overall):\n")
## 
## Sex Differences (Overall):
sex_comparison <- aggregate(body_mass_g ~ sex, data = penguins_clean, FUN = mean)
for(i in 1:nrow(sex_comparison)) {
  # as.character() prints the factor label; cat() would otherwise show its integer code
  cat("  ", as.character(sex_comparison$sex[i]), ":", round(sex_comparison$body_mass_g[i], 0), "g\n")
}
##    female : 3862 g 
##    male : 4546 g
cat("\nInteresting Facts:\n")
## 
## Interesting Facts:
cat("  Heaviest penguin:", max(penguins_clean$body_mass_g), "g\n")
##   Heaviest penguin: 6300 g
cat("  Lightest penguin:", min(penguins_clean$body_mass_g), "g\n")
##   Lightest penguin: 2700 g
cat("  Longest bill:", max(penguins_clean$bill_length_mm), "mm\n")
##   Longest bill: 59.6 mm
cat("  Longest flippers:", max(penguins_clean$flipper_length_mm), "mm\n")
##   Longest flippers: 231 mm
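If you also want to know which penguin holds a record, which.max() returns the row position of the maximum, and you can use that position for subsetting. A minimal sketch with made-up values (the same pattern works on penguins_clean):

```r
# Toy data (made-up values) standing in for the cleaned penguin data
toy <- data.frame(
  species     = c("Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3700, 6300, 3800)
)

heaviest_row <- which.max(toy$body_mass_g)  # row index of the largest value
toy[heaviest_row, ]                         # the full record for that penguin
toy$species[heaviest_row]
## [1] "Gentoo"
```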

What This Demonstrates: This analysis shows the power of R. With just a few commands, you have:

  1. Loaded real scientific data (344 penguins!)
  2. Explored its structure (str(), head(), summary())
  3. Filtered subsets (Adelie penguins, long bills)
  4. Created a new variable (bill ratio)
  5. Calculated group statistics (aggregate())
  6. Made nine different visualizations showing patterns

This is exactly what data scientists do every day - and you just did it in Week 1!


Summary

Key Takeaways

  1. R Basics
    • R is free, powerful, and reproducible
    • RStudio makes R easier to use
    • Objects store values with <-
  2. Data Types
    • Numeric, character, logical, factor
    • Special values: NA, NULL, Inf, NaN
  3. Data Structures
    • Vectors: 1-dimensional, same type
    • Matrices: 2-dimensional, same type
    • Data frames: 2-dimensional, different types
    • Lists: flexible, can contain anything
  4. Working with Data
    • Import with read.csv() or rio::import()
    • Explore with head(), str(), summary()
    • Subset with [ ], $, or subset()
  5. Missing Data
    • Identify with is.na()
    • Remove with na.omit() or na.rm = TRUE
    • Understand why data is missing (MCAR, MAR, MNAR)
  6. Functions
    • Use built-in functions for analysis
    • Write your own functions for custom tasks
    • Get help with ?function_name
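As a quick refresher on the last takeaway, here is a small custom function. It is an illustrative helper (not part of the course code) that combines several built-ins and handles missing values the same way na.rm = TRUE does elsewhere this week:

```r
# describe(): report the mean and range of a numeric vector, ignoring NAs
describe <- function(x) {
  c(mean = mean(x, na.rm = TRUE),
    min  = min(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE))
}

describe(c(2700, 3450, NA, 6300))
## mean  min  max 
## 4150 2700 6300
```

Returning a named vector keeps the output compact; if you later want one row per group, wrap the call in tapply() or aggregate().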

Next Steps

Week 2 will build on these fundamentals:

  • Advanced data manipulation with dplyr
  • Creating visualizations with ggplot2
  • More statistical analyses
  • Hypothesis testing basics

Practice Resources


References

Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. SAGE Publications Ltd., London.

R Core Team (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis (2nd ed.). Springer.

Horst, A. M., Hill, A. P., & Gorman, K. B. (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/


Appendix: Additional Resources

Useful Packages for Beginners

# Data manipulation
install.packages("dplyr")
install.packages("tidyr")

# Visualization
install.packages("ggplot2")

# Data import
install.packages("rio")
install.packages("readr")
install.packages("haven")       # For SPSS, Stata, SAS

# Reporting
install.packages("rmarkdown")
install.packages("knitr")

# Practice datasets
install.packages("palmerpenguins")
# Note: the "datasets" package ships with base R - no installation needed

Common Keyboard Shortcuts (RStudio)

  • Ctrl/Cmd + Enter: Run current line/selection
  • Ctrl/Cmd + Shift + Enter: Run entire script
  • Ctrl/Cmd + S: Save file
  • Ctrl/Cmd + Shift + C: Comment/uncomment
  • Alt/Option + -: Insert <- assignment operator
  • Tab: Auto-complete function/variable names

Where to Get Help

  1. Built-in help: ?function_name or help(function_name)
  2. RStudio Community: https://community.rstudio.com/
  3. Stack Overflow: Tag questions with [r]
  4. R Documentation: https://www.rdocumentation.org/
  5. Your instructor: Office hours or course forum

Tips for Success

  1. Practice regularly - Programming is a skill learned by doing
  2. Type code yourself - Don’t just copy-paste
  3. Read error messages carefully - They’re trying to help!
  4. Use comments - Explain your code with #
  5. Save often - Use Ctrl/Cmd + S frequently
  6. Start simple - Break complex problems into smaller steps
  7. Ask for help - Everyone starts as a beginner!

Remember: R has a steep learning curve at first, but it gets easier with practice. You’re building a valuable skill that will serve you throughout your analytics career!


Document generated on March 11, 2026