library(seedhash)
gen <- SeedHashGenerator$new("ANLY530-Week03-RandomForest")
MASTER_SEED <- gen$generate_seeds(1)
set.seed(MASTER_SEED)

Random Seed Source: seedhash package — input string: "ANLY530-Week03-RandomForest" → seed: 694647825

Introduction

Last week, we learned that a single Decision Tree — while intuitive — is unstable: a small change in the data can produce a completely different tree. This week, we fix that by combining many trees into a forest.

“If you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.” — A. Géron [1]

“Random forests are a modification of bagged decision trees that build a large collection of de-correlated trees to further improve predictive performance.” — B. C. Boehmke & B. M. Greenwell [2]

This is the wisdom of the crowd in action. Thousands of people independently guessing the weight of an ox will, on average, be surprisingly accurate — more accurate than any single expert. The same principle makes ensembles so powerful.

Learning Objectives:

By the end of this tutorial, you will be able to:

  1. Explain the intuition behind ensemble learning and why diversity matters.
  2. Distinguish hard voting from soft voting classifiers.
  3. Implement Bagging and understand Out-of-Bag (OOB) evaluation.
  4. Build, tune, and interpret a Random Forest in R.
  5. Explain how feature importance is computed in a Random Forest.
  6. Understand Extra-Trees and when to prefer them over Random Forests.
  7. Compare Random Forests to Boosting (AdaBoost, Gradient Boosting).
  8. Apply Random Forests to both classification and regression tasks.

Required Packages

if (!require("pacman")) install.packages("pacman")
pacman::p_load(
  tidyverse,
  randomForest,
  rpart,
  rpart.plot,
  gridExtra,
  gbm,
  scales
)

Part 1: The Wisdom of the Crowd — Voting Classifiers

1.1 Why Ensembles Work

Suppose you flip a slightly biased coin that lands heads 51% of the time. Flip it 1,000 times and the probability of getting a majority of heads is about 73%; flip it 10,000 times and it climbs to over 97%. This is the Law of Large Numbers in action.
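These figures can be checked directly with base R's binomial CDF (a quick sanity check, not part of the tutorial pipeline):

```r
# Probability of a majority of heads for a coin with P(heads) = 0.51
p_heads <- 0.51
p_1k  <- 1 - pbinom(500, size = 1000, prob = p_heads)    # majority in 1,000 flips
p_10k <- 1 - pbinom(5000, size = 10000, prob = p_heads)  # majority in 10,000 flips
round(c(flips_1k = p_1k, flips_10k = p_10k), 3)
```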

The same logic applies to classifiers. If each classifier is correct 51% of the time (barely better than random), but each makes different mistakes, combining 1,000 of them can yield 75% accuracy!

The key requirement: classifiers must be as independent as possible (i.e., make different types of errors).

simulate_ensemble_accuracy <- function(n_classifiers, base_acc) {
  # P(majority correct) for n independent classifiers each with P(correct) = base_acc
  # Using binomial distribution: P(k > n/2 correct out of n)
  k_needed <- floor(n_classifiers / 2) + 1
  sum(dbinom(k_needed:n_classifiers, n_classifiers, base_acc))
}

classifiers <- c(1, 3, 5, 11, 21, 51, 101, 501, 1001)
results <- data.frame(
  n = classifiers,
  acc_55 = sapply(classifiers, simulate_ensemble_accuracy, base_acc = 0.55),
  acc_60 = sapply(classifiers, simulate_ensemble_accuracy, base_acc = 0.60),
  acc_70 = sapply(classifiers, simulate_ensemble_accuracy, base_acc = 0.70)
) %>%
  pivot_longer(-n, names_to = "base", values_to = "ensemble_acc") %>%
  mutate(base = recode(base,
    acc_55 = "55% base accuracy",
    acc_60 = "60% base accuracy",
    acc_70 = "70% base accuracy"
  ))

ggplot(results, aes(x = n, y = ensemble_acc, color = base)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  geom_hline(yintercept = 0.9, linetype = "dashed", color = "grey40") +
  scale_x_log10(breaks = classifiers, labels = as.character(classifiers)) +
  scale_color_viridis_d(option = "plasma", begin = 0.2, end = 0.85) +
  labs(
    title = "Ensemble Accuracy vs. Number of Independent Classifiers",
    subtitle = "The more classifiers, the closer to certainty — given they make independent errors",
    x = "Number of Classifiers (log scale)",
    y = "Ensemble Accuracy",
    color = "Base Classifier",
    caption = "Dashed line = 90% accuracy threshold"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Key Insight: Even a weak 55% classifier, when 501 of them vote (making independent errors), reaches roughly 99% ensemble accuracy (already about 84% at 101 classifiers). The diversity of errors — not the strength of any individual classifier — is the key.

1.2 Hard Voting vs. Soft Voting

There are two ways to aggregate predictions in a voting classifier:

Method        Mechanism                                               Requires
Hard Voting   Most frequent predicted class (majority vote)           Only class labels
Soft Voting   Average predicted probabilities; pick the highest mean  Class probabilities (predict_proba() / type = "prob")

Soft voting often beats hard voting because it gives more weight to high-confidence predictions. A classifier that says “99% Class A” counts for more than one that says “51% Class A”.
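A tiny toy example (made-up probabilities, base R only) shows how the two schemes can disagree: two lukewarm votes for A outvote one confident vote for B under hard voting, but soft voting sides with the confident classifier:

```r
probs <- rbind(
  clf1 = c(A = 0.51, B = 0.49),  # weakly favors A
  clf2 = c(A = 0.51, B = 0.49),  # weakly favors A
  clf3 = c(A = 0.01, B = 0.99)   # very confidently favors B
)
hard_labels <- colnames(probs)[max.col(probs)]       # per-classifier predicted labels
hard_vote   <- names(which.max(table(hard_labels)))  # majority label: "A"
soft_vote   <- names(which.max(colMeans(probs)))     # highest mean probability: "B"
c(hard = hard_vote, soft = soft_vote)
```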

Let’s simulate this on the Iris dataset using three diverse base learners:

data(iris)

# Use sepal features only — deliberately harder (versicolor/virginica overlap here)
# Petal features are too perfectly separable → all classifiers agree on everything → no
# visible difference between individual and ensemble, which defeats the tutorial's point.
iris_vote <- iris[, c("Sepal.Length", "Sepal.Width", "Species")]

# Train/test split (70/30)
set.seed(MASTER_SEED)
idx <- sample(1:nrow(iris_vote), size = floor(0.7 * nrow(iris_vote)))
train <- iris_vote[idx, ]
test  <- iris_vote[-idx, ]

# --- Three diverse base learners ---
# 1. Shallow Decision Tree (high bias, low variance)
tree_model <- rpart(Species ~ ., data = train, control = rpart.control(maxdepth = 2))

# 2. Random Forest (low bias, low variance — strong learner)
rf_model   <- randomForest(Species ~ ., data = train, ntree = 200)

# 3. LDA (Linear Discriminant Analysis) — linear boundary, different error pattern
lda_model  <- MASS::lda(Species ~ ., data = train)

# --- Individual predictions ---
pred_tree <- predict(tree_model, test, type = "class")
pred_rf   <- predict(rf_model,   test, type = "class")
pred_lda  <- predict(lda_model,  test)$class

# --- Hard Voting: true majority vote across 3 classifiers ---
hard_vote_mat <- data.frame(
  tree = as.character(pred_tree),
  rf   = as.character(pred_rf),
  lda  = as.character(pred_lda)
)
hard_vote <- apply(hard_vote_mat, 1, function(row) names(which.max(table(row))))
hard_vote <- factor(hard_vote, levels = levels(test$Species))

# --- Soft Voting: average class probabilities across 3 classifiers ---
prob_tree <- predict(tree_model, test, type = "prob")
prob_rf   <- predict(rf_model,   test, type = "prob")
prob_lda  <- predict(lda_model,  test)$posterior

avg_prob  <- (prob_tree + prob_rf + prob_lda) / 3
soft_vote <- factor(colnames(avg_prob)[apply(avg_prob, 1, which.max)],
                    levels = levels(test$Species))

# --- Accuracy ---
acc_tree <- mean(pred_tree == test$Species)
acc_rf   <- mean(pred_rf   == test$Species)
acc_lda  <- mean(pred_lda  == test$Species)
acc_hard <- mean(hard_vote == test$Species)
acc_soft <- mean(soft_vote == test$Species)

acc_df <- data.frame(
  Model    = c("Decision Tree", "LDA", "Random Forest", "Hard Vote (3-way)", "Soft Vote (3-way)"),
  Accuracy = c(acc_tree, acc_lda, acc_rf, acc_hard, acc_soft)
)

ggplot(acc_df, aes(x = reorder(Model, Accuracy), y = Accuracy, fill = Model)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = scales::percent(Accuracy, accuracy = 0.1)),
            hjust = -0.1, fontface = "bold") +
  coord_flip(ylim = c(0.4, 1.05)) +
  scale_fill_viridis_d(option = "plasma", begin = 0.2, end = 0.85) +
  labs(
    title = "Classifier Accuracy on Iris Test Set (Sepal Features Only)",
    subtitle = "Voting classifiers combine the strengths of individual models — sepal features make this a harder task",
    x = NULL, y = "Accuracy"
  ) +
  theme_minimal() +
  theme(legend.position = "none")


Part 2: Bagging — Same Algorithm, Different Data

2.1 What Is Bagging?

Instead of training different algorithms on the same data, bagging (Bootstrap AGGregating) trains the same algorithm on different random bootstrapped subsets of the training data.

  • Bagging: sample with replacement (bootstrap). Each predictor may see ~63% of unique training instances.
  • Pasting: sample without replacement.
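The difference is easy to see in base R: a bootstrap sample of size m contains duplicates and covers only ~63% of the distinct instances, while pasting's subsample has no duplicates (illustrative seed, not the tutorial's MASTER_SEED):

```r
set.seed(2024)  # illustrative seed
m <- 10000
boot_idx  <- sample(1:m, m, replace = TRUE)   # bagging: with replacement
paste_idx <- sample(1:m, round(0.63 * m))     # pasting: without replacement

frac_unique_boot  <- length(unique(boot_idx)) / m
frac_unique_paste <- length(unique(paste_idx)) / length(paste_idx)
c(bagging = frac_unique_boot, pasting = frac_unique_paste)  # ~0.632 vs exactly 1
```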

2.2 Why ~63%? The Out-of-Bag Math

When bootstrapping a dataset of size \(m\) with replacement, the probability that any single instance is not selected in one draw is \(\frac{m-1}{m}\). After \(m\) draws:

\[P(\text{not selected in any draw}) = \left(1 - \frac{1}{m}\right)^m \xrightarrow{m \to \infty} e^{-1} \approx 0.368\]

So each bootstrap sample leaves out about 37% of instances — these are the Out-of-Bag (OOB) instances.

Visualizing the Bootstrap: Which Instances Go Where?

The heatmap below shows a small example (20 training instances, 10 trees). Each cell is colored blue if instance \(i\) is in the bootstrap sample for tree \(j\), and orange if it is OOB (held-out for that tree).

set.seed(MASTER_SEED)
n_inst  <- 20   # training instances
n_tr    <- 10   # trees to show

# Build in-bag (1) vs OOB (0) matrix
inbag_mat <- matrix(0L, nrow = n_inst, ncol = n_tr)
for (j in 1:n_tr) {
  selected <- sample(1:n_inst, n_inst, replace = TRUE)
  inbag_mat[unique(selected), j] <- 1L
}

# Compute each instance's OOB count
oob_counts <- rowSums(inbag_mat == 0)

inbag_df <- as.data.frame(inbag_mat) %>%
  mutate(Instance = 1:n_inst) %>%
  pivot_longer(-Instance, names_to = "Tree", values_to = "Status") %>%
  mutate(
    Tree   = factor(as.integer(gsub("V", "", Tree))),
    Status = factor(Status, levels = c(0, 1), labels = c("OOB (held-out)", "In-Bag (training)"))
  )

ggplot(inbag_df, aes(x = Tree, y = factor(Instance), fill = Status)) +
  geom_tile(color = "white", linewidth = 0.5) +
  scale_fill_manual(values = c("OOB (held-out)" = "#F28E2B", "In-Bag (training)" = "#4E79A7")) +
  labs(
    title = "Bootstrap Sample Composition Across Trees",
    subtitle = sprintf("N=%d instances, %d trees — orange cells = OOB (never seen by that tree)",
                       n_inst, n_tr),
    x = "Tree Index", y = "Training Instance",
    fill = NULL
  ) +
  theme_minimal() +
  theme(legend.position = "top",
        axis.text.y = element_text(size = 8),
        panel.grid = element_blank())

Key observations:

  • No instance is OOB for all trees — every instance is seen by most trees.
  • On average, each tree sees ~63% of instances (blue) and skips ~37% (orange).
  • Each tree’s OOB instances form its private validation set. Averaging OOB predictions across all trees gives the OOB error estimate — free cross-validation with no extra data!

m_values <- seq(2, 500, by = 5)
oob_frac  <- (1 - 1 / m_values)^m_values

ggplot(data.frame(m = m_values, oob = oob_frac), aes(x = m, y = oob)) +
  geom_line(color = "#4E79A7", linewidth = 1.2) +
  geom_hline(yintercept = exp(-1), linetype = "dashed", color = "#E15759") +
  annotate("text", x = 400, y = exp(-1) + 0.015,
           label = sprintf("e⁻¹ ≈ %.3f", exp(-1)),
           color = "#E15759", fontface = "bold") +
  labs(
    title = "Fraction of OOB Instances vs. Training Set Size",
    subtitle = "Converges quickly to 36.8% — a free validation set!",
    x = "Training Set Size (m)", y = "OOB Fraction"
  ) +
  theme_minimal()

2.3 Manual Bagging in R

set.seed(MASTER_SEED)
n_trees <- 200

# Binary classification: versicolor vs virginica — deliberately hard (real sepal overlap)
# Setosa vs other is nearly perfectly separable → all bootstrap trees identical → flat line
iris_bin <- iris %>%
  filter(Species %in% c("versicolor", "virginica")) %>%
  mutate(label = droplevels(Species)) %>%
  select(Sepal.Length, Sepal.Width, label)

idx_tr    <- sample(1:nrow(iris_bin), 70)
train_bin <- iris_bin[idx_tr, ]
test_bin  <- iris_bin[-idx_tr, ]

# Train bagged ensemble manually
bag_preds <- matrix(NA_character_, nrow = nrow(test_bin), ncol = n_trees)

for (i in 1:n_trees) {
  boot_idx  <- sample(1:nrow(train_bin), nrow(train_bin), replace = TRUE)
  boot_data <- train_bin[boot_idx, ]
  tree_i    <- rpart(label ~ ., data = boot_data, control = rpart.control(maxdepth = 4))
  bag_preds[, i] <- as.character(predict(tree_i, test_bin, type = "class"))
}

# Hard voting: majority class at each k trees
ensemble_acc <- sapply(1:n_trees, function(k) {
  votes <- bag_preds[, 1:k, drop = FALSE]
  final <- apply(votes, 1, function(row) {
    tbl <- table(row)
    names(which.max(tbl))
  })
  mean(final == as.character(test_bin$label))
})

single_tree <- rpart(label ~ ., data = train_bin, control = rpart.control(maxdepth = 4))
single_acc  <- mean(predict(single_tree, test_bin, type = "class") == test_bin$label)

ggplot(data.frame(n = 1:n_trees, acc = ensemble_acc), aes(x = n, y = acc)) +
  geom_line(color = "#4E79A7", linewidth = 1) +
  geom_hline(yintercept = single_acc, linetype = "dashed", color = "#E15759", linewidth = 1) +
  annotate("text", x = 155, y = single_acc - 0.02,
           label = sprintf("Single Tree: %.1f%%", single_acc * 100),
           color = "#E15759", fontface = "bold") +
  labs(
    title = "Bagging: Ensemble Accuracy as Trees Are Added",
    subtitle = "Versicolor vs Virginica (genuinely overlapping classes) — bagging reduces variance and beats the single tree",
    x = "Number of Trees in Ensemble", y = "Test Accuracy"
  ) +
  coord_cartesian(ylim = c(0.6, 1.0)) +
  theme_minimal()


Part 3: Random Forests — Decorrelated Trees

3.1 From Bagging to Random Forests

Bagging reduces variance (instability) by averaging many trees. But there’s still a problem: if one or two features are very strong predictors, most trees will split on them first — making the trees correlated with each other. Correlated trees don’t reduce variance as effectively.

Random Forests solve this with one simple trick:

At each split, only a random subset of features (typically \(\sqrt{p}\) for classification, \(p/3\) for regression) is considered as split candidates.

This decorrelates the trees — even if Feature A is the best predictor overall, it won’t dominate every tree, because sometimes it won’t even be in the candidate set!
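The mechanism itself is just a per-split random subsample of the feature set. A conceptual base-R sketch of that one step (not how randomForest implements it internally, but the same idea):

```r
feature_names <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
p    <- length(feature_names)
mtry <- floor(sqrt(p))   # classification default: floor(sqrt(4)) = 2

set.seed(7)  # illustrative seed
# Each split draws its own candidate set; with mtry = 2 of 4 features,
# even the strongest feature is absent from the contest half the time:
candidates_per_split <- replicate(5, sample(feature_names, mtry), simplify = FALSE)
candidates_per_split
```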

Visualizing Tree Diversity: Approximating Six Trees from the Forest

R’s randomForest package does not support direct plotting of its internal trees. Below we approximate the visualization by training rpart trees on the same bootstrap samples that the RF used. With only 2 features (petal only) and mtry = 1, different bootstrap samples naturally produce different tree structures — illustrating the diversity that makes ensembles powerful.

set.seed(MASTER_SEED)
# rf-data-prep: 70/30 split reused throughout Parts 3–4
train_idx  <- sample(1:nrow(iris), size = floor(0.7 * nrow(iris)))
iris_train <- iris[train_idx, ]
iris_test  <- iris[-train_idx, ]

# Build a small RF on the 2-feature iris (petal only) for easy visualization
iris_2feat <- iris[, c("Petal.Length", "Petal.Width", "Species")]
train_2f   <- iris_2feat[train_idx, ]

rf_small <- randomForest(
  Species ~ .,
  data       = train_2f,
  ntree      = 100,
  mtry       = 1,       # only 1 feature per split → maximum diversity
  importance = FALSE,
  keep.inbag = TRUE
)

# Extract and plot 6 individual trees
# Since rpart can't enforce mtry, we randomly select 1 feature per bootstrap sample
# to approximate the RF's mtry=1 constraint for visualization purposes
par(mfrow = c(2, 3), mar = c(1, 1, 2, 1))
all_feats <- c("Petal.Length", "Petal.Width")
for (tree_id in c(1, 5, 15, 30, 60, 100)) {
  tree_df <- getTree(rf_small, k = tree_id, labelVar = TRUE)
  n_leaves <- sum(tree_df$status == -1)
  
  boot_obs <- which(rf_small$inbag[, tree_id] > 0)
  if (length(unique(train_2f$Species[boot_obs])) > 1 && length(boot_obs) > 5) {
    # With only 2 features and mtry=1, each split considers only 1 random feature
    # rpart with the full bootstrap sample approximates the structure
    t_i <- rpart(Species ~ .,
                 data    = train_2f[boot_obs, ],
                 control = rpart.control(maxdepth = 4, minsplit = 2, cp = 0))
    rpart.plot(t_i,
               main        = sprintf("Tree #%d  (%d leaves)", tree_id, n_leaves),
               type        = 2,
               extra       = 104,
               box.palette = "RdYlGn",
               shadow.col  = "grey80",
               cex         = 0.7)
  }
}

par(mfrow = c(1, 1))

What to notice: Every tree splits on different features and thresholds. Tree #1 might root on Petal.Width, while Tree #60 roots on Petal.Length. This structural diversity is what makes the ensemble more robust — correlated errors cancel out when averaged.

Tree-to-Tree Correlation: Bagging vs. Random Forests

The power of RF over plain Bagging comes from reducing inter-tree correlation. When trees are correlated, the variance of the average is:

\[\text{Var}\!\left(\bar{T}\right) = \rho \sigma^2 + \frac{1-\rho}{B}\sigma^2\]

As \(B \to \infty\), the second term vanishes but \(\rho \sigma^2\) remains. Lower correlation → lower ensemble variance.
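Plugging illustrative numbers into the formula (σ² = 1; the ρ values below are assumptions chosen to mimic the two settings) makes the variance floor visible:

```r
sigma2  <- 1
ens_var <- function(B, rho) rho * sigma2 + (1 - rho) * sigma2 / B

ens_var(B = 500, rho = 0.6)   # bagging-like correlation: stuck near 0.60
ens_var(B = 500, rho = 0.2)   # RF-like correlation: much lower variance
ens_var(B = 1e6, rho = 0.2)   # even a huge forest cannot drop below rho * sigma^2 = 0.2
```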

set.seed(MASTER_SEED)
n_trees_corr <- 30

# Bagging: all 4 features at each split (mtry = 4)
rf_bag <- randomForest(Species ~ ., data = iris_train, ntree = n_trees_corr,
                       mtry = ncol(iris_train) - 1, importance = FALSE)

# RF: sqrt(4) = 2 features per split
rf_rf  <- randomForest(Species ~ ., data = iris_train, ntree = n_trees_corr,
                       mtry = 2, importance = FALSE)

# Individual tree predictions: randomForest returns a CHARACTER matrix of class names.
# as.numeric("setosa") is NA — must map to 1/2/3 first or cor() is all NA (grey heatmap).
get_tree_preds <- function(rf_obj, newdata) {
  p <- predict(rf_obj, newdata, predict.all = TRUE)
  ind <- p$individual
  lv  <- levels(rf_obj$y)
  matrix(match(as.vector(ind), lv), nrow = nrow(ind), ncol = ncol(ind))
}

bag_preds_mat <- get_tree_preds(rf_bag, iris_test)
rf_preds_mat  <- get_tree_preds(rf_rf,  iris_test)

# Pairwise Pearson correlation between trees' prediction vectors (numeric class codes)
compute_pair_corr <- function(pred_mat) {
  n <- ncol(pred_mat)
  cors <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) for (j in seq_len(n)) {
    cors[i, j] <- cor(pred_mat[, i], pred_mat[, j], use = "complete.obs")
  }
  diag(cors) <- 1
  cors
}

bag_corr <- compute_pair_corr(bag_preds_mat)
rf_corr  <- compute_pair_corr(rf_preds_mat)

corr_to_df <- function(m, label) {
  as.data.frame(m) %>%
    mutate(Tree_i = 1:nrow(m)) %>%
    pivot_longer(-Tree_i, names_to = "Tree_j_str", values_to = "Correlation") %>%
    mutate(Tree_j = as.integer(gsub("V", "", Tree_j_str)), Model = label)
}

corr_df <- bind_rows(
  corr_to_df(bag_corr, sprintf("Bagging (mtry=%d, all features)", ncol(iris_train)-1)),
  corr_to_df(rf_corr,  "Random Forest (mtry=2)")
)

p_bag <- ggplot(filter(corr_df, Model == sprintf("Bagging (mtry=%d, all features)", ncol(iris_train)-1)),
       aes(x = factor(Tree_i), y = factor(Tree_j), fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "#E15759", mid = "white", high = "#4E79A7",
                       midpoint = 0.5, limits = c(0, 1)) +
  labs(title = sprintf("Bagging (mtry=4)\nMean corr: %.3f",
                       mean(bag_corr[upper.tri(bag_corr)], na.rm = TRUE)),
       x = "Tree i", y = "Tree j") +
  theme_minimal(base_size = 9) +
  theme(axis.text = element_blank(), legend.position = "none")

p_rf <- ggplot(filter(corr_df, Model == "Random Forest (mtry=2)"),
       aes(x = factor(Tree_i), y = factor(Tree_j), fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "#E15759", mid = "white", high = "#4E79A7",
                       midpoint = 0.5, limits = c(0, 1)) +
  labs(title = sprintf("Random Forest (mtry=2)\nMean corr: %.3f",
                       mean(rf_corr[upper.tri(rf_corr)], na.rm = TRUE)),
       x = "Tree i", y = "Tree j", fill = "Correlation") +
  theme_minimal(base_size = 9) +
  theme(axis.text = element_blank())

grid.arrange(p_bag, p_rf, ncol = 2,
  top = "Inter-Tree Prediction Correlation — Bagging vs. Random Forest\n(Lower correlation = better variance reduction)")

Note: randomForest::predict(..., predict.all = TRUE)$individual is a character matrix of class names. Pearson correlation needs numeric codes — we map levels to 1, 2, 3 (a common shortcut for comparing agreement between classifiers).

Result: Bagging trees are more correlated with each other (darker blue throughout) because they all tend to split on the same dominant features. RF trees have lower correlation — the random feature subsets force them to find different patterns, which is the source of RF’s variance reduction advantage.

3.2 Building a Random Forest in R

R’s randomForest package is the standard tool. The key hyperparameters are:

Parameter   Meaning                          Default
ntree       Number of trees                  500
mtry        Features considered per split    \(\lfloor\sqrt{p}\rfloor\) (classification), \(\lfloor p/3 \rfloor\) (regression)
nodesize    Minimum terminal node size       1 (classification), 5 (regression)
maxnodes    Maximum number of leaf nodes     NULL (unlimited)

# Fit Random Forest (train_idx/iris_train/iris_test defined in rf-data-prep chunk)
rf_iris <- randomForest(
  Species ~ .,
  data     = iris_train,
  ntree    = 500,
  mtry     = 2,            # sqrt(4 features) ≈ 2
  importance = TRUE
)

print(rf_iris)
## 
## Call:
##  randomForest(formula = Species ~ ., data = iris_train, ntree = 500,      mtry = 2, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 2.86%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         37          0         0  0.00000000
## versicolor      0         34         1  0.02857143
## virginica       0          2        31  0.06060606

Reading the Output

The randomForest summary tells us:

  • OOB estimate of error rate: a cross-validation-style error estimate, computed from the ~37% of instances each tree never saw during training — no separate validation set needed!
  • Confusion matrix: Per-class breakdown of OOB predictions.

# OOB error as a function of number of trees
oob_errors <- data.frame(
  n_trees = 1:500,
  OOB = rf_iris$err.rate[, "OOB"],
  setosa = rf_iris$err.rate[, "setosa"],
  versicolor = rf_iris$err.rate[, "versicolor"],
  virginica = rf_iris$err.rate[, "virginica"]
) %>%
  pivot_longer(-n_trees, names_to = "class", values_to = "error")

ggplot(oob_errors, aes(x = n_trees, y = error, color = class)) +
  geom_line(linewidth = 0.8, alpha = 0.8) +
  scale_color_manual(
    values = c(OOB = "black", setosa = "#E15759",
               versicolor = "#4E79A7", virginica = "#F28E2B")
  ) +
  labs(
    title = "OOB Error Rate Convergence",
    subtitle = "Error stabilizes after ~100–150 trees — diminishing returns beyond that",
    x = "Number of Trees", y = "OOB Error Rate", color = "Class"
  ) +
  theme_minimal()

How many trees do you need? More trees never hurt accuracy (unlike deeper single trees), but they do increase computation time. A common strategy: plot the OOB error curve and stop when it flattens.
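The "stop when it flattens" rule can be automated. pick_ntree below is a hypothetical helper (not part of randomForest), demonstrated on a synthetic decaying curve; in practice you would pass it rf_iris$err.rate[, "OOB"]:

```r
pick_ntree <- function(oob_curve, tol = 0.002) {
  final_err  <- oob_curve[length(oob_curve)]
  within_tol <- abs(oob_curve - final_err) <= tol
  # smallest tree count from which the curve never again leaves the tolerance band
  min(which(rev(cumsum(rev(!within_tol))) == 0))
}

oob_demo <- 0.3 * exp(-(1:500) / 40) + 0.03  # synthetic OOB-style error curve
pick_ntree(oob_demo)
```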

3.2.1 Decision Boundary Evolution: 1 → 500 Trees

This is one of the most revealing visualizations for understanding Random Forests. A single Decision Tree creates hard, rectangular boundaries. As we add more trees, the ensemble’s boundary becomes smoother and more accurate — averaging out the idiosyncrasies of individual trees.

set.seed(MASTER_SEED)

# Use 2-feature iris for a clean 2D boundary plot
iris_2d_full <- iris[, c("Petal.Length", "Petal.Width", "Species")]
train_2d <- iris_2d_full[train_idx, ]
test_2d  <- iris_2d_full[-train_idx, ]

# Create a fine prediction grid
pl_rng <- range(iris$Petal.Length)
pw_rng <- range(iris$Petal.Width)
grid_2d <- expand.grid(
  Petal.Length = seq(pl_rng[1] - 0.2, pl_rng[2] + 0.2, length.out = 150),
  Petal.Width  = seq(pw_rng[1] - 0.1, pw_rng[2] + 0.1, length.out = 150)
)

# Build a large RF; we'll use predictions from the first k trees
rf_big <- randomForest(
  Species ~ .,
  data    = train_2d,
  ntree   = 500,
  mtry    = 1,
  importance = FALSE
)

# Boundary for a single decision tree
single_dt <- rpart(Species ~ ., data = train_2d, control = rpart.control(maxdepth = 5))
grid_2d$single_tree <- predict(single_dt, grid_2d, type = "class")

# RF predictions on grid for boundary regions
all_preds_grid <- predict(rf_big, grid_2d, predict.all = TRUE)$individual
# Separately predict on the actual test set (must NOT slice grid rows)
all_preds_test <- predict(rf_big, test_2d, predict.all = TRUE)$individual
# all_preds_grid: nrow(grid_2d) x 500; all_preds_test: nrow(test_2d) x 500

ntree_snapshots <- c(1, 5, 10, 50, 100, 500)
species_levels  <- levels(iris$Species)

snap_list <- lapply(ntree_snapshots, function(nt) {
  subset_preds <- all_preds_grid[, 1:nt, drop = FALSE]
  # $individual is a character matrix (class names), not 1/2/3 codes
  maj_class <- apply(subset_preds, 1, function(row) {
    tbl <- table(row)
    match(names(tbl)[which.max(tbl)], species_levels)
  })
  data.frame(
    Petal.Length = grid_2d$Petal.Length,
    Petal.Width  = grid_2d$Petal.Width,
    pred_class   = factor(species_levels[maj_class], levels = species_levels),
    ntree        = nt
  )
})
boundary_df <- do.call(rbind, snap_list)

palette_rf <- c(setosa = "#AED6F1", versicolor = "#A9DFBF", virginica = "#F9E79F")

plots_bd <- lapply(ntree_snapshots, function(nt) {
  bd_sub <- filter(boundary_df, ntree == nt)
  # Use test set predictions from rf_big (not the grid rows)
  maj_test <- apply(all_preds_test[, 1:nt, drop = FALSE], 1, function(row) {
    tbl <- table(row)
    names(tbl)[which.max(tbl)]
  })
  acc_nt <- mean(maj_test == as.character(test_2d$Species))

  ggplot() +
    geom_tile(data = bd_sub, aes(x = Petal.Length, y = Petal.Width, fill = pred_class),
              alpha = 0.65) +
    geom_point(data = train_2d,
               aes(x = Petal.Length, y = Petal.Width, color = Species, shape = Species),
               size = 2) +
    scale_fill_manual(values = palette_rf, guide = "none") +
    scale_color_manual(values = c(setosa = "#1A5276", versicolor = "#1E8449",
                                   virginica = "#922B21"), guide = "none") +
    labs(
      title    = sprintf("%d %s", nt, if (nt == 1) "Tree" else "Trees"),
      subtitle = sprintf("Test acc: %.1f%%", acc_nt * 100),
      x = "Petal Length", y = "Petal Width"
    ) +
    theme_minimal(base_size = 9) +
    theme(plot.title = element_text(face = "bold"))
})

do.call(grid.arrange, c(plots_bd, ncol = 3,
  top = "Decision Boundary Evolution: Single Tree → 500-Tree Random Forest\n(Background = predicted region; Points = training instances)"))

What to observe:

  • 1 tree: Jagged, axis-aligned boundaries — the tree memorizes noise.
  • 5–10 trees: Boundaries are still rough but visibly smoother.
  • 50–100 trees: Boundaries closely match the true class regions.
  • 500 trees: Very smooth — the “average” boundary has washed out all individual tree quirks.

3.3 Prediction and Evaluation

rf_pred <- predict(rf_iris, iris_test)
rf_acc  <- mean(rf_pred == iris_test$Species)

cat(sprintf("Random Forest Test Accuracy: %.1f%%\n", rf_acc * 100))
## Random Forest Test Accuracy: 91.1%
# Confusion matrix
table(Predicted = rf_pred, Actual = iris_test$Species)
##             Actual
## Predicted    setosa versicolor virginica
##   setosa         13          0         0
##   versicolor      0         13         2
##   virginica       0          2        15
# Class probability heatmap
prob_mat <- predict(rf_iris, iris_test, type = "prob")
prob_df <- as.data.frame(prob_mat) %>%
  mutate(obs = 1:n(), actual = iris_test$Species) %>%
  pivot_longer(cols = c(setosa, versicolor, virginica),
               names_to = "class", values_to = "prob")

ggplot(prob_df, aes(x = class, y = obs, fill = prob)) +
  geom_tile() +
  geom_text(aes(label = round(prob, 2)), size = 2.5) +
  facet_grid(actual ~ ., scales = "free_y", space = "free") +
  scale_fill_gradient(low = "white", high = "#4E79A7") +
  labs(
    title = "Random Forest: Predicted Class Probabilities (Test Set)",
    subtitle = "Each row is a test observation; columns are class membership probabilities",
    x = "Predicted Class", y = "Observation", fill = "Probability"
  ) +
  theme_minimal() +
  theme(strip.text.y = element_text(angle = 0))

3.4 How Individual Trees Vote: A Closer Look

For any given test observation, the RF prediction is simply the majority vote across all trees. But votes are rarely unanimous — some observations are confidently predicted while others are borderline. Let’s visualize the vote distribution for several test points.

set.seed(MASTER_SEED)

# Use the 500-tree full iris RF
all_test_votes <- predict(rf_iris, iris_test, predict.all = TRUE)$individual
# randomForest returns class NAMES (character matrix), not numeric 1/2/3

# Select 6 representative observations: 2 easy (pure), 2 moderate, 2 borderline
vote_summaries <- lapply(1:nrow(iris_test), function(i) {
  votes <- all_test_votes[i, ]
  tbl   <- table(votes)
  max_frac <- max(tbl) / ncol(all_test_votes)
  data.frame(obs = i, max_frac = max_frac,
             actual = as.character(iris_test$Species[i]))
}) %>% do.call(rbind, .) %>%
  arrange(max_frac)

# Pick 2 from each confidence tier
chosen_obs <- c(
  tail(vote_summaries$obs, 2),         # most confident (near 100%)
  vote_summaries$obs[c(                 # moderate
    floor(nrow(vote_summaries) * 0.4),
    floor(nrow(vote_summaries) * 0.6))],
  head(vote_summaries$obs, 2)           # least confident (borderline)
)

vote_plots <- lapply(seq_along(chosen_obs), function(idx) {
  i <- chosen_obs[idx]
  votes     <- all_test_votes[i, ]
  vote_df   <- as.data.frame(table(votes)) %>%
    mutate(Species = as.character(votes),
           Fraction = Freq / ncol(all_test_votes))
  true_sp   <- as.character(iris_test$Species[i])
  predicted <- names(which.max(table(votes)))
  correct   <- (predicted == true_sp)
  conf      <- max(table(votes)) / ncol(all_test_votes)

  ggplot(vote_df, aes(x = reorder(Species, -Fraction), y = Fraction, fill = Species)) +
    geom_col(width = 0.6) +
    geom_text(aes(label = scales::percent(Fraction, accuracy = 0.1)),
              vjust = -0.4, fontface = "bold", size = 3) +
    scale_fill_manual(values = c(setosa = "#1A5276", versicolor = "#1E8449",
                                  virginica = "#922B21"), guide = "none") +
    scale_y_continuous(labels = scales::percent, limits = c(0, 1.1)) +
    labs(
      title    = sprintf("Obs #%d  (Actual: %s)", i, true_sp),
      subtitle = sprintf("Pred: %s %s | Conf: %.0f%%",
                         predicted, ifelse(correct, "✓", "✗"), conf * 100),
      x = NULL, y = "Fraction of 500 Trees"
    ) +
    theme_minimal(base_size = 9) +
    theme(
      plot.title    = element_text(face = "bold"),
      plot.subtitle = element_text(color = ifelse(correct, "#1E8449", "#922B21"))
    )
})

do.call(grid.arrange, c(vote_plots, ncol = 3,
  top = "Per-Tree Vote Distribution for Selected Test Observations\n(High confidence: one class dominates; Low confidence: votes split across classes)"))

Reading the vote bars:

  • High confidence (top row): Nearly all 500 trees agree → the RF is very sure of its prediction. The vote fraction maps directly to the predicted class probability.
  • Borderline cases (bottom row): Votes are split — multiple classes get substantial fractions. The RF’s probability estimate reflects genuine uncertainty, not overconfidence.


Part 4: Feature Importance

One of Random Forests’ most useful properties is automatic feature ranking. Scikit-Learn measures importance by how much each feature reduces node impurity (Gini or MSE), weighted by the proportion of training samples passing through that node, averaged across all trees [1].

\[\text{Importance}(X_j) = \frac{1}{B} \sum_{b=1}^{B} \sum_{t \in \mathcal{T}_b} \frac{n_t}{N} \cdot \Delta \text{impurity}(t, X_j)\]

where \(B\) = number of trees, \(\mathcal{T}_b\) = set of nodes in tree \(b\) that split on \(X_j\), \(n_t\) = samples at node \(t\), \(N\) = total training samples.
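A worked instance of this formula for a single root split, using the familiar iris class counts (the perfect setosa split is assumed here for illustration):

```r
gini <- function(counts) { p <- counts / sum(counts); 1 - sum(p^2) }

parent <- gini(c(50, 50, 50))   # root impurity: 2/3
left   <- gini(c(50, 0, 0))     # pure setosa child: 0
right  <- gini(c(0, 50, 50))    # versicolor/virginica child: 1/2

# At the root, n_t / N = 150/150 = 1, so the weighted impurity decrease is:
decrease <- parent - (50/150) * left - (100/150) * right
decrease   # 1/3 — this node's contribution to the splitting feature's importance
```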

In R’s randomForest, importance is available in two flavors:

  • MeanDecreaseAccuracy: how much accuracy drops when the feature is randomly permuted (OOB-based)
  • MeanDecreaseGini: total decrease in node impurity summed across trees

imp <- importance(rf_iris)
imp_df <- as.data.frame(imp) %>%
  rownames_to_column("Feature") %>%
  arrange(desc(MeanDecreaseGini))

p1 <- ggplot(imp_df, aes(x = reorder(Feature, MeanDecreaseGini),
                          y = MeanDecreaseGini, fill = MeanDecreaseGini)) +
  geom_col() +
  coord_flip() +
  scale_fill_viridis_c(option = "plasma", begin = 0.2, end = 0.85) +
  labs(title = "Feature Importance\n(Mean Decrease Gini)", x = NULL) +
  theme_minimal() +
  theme(legend.position = "none")

p2 <- ggplot(imp_df, aes(x = reorder(Feature, MeanDecreaseAccuracy),
                          y = MeanDecreaseAccuracy, fill = MeanDecreaseAccuracy)) +
  geom_col() +
  coord_flip() +
  scale_fill_viridis_c(option = "plasma", begin = 0.2, end = 0.85) +
  labs(title = "Feature Importance\n(Mean Decrease Accuracy)", x = NULL) +
  theme_minimal() +
  theme(legend.position = "none")

grid.arrange(p1, p2, ncol = 2,
  top = "Random Forest Feature Importance — Iris Dataset")

varImpPlot(rf_iris,
  main = "Variable Importance Plot (Random Forest — Iris)")

Interpretation: Petal features dominate. This matches the Decision Tree results from Week 02 — Petal.Length and Petal.Width are the most discriminative features for classifying Iris species.

Caution: Gini importance can be biased toward features with many unique values (high cardinality). Permutation importance (MeanDecreaseAccuracy) is more reliable but slower to compute.
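This bias is easy to demonstrate. The sketch below (an illustrative aside, not part of the tutorial pipeline, using a fresh seed) adds a pure-noise continuous feature to iris and compares its two importance scores:

```r
library(randomForest)

# Add a pure-noise feature with 150 unique values (maximum cardinality)
set.seed(123)
iris_noise <- iris
iris_noise$noise <- runif(nrow(iris_noise))

rf_noise <- randomForest(Species ~ ., data = iris_noise,
                         ntree = 500, importance = TRUE)
round(importance(rf_noise)[, c("MeanDecreaseAccuracy", "MeanDecreaseGini")], 2)
# The noise feature tends to pick up a visibly nonzero Gini score (it offers
# up to 149 candidate thresholds at every node), while its permutation
# importance stays near zero.
```

This is why permutation importance is the safer default when features differ widely in cardinality.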

4.1 Feature Usage Heatmap: Which Features Split Which Trees?

Beyond how important features are, we can ask: in which trees is each feature actually used for splitting? This heatmap shows how many times each feature is used across all splits in each tree — revealing how the random subset mechanism distributes feature usage across the forest.

set.seed(MASTER_SEED)
n_show_trees <- 60

# Feature usage per tree: count how many splits use each feature
feature_usage <- lapply(1:n_show_trees, function(k) {
  tree_k  <- getTree(rf_iris, k = k, labelVar = TRUE)
  splits  <- tree_k[tree_k$status == 1, "split var"]
  feats   <- names(iris)[-5]
  counts  <- sapply(feats, function(f) sum(splits == f))
  data.frame(Tree = k, Feature = feats, Count = counts)
}) %>% do.call(rbind, .)

ggplot(feature_usage, aes(x = Tree, y = Feature, fill = Count)) +
  geom_tile(color = "white", linewidth = 0.3) +
  scale_fill_gradient(low = "white", high = "#4E79A7",
                      name = "Split\nCount") +
  scale_x_continuous(breaks = seq(0, n_show_trees, 10)) +
  labs(
    title    = sprintf("Feature Usage Heatmap: First %d Trees of the Random Forest", n_show_trees),
    subtitle = "Darker = more splits using that feature in a tree. Random subsets → no feature dominates every tree.",
    x = "Tree Index", y = NULL
  ) +
  theme_minimal() +
  theme(panel.grid = element_blank())

Reading the heatmap: Each cell shows how many splits in tree \(j\) used feature \(i\). Dark columns mean that tree relied heavily on a particular feature. The variation across columns shows that random feature subsetting achieves its goal — different trees learn from different aspects of the data.


Part 4b: Bias-Variance Tradeoff — RF vs. Single Tree

One of the most fundamental results in statistical learning theory is the Bias-Variance tradeoff:

\[\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}\]

A single deep Decision Tree has low bias (it fits training data well) but high variance (it changes wildly with different training sets). Random Forests dramatically reduce variance while keeping bias roughly constant. We can demonstrate this empirically:

set.seed(MASTER_SEED)

# Simulate a regression task: y = sin(2*pi*x) + noise
n_train   <- 80
n_test    <- 200
n_repeats <- 50  # number of different training sets

x_test  <- seq(0, 1, length.out = n_test)
y_true  <- sin(2 * pi * x_test)

# Storage for predictions from each model across repetitions
preds_tree <- matrix(NA, nrow = n_test, ncol = n_repeats)
preds_rf3  <- matrix(NA, nrow = n_test, ncol = n_repeats)
preds_rf50 <- matrix(NA, nrow = n_test, ncol = n_repeats)

for (r in 1:n_repeats) {
  x_tr <- runif(n_train)
  y_tr <- sin(2 * pi * x_tr) + rnorm(n_train, sd = 0.3)
  df_tr <- data.frame(x = x_tr, y = y_tr)
  df_te <- data.frame(x = x_test)

  # Deep single tree
  t_m <- rpart(y ~ x, data = df_tr,
               control = rpart.control(maxdepth = 8, cp = 0, minsplit = 2))
  preds_tree[, r] <- predict(t_m, df_te)

  # RF with 3 trees (small — still high variance)
  rf3 <- randomForest(y ~ x, data = df_tr, ntree = 3)
  preds_rf3[, r] <- predict(rf3, df_te)

  # RF with 50 trees (converged)
  rf50 <- randomForest(y ~ x, data = df_tr, ntree = 50)
  preds_rf50[, r] <- predict(rf50, df_te)
}

# Compute bias^2 and variance for each model
bv_stats <- function(pred_mat, y_true) {
  mean_pred <- rowMeans(pred_mat)
  bias2     <- mean((mean_pred - y_true)^2)
  variance  <- mean(apply(pred_mat, 1, var))
  list(bias2 = bias2, variance = variance, total = bias2 + variance)
}

st_tree <- bv_stats(preds_tree, y_true)
st_rf3  <- bv_stats(preds_rf3,  y_true)
st_rf50 <- bv_stats(preds_rf50, y_true)

# Plot: mean prediction ± 1 SD ribbon for each model
make_ribbon_df <- function(pred_mat, label) {
  data.frame(
    x    = x_test,
    mean = rowMeans(pred_mat),
    sd   = apply(pred_mat, 1, sd),
    model = label
  )
}

ribbon_df <- bind_rows(
  make_ribbon_df(preds_tree, "Deep Single Tree"),
  make_ribbon_df(preds_rf3,  "RF (3 trees)"),
  make_ribbon_df(preds_rf50, "RF (50 trees)")
) %>%
  mutate(model = factor(model, levels = c("Deep Single Tree", "RF (3 trees)", "RF (50 trees)")))

cols_bv <- c("Deep Single Tree" = "#E15759", "RF (3 trees)" = "#F28E2B", "RF (50 trees)" = "#4E79A7")

p_ribbon <- ggplot(ribbon_df, aes(x = x, color = model, fill = model)) +
  geom_ribbon(aes(ymin = mean - sd, ymax = mean + sd), alpha = 0.15, color = NA) +
  geom_line(aes(y = mean), linewidth = 1.1) +
  geom_line(data = data.frame(x = x_test, y = y_true),
            aes(x = x, y = y), color = "black", linetype = "dashed",
            linewidth = 1, inherit.aes = FALSE) +
  scale_color_manual(values = cols_bv) +
  scale_fill_manual(values  = cols_bv) +
  labs(
    title    = "Bias-Variance Tradeoff: Mean Prediction ± 1 SD over 50 Training Sets",
    subtitle = "Black dashed = true function; Ribbon = variance across 50 different training datasets",
    x = "x", y = "y", color = NULL, fill = NULL
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

# Bar chart of bias^2 vs variance
bv_df <- data.frame(
  Model    = rep(c("Deep Single Tree", "RF (3 trees)", "RF (50 trees)"), each = 2),
  Component = rep(c("Bias²", "Variance"), 3),
  Value    = c(st_tree$bias2, st_tree$variance,
               st_rf3$bias2,  st_rf3$variance,
               st_rf50$bias2, st_rf50$variance)
) %>%
  mutate(Model = factor(Model, levels = c("Deep Single Tree", "RF (3 trees)", "RF (50 trees)")))

p_bar <- ggplot(bv_df, aes(x = Model, y = Value, fill = Component)) +
  geom_col(position = "stack", width = 0.55) +
  geom_text(aes(label = sprintf("%.4f", Value)),
            position = position_stack(vjust = 0.5),
            color = "white", fontface = "bold", size = 3.5) +
  scale_fill_manual(values = c("Bias²" = "#E15759", "Variance" = "#4E79A7")) +
  labs(
    title = "Bias² + Variance Decomposition",
    x = NULL, y = "Expected Error", fill = NULL
  ) +
  theme_minimal() +
  theme(legend.position = "bottom", axis.text.x = element_text(angle = 15, hjust = 1))

grid.arrange(p_ribbon, p_bar, ncol = 2,
  top = sprintf(
    "Empirical Bias-Variance: Deep Tree (total=%.4f) | RF-3 (total=%.4f) | RF-50 (total=%.4f)",
    st_tree$total, st_rf3$total, st_rf50$total))

The takeaway:

- A deep single tree shows high variance (wide ribbon — each new training set gives very different predictions) with moderate bias.
- RF with 3 trees reduces variance significantly but is still noisy.
- RF with 50 trees has the lowest variance — the ribbon nearly collapses to the mean, and that mean closely tracks the true function. This is the power of ensemble averaging.


Part 5: Hyperparameter Tuning

5.1 Key Hyperparameters

Random Forest Hyperparameters in R’s randomForest Package

| Parameter | Controls | Default | Tuning Tip |
|---|---|---|---|
| ntree | Number of trees in the forest | 500 | Plot OOB curve; stop when it flattens |
| mtry | Number of features sampled at each split | √p (classif.) / p/3 (regr.) | Try √p, √p/2, 2√p; tune via OOB |
| nodesize | Minimum size of terminal (leaf) nodes | 1 (classif.) / 5 (regr.) | Larger → smaller trees (regularization) |
| maxnodes | Maximum number of leaf nodes per tree | Unlimited | Equivalent to a max-depth control |
| sampsize | Size of bootstrap sample | N (with replacement) | Smaller → more diversity (more variance reduction) |

5.2 Tuning mtry

set.seed(MASTER_SEED)
mtry_vals <- 1:4
oob_by_mtry <- sapply(mtry_vals, function(m) {
  rf_m <- randomForest(Species ~ ., data = iris_train, ntree = 300, mtry = m)
  rf_m$err.rate[300, "OOB"]
})

ggplot(data.frame(mtry = mtry_vals, oob = oob_by_mtry), aes(x = mtry, y = oob)) +
  geom_line(color = "#4E79A7", linewidth = 1.2) +
  geom_point(size = 3, color = "#E15759") +
  geom_text(aes(label = sprintf("%.3f", oob)), vjust = -0.8, fontface = "bold") +
  labs(
    title = "OOB Error Rate vs. mtry",
    subtitle = "Find the sweet spot — usually near √p",
    x = "mtry (features per split)", y = "OOB Error Rate"
  ) +
  scale_x_continuous(breaks = mtry_vals) +
  theme_minimal()

best_mtry <- mtry_vals[which.min(oob_by_mtry)]
cat("Best mtry:", best_mtry, "| OOB error:", round(min(oob_by_mtry), 4))
## Best mtry: 2 | OOB error: 0.0286
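The manual loop above can also be automated: randomForest ships a tuneRF() helper that repeatedly multiplies or divides mtry by stepFactor and keeps searching while OOB error improves. A brief sketch (assuming iris_train and MASTER_SEED from earlier sections):

```r
library(randomForest)

set.seed(MASTER_SEED)
tuned <- tuneRF(
  x = iris_train[, 1:4],     # predictors
  y = iris_train$Species,    # response
  ntreeTry   = 300,          # trees grown per candidate mtry
  stepFactor = 2,            # multiply/divide mtry by this each step
  improve    = 0.01,         # minimum relative OOB improvement to continue
  trace = FALSE, plot = FALSE
)
tuned   # matrix of (mtry, OOBError) pairs; pick the row with the lowest error
```

Its answer should agree with the manual grid search, since both optimize the same OOB criterion.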

5.3 Effect of Tree Depth (nodesize)

set.seed(MASTER_SEED)
nodesizes <- c(1, 3, 5, 10, 20)
results_ns <- lapply(nodesizes, function(ns) {
  rf_ns <- randomForest(Species ~ ., data = iris_train,
                         ntree = 300, nodesize = ns, importance = FALSE)
  data.frame(
    nodesize = ns,
    oob_err  = rf_ns$err.rate[300, "OOB"],
    test_acc = mean(predict(rf_ns, iris_test) == iris_test$Species)
  )
})
results_ns <- do.call(rbind, results_ns)

ggplot(results_ns %>%
         pivot_longer(c(oob_err, test_acc), names_to = "metric", values_to = "value") %>%
         mutate(metric = recode(metric, oob_err = "OOB Error Rate",
                                        test_acc = "Test Accuracy")),
       aes(x = nodesize, y = value, color = metric)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_color_manual(values = c("OOB Error Rate" = "#E15759", "Test Accuracy" = "#4E79A7")) +
  labs(
    title = "Effect of nodesize on OOB Error and Test Accuracy",
    subtitle = "Larger nodesize = shallower, more regularized trees",
    x = "nodesize", y = "Value", color = NULL
  ) +
  scale_x_continuous(breaks = nodesizes) +
  theme_minimal() +
  theme(legend.position = "bottom")


Part 6: Extra-Trees — Even More Randomness

6.1 How Extra-Trees Differ

Extremely Randomized Trees (Extra-Trees) go one step further than Random Forests [1]:

| Property | Random Forest | Extra-Trees |
|---|---|---|
| Feature subset per split | Random subset | Random subset |
| Split threshold | Optimal (searched exhaustively) | Random (drawn uniformly) |
| Training speed | Slower (best-threshold search) | Faster |
| Bias | Lower | Slightly higher |
| Variance | Higher | Lower |

The random thresholds act as an additional regularizer, often preventing overfitting even better than Random Forests — especially on noisy datasets.

R’s rpart does not have a built-in splitter = "random" option (unlike scikit-learn). We can simulate Extra-Trees by manually building trees with random feature subsets and random thresholds at each split, using a simple recursive partitioning function.

set.seed(MASTER_SEED)
n_et <- 300

# Manual Extra-Tree builder: random feature + random threshold at each split
build_extra_tree <- function(data, target, features, max_depth = 6, min_size = 5, depth = 0) {
  labels <- data[[target]]
  if (depth >= max_depth || length(unique(labels)) == 1 || nrow(data) < min_size) {
    return(list(type = "leaf", class = names(which.max(table(labels)))))
  }
  # Random feature subset (mtry = sqrt(p))
  mtry <- max(1, floor(sqrt(length(features))))
  feat_sub <- sample(features, mtry)
  best_gini <- Inf; best_feat <- NULL; best_thresh <- NULL
  for (f in feat_sub) {
    vals <- data[[f]]
    # RANDOM threshold from the feature's range (the Extra-Trees key difference)
    thresh <- runif(1, min(vals), max(vals))
    left_idx  <- which(vals <= thresh)
    right_idx <- which(vals >  thresh)
    if (length(left_idx) < 2 || length(right_idx) < 2) next
    gini_l <- 1 - sum((table(labels[left_idx]) / length(left_idx))^2)
    gini_r <- 1 - sum((table(labels[right_idx]) / length(right_idx))^2)
    gini   <- (length(left_idx) * gini_l + length(right_idx) * gini_r) / nrow(data)
    if (gini < best_gini) {
      best_gini <- gini; best_feat <- f; best_thresh <- thresh
    }
  }
  if (is.null(best_feat)) {
    return(list(type = "leaf", class = names(which.max(table(labels)))))
  }
  left_data  <- data[data[[best_feat]] <= best_thresh, ]
  right_data <- data[data[[best_feat]] >  best_thresh, ]
  list(type = "split", feature = best_feat, threshold = best_thresh,
       left  = build_extra_tree(left_data,  target, features, max_depth, min_size, depth + 1),
       right = build_extra_tree(right_data, target, features, max_depth, min_size, depth + 1))
}

predict_extra_tree <- function(tree, newdata) {
  sapply(1:nrow(newdata), function(i) {
    node <- tree
    while (node$type == "split") {
      node <- if (newdata[[node$feature]][i] <= node$threshold) node$left else node$right
    }
    node$class
  })
}

features <- names(iris_train)[1:4]
et_preds <- matrix(NA, nrow = nrow(iris_test), ncol = n_et)

for (i in 1:n_et) {
  # Note: canonical Extra-Trees (Geurts et al. [5]) trains each tree on the
  # full sample; we bootstrap here so the comparison with the bagged RF is
  # like-for-like.
  boot_i <- sample(1:nrow(iris_train), nrow(iris_train), replace = TRUE)
  et_tree <- build_extra_tree(iris_train[boot_i, ], "Species", features)
  et_preds[, i] <- predict_extra_tree(et_tree, iris_test)
}

et_acc_by_n <- sapply(1:n_et, function(k) {
  votes <- et_preds[, 1:k, drop = FALSE]
  final <- apply(votes, 1, function(row) names(which.max(table(row))))
  mean(final == as.character(iris_test$Species))
})

# RF baseline: predict with first k trees from rf_iris (must use predict.all)
all_rf_test_preds <- predict(rf_iris, iris_test, predict.all = TRUE)$individual
n_rf_trees <- ncol(all_rf_test_preds)

rf_acc_by_n <- sapply(1:n_et, function(k) {
  k_use <- min(k, n_rf_trees)
  votes <- all_rf_test_preds[, 1:k_use, drop = FALSE]
  final <- apply(votes, 1, function(row) names(which.max(table(row))))
  mean(final == as.character(iris_test$Species))
})

ggplot(data.frame(n = 1:n_et,
                  ExtraTrees = et_acc_by_n,
                  RandomForest = rf_acc_by_n) %>%
         pivot_longer(-n, names_to = "model", values_to = "acc"),
       aes(x = n, y = acc, color = model)) +
  geom_line(linewidth = 1, alpha = 0.8) +
  scale_color_manual(values = c(ExtraTrees = "#F28E2B", RandomForest = "#4E79A7")) +
  labs(
    title = "Extra-Trees vs. Random Forest: Test Accuracy by Number of Trees",
    subtitle = "Extra-Trees train faster but require more trees to converge",
    x = "Number of Trees", y = "Test Accuracy", color = NULL
  ) +
  coord_cartesian(ylim = c(0.85, 1.0)) +
  theme_minimal() +
  theme(legend.position = "bottom")


Part 7: Random Forests for Regression

Random Forests work equally well for regression tasks. Instead of voting on class labels, trees average their numerical predictions. Node impurity is measured by MSE instead of Gini/Entropy.

\[\hat{y}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)\]
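This averaging formula can be checked directly: for a regression forest, predict(..., predict.all = TRUE) returns every tree's prediction, and their row means reproduce the aggregate forecast. A quick sketch on mtcars (illustrative seed, separate from the tutorial's split):

```r
library(randomForest)

set.seed(11)
rf_demo <- randomForest(mpg ~ ., data = mtcars, ntree = 100)

pa <- predict(rf_demo, mtcars[1:3, ], predict.all = TRUE)
cbind(aggregate = pa$aggregate,             # hat{y}(x) from the forest
      tree_mean = rowMeans(pa$individual))  # (1/B) * sum_b T_b(x)
# The two columns agree to floating-point precision.
```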

data(mtcars)
set.seed(MASTER_SEED)

# Predict mpg from car attributes
tr_idx <- sample(1:nrow(mtcars), 22)
mtcars_train <- mtcars[tr_idx, ]
mtcars_test  <- mtcars[-tr_idx, ]

# Compare: single tree vs random forest
single_reg <- rpart(mpg ~ ., data = mtcars_train, control = rpart.control(maxdepth = 3))
rf_reg     <- randomForest(mpg ~ ., data = mtcars_train, ntree = 500, importance = TRUE)

pred_single <- predict(single_reg, mtcars_test)
pred_rf     <- predict(rf_reg,     mtcars_test)

rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

cat(sprintf("Single Tree RMSE : %.3f mpg\n", rmse(mtcars_test$mpg, pred_single)))
## Single Tree RMSE : 3.755 mpg
cat(sprintf("Random Forest RMSE: %.3f mpg\n", rmse(mtcars_test$mpg, pred_rf)))
## Random Forest RMSE: 2.297 mpg
# Side-by-side predicted vs actual
plot_df <- data.frame(
  actual     = mtcars_test$mpg,
  tree_pred  = pred_single,
  rf_pred    = pred_rf
)

p_tree <- ggplot(plot_df, aes(x = actual, y = tree_pred)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey50") +
  geom_point(color = "#E15759", size = 3, alpha = 0.8) +
  labs(title = "Single Regression Tree",
       subtitle = sprintf("RMSE = %.2f mpg", rmse(mtcars_test$mpg, pred_single)),
       x = "Actual MPG", y = "Predicted MPG") +
  theme_minimal()

p_rf <- ggplot(plot_df, aes(x = actual, y = rf_pred)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey50") +
  geom_point(color = "#4E79A7", size = 3, alpha = 0.8) +
  labs(title = "Random Forest Regression",
       subtitle = sprintf("RMSE = %.2f mpg", rmse(mtcars_test$mpg, pred_rf)),
       x = "Actual MPG", y = "Predicted MPG") +
  theme_minimal()

grid.arrange(p_tree, p_rf, ncol = 2,
  top = "Single Tree vs. Random Forest — mtcars Regression")

imp_reg <- importance(rf_reg)
imp_reg_df <- as.data.frame(imp_reg) %>%
  rownames_to_column("Feature") %>%
  arrange(desc(`%IncMSE`))

ggplot(imp_reg_df, aes(x = reorder(Feature, `%IncMSE`), y = `%IncMSE`, fill = `%IncMSE`)) +
  geom_col() +
  coord_flip() +
  scale_fill_viridis_c(option = "plasma", begin = 0.2, end = 0.85) +
  labs(
    title = "Feature Importance for MPG Prediction (Random Forest)",
    subtitle = "% Increase in MSE when feature is randomly permuted",
    x = NULL, y = "% Increase in MSE"
  ) +
  theme_minimal() +
  theme(legend.position = "none")


Part 8: Boosting — Sequential Ensemble Methods

Bagging and Random Forests build trees in parallel on independent bootstrap samples. Boosting takes a fundamentally different approach: trees are built sequentially, each one trying to correct the errors of the previous one [1].

8.1 AdaBoost — Adaptive Boosting

AdaBoost focuses subsequent learners on the hard examples (those the previous learner got wrong) by re-weighting training instances.

Algorithm:

  1. Initialize all instance weights: \(w_i = \frac{1}{m}\)
  2. Train a weak learner, compute weighted error \(r_j\)
  3. Compute learner weight: \(\alpha_j = \eta \log\left(\frac{1 - r_j}{r_j}\right)\)
  4. Boost misclassified instance weights: \(w_i \leftarrow w_i \cdot \exp(\alpha_j)\)
  5. Normalize weights, repeat
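The five steps can be sketched end-to-end with decision stumps on a 1-D toy problem. This is a minimal illustration with \(\eta = 1\); the stump search, noise level, and number of rounds are assumptions for the demo, not Géron's code:

```r
set.seed(42)
n <- 100
x <- runif(n)
y <- ifelse(x > 0.5, 1, -1)
flip <- sample(n, 10); y[flip] <- -y[flip]        # 10% label noise

w <- rep(1 / n, n)                                 # step 1: uniform weights
alphas <- c(); stumps <- list()
for (j in 1:20) {
  # Weak learner: exhaustive search over stump thresholds and polarities
  best <- list(err = Inf)
  for (thr in sort(unique(x))) for (pol in c(1, -1)) {
    pred <- pol * ifelse(x > thr, 1, -1)
    err  <- sum(w[pred != y])                      # step 2: weighted error r_j
    if (err < best$err) best <- list(err = err, thr = thr, pol = pol, pred = pred)
  }
  r_j   <- max(best$err, 1e-10)                    # guard against log(Inf)
  alpha <- log((1 - r_j) / r_j)                    # step 3 (eta = 1)
  w[best$pred != y] <- w[best$pred != y] * exp(alpha)  # step 4: boost errors
  w <- w / sum(w)                                  # step 5: normalize
  alphas <- c(alphas, alpha); stumps[[j]] <- best
}

# Final prediction: sign of the alpha-weighted stump votes
F_hat <- Reduce(`+`, lapply(seq_along(stumps), function(j)
  alphas[j] * stumps[[j]]$pol * ifelse(x > stumps[[j]]$thr, 1, -1)))
mean(sign(F_hat) == y)   # training accuracy of the boosted ensemble
```

The weight-evolution simulation below isolates steps 3-5 so the re-weighting dynamics are easier to see.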
# Simulate AdaBoost weight evolution
set.seed(MASTER_SEED)
n_obs <- 20
true_labels <- rep(c(1, -1), each = 10)
weights <- rep(1/n_obs, n_obs)

weight_history <- matrix(NA, nrow = 5, ncol = n_obs)
weight_history[1, ] <- weights

for (round in 1:4) {
  # Weighted sampling: higher-weight instances are more likely to be misclassified
  # (weak learner trained on weighted data still gets some hard cases wrong)
  n_wrong <- sample(2:5, 1)
  wrong <- sample(1:n_obs, n_wrong, prob = weights, replace = FALSE)
  r_j <- sum(weights[wrong])
  r_j <- max(min(r_j, 0.49), 0.01)
  alpha_j <- log((1 - r_j) / r_j)
  weights[wrong] <- weights[wrong] * exp(alpha_j)
  weights <- weights / sum(weights)
  weight_history[round + 1, ] <- weights
}

weight_df <- as.data.frame(weight_history) %>%
  mutate(round = 0:4) %>%
  pivot_longer(-round, names_to = "obs", values_to = "weight") %>%
  mutate(obs = as.integer(gsub("V", "", obs)))

ggplot(weight_df, aes(x = obs, y = weight, fill = factor(round))) +
  geom_col(position = "dodge") +
  facet_wrap(~paste("Round", round), nrow = 1) +
  scale_fill_viridis_d(option = "plasma", begin = 0.2, end = 0.85) +
  labs(
    title = "AdaBoost: Instance Weights Across Rounds",
    subtitle = "Misclassified instances get higher weights — harder examples dominate future learners",
    x = "Training Instance", y = "Weight"
  ) +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_blank())

8.2 Gradient Boosting

Gradient Boosting is more general than AdaBoost. Instead of re-weighting instances, it fits each new tree to the residual errors (pseudo-residuals) of the previous ensemble [1].

\[F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)\]

where \(h_m\) is a weak learner fit to the negative gradient of the loss function.

The learning_rate (\(\eta\)) scales each tree’s contribution — lower = more trees needed, but better generalization (shrinkage regularization).

set.seed(MASTER_SEED)

# Simulate a noisy quadratic function
n_pts <- 200
X_sim <- runif(n_pts, -3, 3)
y_sim <- X_sim^2 - 2 * X_sim + rnorm(n_pts, sd = 1.5)
sim_df <- data.frame(x = X_sim, y = y_sim)
train_sim <- sim_df[1:150, ]
test_sim  <- sim_df[151:200, ]

# Manual gradient boosting (MSE loss, residuals = negative gradient)
lr <- 0.3
trees <- list()
F_pred_train <- rep(mean(train_sim$y), nrow(train_sim))
F_pred_test  <- rep(mean(train_sim$y), nrow(test_sim))

stages <- c(1, 3, 5, 10, 20, 50)
stage_preds <- list()

for (m in 1:max(stages)) {
  residuals_m <- train_sim$y - F_pred_train
  tree_m <- rpart(residuals_m ~ x, data = train_sim,
                   control = rpart.control(maxdepth = 2, minsplit = 5))
  trees[[m]] <- tree_m
  F_pred_train <- F_pred_train + lr * predict(tree_m, train_sim)
  F_pred_test  <- F_pred_test  + lr * predict(tree_m, test_sim)
  if (m %in% stages) {
    stage_preds[[as.character(m)]] <- F_pred_test
  }
}

x_seq <- seq(-3, 3, length.out = 200)
plot_list <- lapply(names(stage_preds), function(s) {
  pred_vals <- stage_preds[[s]]
  rmse_s <- sqrt(mean((test_sim$y - pred_vals)^2))
  
  F_curve <- rep(mean(train_sim$y), 200)
  for (t_m in trees[1:as.integer(s)]) {
    F_curve <- F_curve + lr * predict(t_m, data.frame(x = x_seq))
  }
  
  ggplot(train_sim, aes(x = x, y = y)) +
    geom_point(alpha = 0.3, size = 1.5, color = "grey50") +
    geom_line(data = data.frame(x = x_seq, y = F_curve),
              aes(x = x, y = y), color = "#4E79A7", linewidth = 1.2) +
    stat_function(fun = function(x) x^2 - 2*x, color = "#E15759",
                  linetype = "dashed", linewidth = 1) +
    labs(
      title = paste0(s, ifelse(s == 1, " tree", " trees")),
      subtitle = sprintf("Test RMSE: %.2f", rmse_s),
      x = "x", y = "y"
    ) +
    coord_cartesian(ylim = c(-5, 10)) +
    theme_minimal(base_size = 9)
})

do.call(grid.arrange, c(plot_list, ncol = 3,
  top = "Gradient Boosting: Fitting Residuals Stage by Stage\n(Blue = Ensemble, Red Dashed = True Function)"))

8.3 Gradient Boosting in R with gbm

set.seed(MASTER_SEED)
iris_gbm <- iris
iris_gbm$Species_bin <- ifelse(iris$Species == "virginica", 1, 0)

gbm_fit <- gbm(
  Species_bin ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data         = iris_gbm[train_idx, ],
  distribution = "bernoulli",
  n.trees      = 200,
  interaction.depth = 3,
  shrinkage    = 0.1,
  cv.folds     = 5,
  verbose      = FALSE
)

best_iter <- gbm.perf(gbm_fit, method = "cv", plot.it = FALSE)
cat("Best number of trees (CV):", best_iter)
## Best number of trees (CV): 43
# Feature importance from GBM
imp_gbm <- summary(gbm_fit, n.trees = best_iter, plotit = FALSE)
ggplot(imp_gbm, aes(x = reorder(var, rel.inf), y = rel.inf, fill = rel.inf)) +
  geom_col() +
  coord_flip() +
  scale_fill_viridis_c(option = "plasma", begin = 0.2, end = 0.85) +
  labs(
    title = "GBM Feature Importance (virginica vs. other)",
    subtitle = "Relative influence = % reduction in loss from each feature",
    x = NULL, y = "Relative Influence (%)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

8.4 Comparing Methods

set.seed(MASTER_SEED)
# 5-fold CV comparison on iris (versicolor vs. other, binary)
iris_bin_df <- iris
iris_bin_df$target <- factor(ifelse(iris$Species == "versicolor", "yes", "no"))

folds <- cut(sample(1:150), breaks = 5, labels = FALSE)

cv_acc <- sapply(1:5, function(fold) {
  tr  <- iris_bin_df[folds != fold, ]
  te  <- iris_bin_df[folds == fold, ]

  # Single tree
  t_m <- rpart(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
               data = tr, control = rpart.control(maxdepth = 4))
  acc_tree <- mean(predict(t_m, te, type = "class") == te$target)

  # Random Forest
  rf_m <- randomForest(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                        data = tr, ntree = 200)
  acc_rf <- mean(predict(rf_m, te) == te$target)

  c(tree = acc_tree, rf = acc_rf)
})

cv_results <- data.frame(
  Model    = c("Decision Tree", "Random Forest"),
  Mean_Acc = rowMeans(cv_acc),
  SD_Acc   = apply(cv_acc, 1, sd)
)

knitr::kable(cv_results, digits = 4,
  col.names = c("Model", "Mean CV Accuracy", "SD"),
  caption = "5-Fold CV Accuracy: Single Tree vs. Random Forest")
5-Fold CV Accuracy: Single Tree vs. Random Forest

| Model | Mean CV Accuracy | SD |
|---|---|---|
| Decision Tree | 0.9267 | 0.0149 |
| Random Forest | 0.9467 | 0.0183 |

Part 9: Advantages, Disadvantages & When to Use What

9.1 Comparison Table

Ensemble Method Comparison

| Property | Decision Tree | Random Forest | Extra-Trees | Gradient Boosting |
|---|---|---|---|---|
| Training speed | Fast | Moderate | Fast ✓ | Slow (sequential) |
| Prediction speed | Fast | Moderate | Moderate | Fast |
| Parallelizable | N/A | Yes ✓ | Yes ✓ | Partial |
| Handles overfitting | Poor (needs pruning) | Good ✓ | Better ✓ | Good (shrinkage) |
| Interpretable | High ✓ | Low ✗ | Low ✗ | Low ✗ |
| Feature importance | Yes (limited) | Yes ✓ | Yes ✓ | Yes ✓ |
| Works on small data | OK | OK | OK | OK |
| Handles missing values | Some (surrogates) | Some (na.roughfix) | Some | Some |
| Hyperparameter sensitivity | Low | Low-Medium | Low | High ! |

9.2 When to Choose What

Practical Guide: When to Use Which Ensemble

| Scenario | Recommended Method |
|---|---|
| You need to explain the model to stakeholders | Single Decision Tree (or shallow RF) |
| You want the best out-of-the-box performance | Random Forest |
| Training time is limited | Extra-Trees (random thresholds = faster) |
| You have very noisy data | Random Forest (variance reduction handles noise) |
| You need to rank features | Random Forest (permutation importance is robust) |
| You want state-of-the-art on tabular data | Gradient Boosting / XGBoost (wins Kaggle) |

Part 10: Stacking (Blending)

Stacking (stacked generalization) replaces the simple voting/averaging aggregation with a trained meta-learner [1]:

  1. Split training data into K folds.
  2. Train several diverse base learners on K-1 folds; predict on the held-out fold.
  3. Collect all OOF (out-of-fold) predictions → create a meta-feature matrix.
  4. Train a meta-learner (e.g., linear model, another RF) on the meta-features.
set.seed(MASTER_SEED)
data(iris)
iris_bin2 <- iris
iris_bin2$target <- factor(ifelse(iris$Species == "virginica", 1, 0))

n <- nrow(iris_bin2)
k <- 5
fold_ids <- sample(rep(1:k, length.out = n))

meta_features <- matrix(NA, nrow = n, ncol = 2)
colnames(meta_features) <- c("tree_prob", "rf_prob")

for (f in 1:k) {
  tr <- iris_bin2[fold_ids != f, ]
  te <- iris_bin2[fold_ids == f, ]

  t_m <- rpart(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
               data = tr, control = rpart.control(maxdepth = 3))
  rf_m <- randomForest(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                        data = tr, ntree = 100)

  meta_features[fold_ids == f, "tree_prob"] <- predict(t_m, te, type = "prob")[, "1"]
  meta_features[fold_ids == f, "rf_prob"]   <- predict(rf_m, te, type = "prob")[, "1"]
}

meta_df <- as.data.frame(meta_features)
meta_df$target <- iris_bin2$target

ggplot(meta_df, aes(x = tree_prob, y = rf_prob, color = target)) +
  geom_point(size = 3, alpha = 0.8) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey60") +
  scale_color_manual(values = c("0" = "#4E79A7", "1" = "#E15759"),
                     labels = c("0" = "Other", "1" = "Virginica")) +
  labs(
    title = "Stacking: Out-of-Fold Meta-Features",
    subtitle = "Each point = one training example. Meta-learner learns from these predictions.",
    x = "Decision Tree: P(virginica)", y = "Random Forest: P(virginica)",
    color = "True Class"
  ) +
  theme_minimal()

Note: The meta-features from each base learner become the inputs to the meta-learner (blender). In practice, you’d train a simple logistic regression or another RF on meta_df to produce the final ensemble prediction.
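A minimal sketch of that final step, fitting a logistic-regression blender on the out-of-fold meta-features (assumes meta_df from the chunk above; the 0.5 cutoff is a simplifying assumption):

```r
# Meta-learner: logistic regression on the two OOF probability columns
blender <- glm(target ~ tree_prob + rf_prob, data = meta_df, family = binomial)

stack_prob <- predict(blender, meta_df, type = "response")
stack_pred <- factor(ifelse(stack_prob > 0.5, 1, 0), levels = c(0, 1))
mean(stack_pred == meta_df$target)   # accuracy of the stacked ensemble
```

Because the meta-features are out-of-fold, the blender is trained on predictions the base learners made for data they never saw, which is what keeps stacking from simply memorizing the training set.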


Knowledge Check

Question 1: Why does averaging many trees reduce variance but not bias? Answer: Each tree is an unbiased estimator of the true function (in expectation). Averaging \(B\) independent estimates gives \(\text{Var}(\bar{y}) = \text{Var}(y)/B\) — variance shrinks with more trees, but the average of unbiased estimators is still unbiased. In practice, trees are positively correlated (same data), so the variance of the average is \(\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\): the correlation term \(\rho\sigma^2\) does not shrink with \(B\), so correlation limits the reduction. Random feature subsets reduce \(\rho\), hence further reduce variance [2].
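The correlated-averaging formula is easy to verify numerically. A sketch, with equicorrelated Gaussian "tree predictions" as the simplifying assumption:

```r
set.seed(7)
B <- 25; rho <- 0.4; s2 <- 1

# Equicorrelation covariance matrix and its Cholesky factor
Sigma <- matrix(rho * s2, B, B); diag(Sigma) <- s2
U <- chol(Sigma)

# 20,000 repetitions of B correlated "tree predictions"
draws <- matrix(rnorm(20000 * B), ncol = B) %*% U

c(empirical = var(rowMeans(draws)),
  theory    = rho * s2 + (1 - rho) / B * s2)   # both close to 0.424
```

Note that even as B → ∞ the variance floor is ρσ² = 0.4, which is exactly why decorrelating the trees matters more than simply adding more of them.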


Question 2: You have trained 5 models all achieving 95% precision on the same training data. Can combining them improve results? Answer: Possibly, but only if they make different errors. If trained on the same data with similar algorithms, they will likely make the same mistakes, and voting won’t help. Diversity is key. To create useful ensembles: use different algorithms, train on different data subsets (bagging), or use different feature subsets. If all 5 are identical, the ensemble is no better than a single model [1].


Question 3: What is Out-of-Bag evaluation and why is it useful? Answer: Each bootstrap sample leaves out ~37% of training instances (OOB instances). Each tree can be evaluated on its own OOB instances — data it never saw during training. Averaging these evaluations gives an unbiased estimate of generalization error without needing a separate validation set. This means you can use 100% of your data for training while still getting a cross-validation-like estimate [1].
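The ~37% figure follows from \((1 - 1/n)^n \to e^{-1}\). A one-line check (illustrative, separate from the tutorial's seed):

```r
set.seed(1)
n <- 10000
boot <- sample(n, n, replace = TRUE)   # one bootstrap sample of size n
mean(!(1:n %in% boot))                 # fraction of instances left out ...
exp(-1)                                # ... converges to e^-1 ≈ 0.3679
```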


Question 4: What makes Extra-Trees more random than Random Forests, and when might they be preferred? Answer: Random Forests randomly select features at each node but still search for the optimal threshold for the chosen features. Extra-Trees also draw thresholds randomly from the feature’s range. This extreme randomness: (1) further decorrelates trees → lower variance, (2) eliminates expensive threshold search → much faster training. Extra-Trees may outperform RF when data is very noisy (more regularization helps) and are preferred when training speed matters [1].


Question 5: When should you use Gradient Boosting instead of Random Forest? Answer: Gradient Boosting often achieves higher predictive accuracy on structured/tabular data — it’s the go-to method for Kaggle competitions. However, it (1) trains sequentially and cannot be easily parallelized, (2) is more sensitive to hyperparameters (learning rate, n_estimators, max_depth), and (3) can overfit if not regularized. Use GBM/XGBoost when accuracy is paramount and you can afford tuning time. Use Random Forest when you need a reliable, fast, and robust baseline [2].


Question 6: Can a Random Forest overfit? How do you detect and fix it? Answer: Yes, but it’s harder to overfit than a single tree. Signs: large gap between OOB error and training accuracy. Fixes: (1) increase nodesize (larger minimum leaf → shallower trees), (2) decrease maxnodes (limit leaf count), (3) increase mtry (consider more features, more competition → less overfitting in some cases), (4) reduce ntree slightly if computation allows, (5) use OOB error curve to confirm trees have converged [1][2].

Summary

In this tutorial, we built from the instability problem of a single Decision Tree to a full ensemble framework:

| Concept | Key Idea | R Tool |
|---|---|---|
| Voting | Aggregate diverse models | Manual + randomForest |
| Bagging | Bootstrap + average | Manual loop + rpart |
| Random Forest | Bagging + random features | randomForest() |
| OOB Evaluation | Free cross-validation | rf$err.rate |
| Feature Importance | Impurity or permutation | importance(), varImpPlot() |
| Extra-Trees | Random thresholds → faster | Manual builder (Part 6) |
| Gradient Boosting | Sequential residual fitting | gbm() |
| Stacking | Meta-learner on OOF preds | Manual k-fold loop |

The Big Picture: Random Forests are one of the most reliable out-of-the-box algorithms in machine learning — they handle high dimensionality, mixed feature types, and noisy data with minimal hyperparameter tuning. Understanding them deeply (bagging, decorrelation, OOB, importance) builds the foundation for understanding gradient boosting and other advanced ensembles.


References

[1] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd ed. Sebastopol, CA: O’Reilly Media, 2019, ch. 7.

[2] B. C. Boehmke and B. M. Greenwell, Hands-On Machine Learning with R. Boca Raton, FL: CRC Press, 2020, ch. 10–11. [Online]. Available: https://bradleyboehmke.github.io/HOML/

[3] L. Breiman, “Bagging Predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.

[4] L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[5] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely Randomized Trees,” Machine Learning, vol. 63, no. 1, pp. 3–42, 2006.

[6] Y. Freund and R. E. Schapire, “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.

[7] J. H. Friedman, “Greedy Function Approximation: A Gradient Boosting Machine,” Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.

[8] A. Liaw and M. Wiener, “Classification and Regression by randomForest,” R News, vol. 2, no. 3, pp. 18–22, 2002. [Online]. Available: https://CRAN.R-project.org/package=randomForest

[9] Z. Huang, “seedhash: Deterministic Seed Generation from String Inputs Using MD5 Hashing,” R package, 2026. [Online]. Available: https://github.com/melhzy/seedhash