library(seedhash)
gen <- SeedHashGenerator$new("ANLY530-Week03-RandomForest")
MASTER_SEED <- gen$generate_seeds(1)
set.seed(MASTER_SEED)

Random Seed Source: seedhash package — input string: "ANLY530-Week03-RandomForest" → seed: 694647825
Last week, we learned that a single Decision Tree — while intuitive — is unstable: a small change in the data can produce a completely different tree. This week, we fix that by combining many trees into a forest.
“If you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.” — A. Géron [1]
“Random forests are a modification of bagged decision trees that build a large collection of de-correlated trees to further improve predictive performance.” — B. C. Boehmke & B. M. Greenwell [2]
This is the wisdom of the crowd in action. Thousands of people independently guessing the weight of an ox will, on average, be surprisingly accurate — more accurate than any single expert. The same principle makes ensembles so powerful.
Learning Objectives:
By the end of this tutorial, you will be able to:

- Explain why ensembles of diverse, barely-better-than-random classifiers can be highly accurate
- Implement hard and soft voting classifiers and compare them to their base learners
- Describe bagging, bootstrap sampling, and the Out-of-Bag (OOB) error estimate
- Train and tune Random Forests (ntree, mtry, nodesize) for classification and regression
- Interpret feature importance measures and their pitfalls
- Contrast Random Forests with Extra-Trees, AdaBoost, and Gradient Boosting
Suppose you flip a slightly biased coin that lands heads 51% of the time. Flip it 1,000 times — you’ll get roughly 510 heads (majority heads). Flip it 10,000 times — the probability of a majority of heads climbs to over 97%. This is the Law of Large Numbers in action.
The same logic applies to classifiers. If each classifier is correct 51% of the time (barely better than random), but each makes different mistakes, combining 1,000 of them can yield 75% accuracy!
The key requirement: classifiers must be as independent as possible (i.e., make different types of errors).
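We can check those coin-flip numbers directly with the binomial CDF; p_majority() below is a small helper we define for this check, using only base R:

```r
# P(majority heads) for a biased coin: more than n/2 heads out of n flips
p_majority <- function(n, p = 0.51) 1 - pbinom(floor(n / 2), n, p)

p_majority(1000)    # about 0.73 (close to the ~75% quoted above)
p_majority(10000)   # about 0.98, i.e. "over 97%"
```

Note that for even n, a tie (exactly n/2 heads) does not count as a majority, hence the floor(n / 2) cutoff.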
simulate_ensemble_accuracy <- function(n_classifiers, base_acc) {
# P(majority correct) for n independent classifiers each with P(correct) = base_acc
# Using binomial distribution: P(k > n/2 correct out of n)
k_needed <- floor(n_classifiers / 2) + 1
sum(dbinom(k_needed:n_classifiers, n_classifiers, base_acc))
}
classifiers <- c(1, 3, 5, 11, 21, 51, 101, 501, 1001)
results <- data.frame(
n = classifiers,
acc_55 = sapply(classifiers, simulate_ensemble_accuracy, base_acc = 0.55),
acc_60 = sapply(classifiers, simulate_ensemble_accuracy, base_acc = 0.60),
acc_70 = sapply(classifiers, simulate_ensemble_accuracy, base_acc = 0.70)
) %>%
pivot_longer(-n, names_to = "base", values_to = "ensemble_acc") %>%
mutate(base = recode(base,
acc_55 = "55% base accuracy",
acc_60 = "60% base accuracy",
acc_70 = "70% base accuracy"
))
ggplot(results, aes(x = n, y = ensemble_acc, color = base)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2) +
geom_hline(yintercept = 0.9, linetype = "dashed", color = "grey40") +
scale_x_log10(breaks = classifiers, labels = as.character(classifiers)) +
scale_color_viridis_d(option = "plasma", begin = 0.2, end = 0.85) +
labs(
title = "Ensemble Accuracy vs. Number of Independent Classifiers",
subtitle = "The more classifiers, the closer to certainty — given they make independent errors",
x = "Number of Classifiers (log scale)",
y = "Ensemble Accuracy",
color = "Base Classifier",
caption = "Dashed line = 90% accuracy threshold"
) +
theme_minimal() +
theme(legend.position = "bottom")

Key Insight: Even a weak 55% classifier becomes formidable in numbers — a few hundred independent peers push the ensemble past 90% accuracy, and by 501 classifiers it sits near 99%. The diversity of errors — not the strength of any individual classifier — is the key.
There are two ways to aggregate predictions in a voting classifier:
| Method | Mechanism | Requires |
|---|---|---|
| Hard Voting | Take the most frequent predicted class (majority vote) | Only class labels |
| Soft Voting | Average predicted probabilities, pick highest | Class probabilities from each model (type = "prob" in R; predict_proba() in scikit-learn) |
Soft voting often beats hard voting because it gives more weight to high-confidence predictions. A classifier that says “99% Class A” counts for more than one that says “51% Class A”.
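A tiny toy example (hypothetical probabilities, base R only) shows the two schemes disagreeing:

```r
# Three classifiers' predicted probabilities for classes A and B
probs <- rbind(
  c(A = 0.51, B = 0.49),  # weakly favors A
  c(A = 0.51, B = 0.49),  # weakly favors A
  c(A = 0.01, B = 0.99)   # very confident in B
)

# Hard voting: each classifier casts one vote for its argmax class
hard_votes <- colnames(probs)[apply(probs, 1, which.max)]
names(which.max(table(hard_votes)))   # "A" (two weak votes out-shout one)

# Soft voting: average the probabilities, then take the argmax
avg <- colMeans(probs)
names(which.max(avg))                 # "B" (the confident classifier wins)
```

Here hard voting picks A (2 votes to 1), while soft voting picks B because the averaged probability for B is (0.49 + 0.49 + 0.99) / 3 ≈ 0.66.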
Let’s simulate this on the Iris dataset using three diverse base learners:
data(iris)
# Use sepal features only — deliberately harder (versicolor/virginica overlap here)
# Petal features are too perfectly separable → all classifiers agree on everything → no
# visible difference between individual and ensemble, which defeats the tutorial's point.
iris_vote <- iris[, c("Sepal.Length", "Sepal.Width", "Species")]
# Train/test split (70/30)
set.seed(MASTER_SEED)
idx <- sample(1:nrow(iris_vote), size = floor(0.7 * nrow(iris_vote)))
train <- iris_vote[idx, ]
test <- iris_vote[-idx, ]
# --- Three diverse base learners ---
# 1. Shallow Decision Tree (high bias, low variance)
tree_model <- rpart(Species ~ ., data = train, control = rpart.control(maxdepth = 2))
# 2. Random Forest (low bias, low variance — strong learner)
rf_model <- randomForest(Species ~ ., data = train, ntree = 200)
# 3. LDA (Linear Discriminant Analysis) — linear boundary, different error pattern
lda_model <- MASS::lda(Species ~ ., data = train)
# --- Individual predictions ---
pred_tree <- predict(tree_model, test, type = "class")
pred_rf <- predict(rf_model, test, type = "class")
pred_lda <- predict(lda_model, test)$class
# --- Hard Voting: true majority vote across 3 classifiers ---
hard_vote_mat <- data.frame(
tree = as.character(pred_tree),
rf = as.character(pred_rf),
lda = as.character(pred_lda)
)
hard_vote <- apply(hard_vote_mat, 1, function(row) names(which.max(table(row))))
hard_vote <- factor(hard_vote, levels = levels(test$Species))
# --- Soft Voting: average class probabilities across 3 classifiers ---
prob_tree <- predict(tree_model, test, type = "prob")
prob_rf <- predict(rf_model, test, type = "prob")
prob_lda <- predict(lda_model, test)$posterior
avg_prob <- (prob_tree + prob_rf + prob_lda) / 3
soft_vote <- factor(colnames(avg_prob)[apply(avg_prob, 1, which.max)],
levels = levels(test$Species))
# --- Accuracy ---
acc_tree <- mean(pred_tree == test$Species)
acc_rf <- mean(pred_rf == test$Species)
acc_lda <- mean(pred_lda == test$Species)
acc_hard <- mean(hard_vote == test$Species)
acc_soft <- mean(soft_vote == test$Species)
acc_df <- data.frame(
Model = c("Decision Tree", "LDA", "Random Forest", "Hard Vote (3-way)", "Soft Vote (3-way)"),
Accuracy = c(acc_tree, acc_lda, acc_rf, acc_hard, acc_soft)
)
ggplot(acc_df, aes(x = reorder(Model, Accuracy), y = Accuracy, fill = Model)) +
geom_col(width = 0.6) +
geom_text(aes(label = scales::percent(Accuracy, accuracy = 0.1)),
hjust = -0.1, fontface = "bold") +
coord_flip(ylim = c(0.4, 1.05)) +
scale_fill_viridis_d(option = "plasma", begin = 0.2, end = 0.85) +
labs(
title = "Classifier Accuracy on Iris Test Set (Sepal Features Only)",
subtitle = "Voting classifiers combine the strengths of individual models — sepal features make this a harder task",
x = NULL, y = "Accuracy"
) +
theme_minimal() +
theme(legend.position = "none")

Instead of training different algorithms on the same data, bagging (Bootstrap AGGregating) trains the same algorithm on different random bootstrapped subsets of the training data.
When bootstrapping a dataset of size \(m\) with replacement, the probability that any single instance is not selected in one draw is \(\frac{m-1}{m}\). After \(m\) draws:
\[P(\text{not selected in any draw}) = \left(1 - \frac{1}{m}\right)^m \xrightarrow{m \to \infty} e^{-1} \approx 0.368\]
So each bootstrap sample leaves out about 37% of instances — these are the Out-of-Bag (OOB) instances.
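A quick numeric check of the limit, in base R:

```r
m <- c(10, 100, 1000, 10000)
p_never_selected <- (1 - 1 / m)^m   # P(an instance is never drawn in m draws)
round(p_never_selected, 4)          # 0.3487 0.3660 0.3677 0.3679 -> exp(-1)
```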
The heatmap below shows a small example (20 training instances, 10 trees). Each cell is colored blue if instance \(i\) is in the bootstrap sample for tree \(j\), and orange if it is OOB (held-out for that tree).
set.seed(MASTER_SEED)
n_inst <- 20 # training instances
n_tr <- 10 # trees to show
# Build in-bag (1) vs OOB (0) matrix
inbag_mat <- matrix(0L, nrow = n_inst, ncol = n_tr)
for (j in 1:n_tr) {
selected <- sample(1:n_inst, n_inst, replace = TRUE)
inbag_mat[unique(selected), j] <- 1L
}
# Compute each instance's OOB count
oob_counts <- rowSums(inbag_mat == 0)
inbag_df <- as.data.frame(inbag_mat) %>%
mutate(Instance = 1:n_inst) %>%
pivot_longer(-Instance, names_to = "Tree", values_to = "Status") %>%
mutate(
Tree = factor(as.integer(gsub("V", "", Tree))),
Status = factor(Status, levels = c(0, 1), labels = c("OOB (held-out)", "In-Bag (training)"))
)
ggplot(inbag_df, aes(x = Tree, y = factor(Instance), fill = Status)) +
geom_tile(color = "white", linewidth = 0.5) +
scale_fill_manual(values = c("OOB (held-out)" = "#F28E2B", "In-Bag (training)" = "#4E79A7")) +
labs(
title = "Bootstrap Sample Composition Across Trees",
subtitle = sprintf("N=%d instances, %d trees — orange cells = OOB (never seen by that tree)",
n_inst, n_tr),
x = "Tree Index", y = "Training Instance",
fill = NULL
) +
theme_minimal() +
theme(legend.position = "top",
axis.text.y = element_text(size = 8),
panel.grid = element_blank())Key observations: - No instance is OOB for all trees — every instance is seen by most trees. - On average, each tree sees ~63% of instances (blue) and skips ~37% (orange). - Each tree’s OOB instances form its private validation set. Averaging OOB predictions across all trees gives the OOB error estimate — free cross-validation with no extra data!
m_values <- seq(2, 500, by = 5)
oob_frac <- (1 - 1 / m_values)^m_values
ggplot(data.frame(m = m_values, oob = oob_frac), aes(x = m, y = oob)) +
geom_line(color = "#4E79A7", linewidth = 1.2) +
geom_hline(yintercept = exp(-1), linetype = "dashed", color = "#E15759") +
annotate("text", x = 400, y = exp(-1) + 0.015,
label = sprintf("e⁻¹ ≈ %.3f", exp(-1)),
color = "#E15759", fontface = "bold") +
labs(
title = "Fraction of OOB Instances vs. Training Set Size",
subtitle = "Converges quickly to 36.8% — a free validation set!",
x = "Training Set Size (m)", y = "OOB Fraction"
) +
theme_minimal()

set.seed(MASTER_SEED)
n_trees <- 200
# Binary classification: versicolor vs virginica — deliberately hard (real sepal overlap)
# Setosa vs other is nearly perfectly separable → all bootstrap trees identical → flat line
iris_bin <- iris %>%
filter(Species %in% c("versicolor", "virginica")) %>%
mutate(label = droplevels(Species)) %>%
select(Sepal.Length, Sepal.Width, label)
idx_tr <- sample(1:nrow(iris_bin), 70)
train_bin <- iris_bin[idx_tr, ]
test_bin <- iris_bin[-idx_tr, ]
# Train bagged ensemble manually
bag_preds <- matrix(NA, nrow = nrow(test_bin), ncol = n_trees)
for (i in 1:n_trees) {
boot_idx <- sample(1:nrow(train_bin), nrow(train_bin), replace = TRUE)
boot_data <- train_bin[boot_idx, ]
tree_i <- rpart(label ~ ., data = boot_data, control = rpart.control(maxdepth = 4))
bag_preds[, i] <- as.character(predict(tree_i, test_bin, type = "class"))
}
# Hard voting: majority class at each k trees
ensemble_acc <- sapply(1:n_trees, function(k) {
votes <- bag_preds[, 1:k, drop = FALSE]
final <- apply(votes, 1, function(row) {
tbl <- table(row)
names(which.max(tbl))
})
mean(final == as.character(test_bin$label))
})
single_tree <- rpart(label ~ ., data = train_bin, control = rpart.control(maxdepth = 4))
single_acc <- mean(predict(single_tree, test_bin, type = "class") == test_bin$label)
ggplot(data.frame(n = 1:n_trees, acc = ensemble_acc), aes(x = n, y = acc)) +
geom_line(color = "#4E79A7", linewidth = 1) +
geom_hline(yintercept = single_acc, linetype = "dashed", color = "#E15759", linewidth = 1) +
annotate("text", x = 155, y = single_acc - 0.02,
label = sprintf("Single Tree: %.1f%%", single_acc * 100),
color = "#E15759", fontface = "bold") +
labs(
title = "Bagging: Ensemble Accuracy as Trees Are Added",
subtitle = "Versicolor vs Virginica (genuinely overlapping classes) — bagging reduces variance and beats the single tree",
x = "Number of Trees in Ensemble", y = "Test Accuracy"
) +
coord_cartesian(ylim = c(0.6, 1.0)) +
theme_minimal()

One of Random Forests’ most useful properties is automatic feature ranking. Impurity-based importance (described by Géron for Scikit-Learn [1], and reported by R’s randomForest as MeanDecreaseGini) measures how much each feature reduces node impurity (Gini or MSE), weighted by the proportion of training samples passing through that node, averaged across all trees.
\[\text{Importance}(X_j) = \frac{1}{B} \sum_{b=1}^{B} \sum_{t \in \mathcal{T}_b} \frac{n_t}{N} \cdot \Delta \text{impurity}(t, X_j)\]
where \(B\) = number of trees, \(\mathcal{T}_b\) = set of nodes in tree \(b\) that split on \(X_j\), \(n_t\) = samples at node \(t\), \(N\) = total training samples.
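To make the formula concrete, here is the contribution of a single hypothetical node (all counts invented for illustration): a node reached by 100 of 150 training samples splits a 50/50 class mix into two nearly pure children.

```r
# Gini impurity from class counts
gini <- function(counts) 1 - sum((counts / sum(counts))^2)

N   <- 150                    # total training samples
n_t <- 100                    # samples reaching this node
parent  <- gini(c(50, 50))    # 0.50
child_l <- gini(c(45, 5))     # 0.18
child_r <- gini(c(5, 45))     # 0.18

# Impurity decrease at the node, weighted by the fraction of samples it sees
delta   <- parent - (50 / 100) * child_l - (50 / 100) * child_r   # 0.32
contrib <- (n_t / N) * delta                                      # ~0.213
```

Summing such contributions over every node that splits on \(X_j\), in every tree, gives the importance score in the formula above.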
In R’s randomForest, importance is available in two flavors:

- MeanDecreaseAccuracy: how much accuracy drops when the feature is randomly permuted (OOB-based)
- MeanDecreaseGini: total decrease in node impurity summed across trees
# rf_iris: a Random Forest trained with importance = TRUE (fit earlier in the tutorial)
imp <- importance(rf_iris)
imp_df <- as.data.frame(imp) %>%
rownames_to_column("Feature") %>%
arrange(desc(MeanDecreaseGini))
p1 <- ggplot(imp_df, aes(x = reorder(Feature, MeanDecreaseGini),
y = MeanDecreaseGini, fill = MeanDecreaseGini)) +
geom_col() +
coord_flip() +
scale_fill_viridis_c(option = "plasma", begin = 0.2, end = 0.85) +
labs(title = "Feature Importance\n(Mean Decrease Gini)", x = NULL) +
theme_minimal() +
theme(legend.position = "none")
p2 <- ggplot(imp_df, aes(x = reorder(Feature, MeanDecreaseAccuracy),
y = MeanDecreaseAccuracy, fill = MeanDecreaseAccuracy)) +
geom_col() +
coord_flip() +
scale_fill_viridis_c(option = "plasma", begin = 0.2, end = 0.85) +
labs(title = "Feature Importance\n(Mean Decrease Accuracy)", x = NULL) +
theme_minimal() +
theme(legend.position = "none")
grid.arrange(p1, p2, ncol = 2,
top = "Random Forest Feature Importance — Iris Dataset")

Interpretation: Petal features dominate. This matches the Decision Tree results from Week 02 — Petal.Length and Petal.Width are the most discriminative features for classifying Iris species.
Caution: Gini importance can be biased toward features with many unique values (high cardinality). Permutation importance (MeanDecreaseAccuracy) is more reliable but slower to compute.
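The cardinality bias is easy to see even without a forest: with more distinct values, a pure-noise feature offers more candidate thresholds, so its best split looks spuriously good. A base-R sketch (the helper name best_split_gini is ours):

```r
set.seed(1)
# Best (lowest) weighted Gini achievable over all thresholds of x, for labels y
best_split_gini <- function(x, y) {
  g <- function(v) 1 - sum((table(v) / length(v))^2)
  best <- g(y)
  for (t in unique(x)) {
    l <- y[x <= t]; r <- y[x > t]
    if (length(l) == 0 || length(r) == 0) next
    w <- (length(l) * g(l) + length(r) * g(r)) / length(y)
    best <- min(best, w)
  }
  best
}

n <- 50
res <- replicate(200, {
  y <- sample(c("A", "B"), n, replace = TRUE)      # labels are pure noise
  c(cont = best_split_gini(runif(n), y),           # ~50 distinct values
    bin  = best_split_gini(rbinom(n, 1, 0.5), y))  # 2 distinct values
})
rowMeans(res)  # the high-cardinality noise feature "reduces" impurity more
```

Neither feature carries any signal, yet the continuous one consistently achieves a lower post-split Gini — exactly the bias the caution above describes.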
Beyond how important features are, we can ask: in which trees is each feature actually used for splitting? This heatmap shows how many times each feature is used across all splits in each tree — revealing how the random subset mechanism distributes feature usage across the forest.
set.seed(MASTER_SEED)
n_show_trees <- 60
# Feature usage per tree: count how many splits use each feature
feature_usage <- lapply(1:n_show_trees, function(k) {
tree_k <- getTree(rf_iris, k = k, labelVar = TRUE)
splits <- tree_k[tree_k$status == 1, "split var"]
feats <- names(iris)[-5]
counts <- sapply(feats, function(f) sum(splits == f))
data.frame(Tree = k, Feature = feats, Count = counts)
}) %>% do.call(rbind, .)
ggplot(feature_usage, aes(x = Tree, y = Feature, fill = Count)) +
geom_tile(color = "white", linewidth = 0.3) +
scale_fill_gradient(low = "white", high = "#4E79A7",
name = "Split\nCount") +
scale_x_continuous(breaks = seq(0, n_show_trees, 10)) +
labs(
title = sprintf("Feature Usage Heatmap: First %d Trees of the Random Forest", n_show_trees),
subtitle = "Darker = more splits using that feature in a tree. Random subsets → no feature dominates every tree.",
x = "Tree Index", y = NULL
) +
theme_minimal() +
theme(panel.grid = element_blank())

Reading the heatmap: Each cell shows how many splits in tree \(j\) used feature \(i\). Dark cells in a column mean that tree relied heavily on a particular feature. The variation across columns shows that random feature subsetting achieves its goal — different trees learn from different aspects of the data.
One of the most fundamental results in statistical learning theory is the Bias-Variance tradeoff:
\[\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}\]
A single deep Decision Tree has low bias (it fits training data well) but high variance (it changes wildly with different training sets). Random Forests dramatically reduce variance while keeping bias roughly constant. We can demonstrate this empirically:
set.seed(MASTER_SEED)
# Simulate a regression task: y = sin(2*pi*x) + noise
n_train <- 80
n_test <- 200
n_repeats <- 50 # number of different training sets
x_test <- seq(0, 1, length.out = n_test)
y_true <- sin(2 * pi * x_test)
# Storage for predictions from each model across repetitions
preds_tree <- matrix(NA, nrow = n_test, ncol = n_repeats)
preds_rf3 <- matrix(NA, nrow = n_test, ncol = n_repeats)
preds_rf50 <- matrix(NA, nrow = n_test, ncol = n_repeats)
for (r in 1:n_repeats) {
x_tr <- runif(n_train)
y_tr <- sin(2 * pi * x_tr) + rnorm(n_train, sd = 0.3)
df_tr <- data.frame(x = x_tr, y = y_tr)
df_te <- data.frame(x = x_test)
# Deep single tree
t_m <- rpart(y ~ x, data = df_tr,
control = rpart.control(maxdepth = 8, cp = 0, minsplit = 2))
preds_tree[, r] <- predict(t_m, df_te)
# RF with 3 trees (small — still high variance)
rf3 <- randomForest(y ~ x, data = df_tr, ntree = 3)
preds_rf3[, r] <- predict(rf3, df_te)
# RF with 50 trees (converged)
rf50 <- randomForest(y ~ x, data = df_tr, ntree = 50)
preds_rf50[, r] <- predict(rf50, df_te)
}
# Compute bias^2 and variance for each model
bv_stats <- function(pred_mat, y_true) {
mean_pred <- rowMeans(pred_mat)
bias2 <- mean((mean_pred - y_true)^2)
variance <- mean(apply(pred_mat, 1, var))
list(bias2 = bias2, variance = variance, total = bias2 + variance)
}
st_tree <- bv_stats(preds_tree, y_true)
st_rf3 <- bv_stats(preds_rf3, y_true)
st_rf50 <- bv_stats(preds_rf50, y_true)
# Plot: mean prediction ± 1 SD ribbon for each model
make_ribbon_df <- function(pred_mat, label) {
data.frame(
x = x_test,
mean = rowMeans(pred_mat),
sd = apply(pred_mat, 1, sd),
model = label
)
}
ribbon_df <- bind_rows(
make_ribbon_df(preds_tree, "Deep Single Tree"),
make_ribbon_df(preds_rf3, "RF (3 trees)"),
make_ribbon_df(preds_rf50, "RF (50 trees)")
) %>%
mutate(model = factor(model, levels = c("Deep Single Tree", "RF (3 trees)", "RF (50 trees)")))
cols_bv <- c("Deep Single Tree" = "#E15759", "RF (3 trees)" = "#F28E2B", "RF (50 trees)" = "#4E79A7")
p_ribbon <- ggplot(ribbon_df, aes(x = x, color = model, fill = model)) +
geom_ribbon(aes(ymin = mean - sd, ymax = mean + sd), alpha = 0.15, color = NA) +
geom_line(aes(y = mean), linewidth = 1.1) +
geom_line(data = data.frame(x = x_test, y = y_true),
aes(x = x, y = y), color = "black", linetype = "dashed",
linewidth = 1, inherit.aes = FALSE) +
scale_color_manual(values = cols_bv) +
scale_fill_manual(values = cols_bv) +
labs(
title = "Bias-Variance Tradeoff: Mean Prediction ± 1 SD over 50 Training Sets",
subtitle = "Black dashed = true function; Ribbon = variance across 50 different training datasets",
x = "x", y = "y", color = NULL, fill = NULL
) +
theme_minimal() +
theme(legend.position = "bottom")
# Bar chart of bias^2 vs variance
bv_df <- data.frame(
Model = rep(c("Deep Single Tree", "RF (3 trees)", "RF (50 trees)"), each = 2),
Component = rep(c("Bias²", "Variance"), 3),
Value = c(st_tree$bias2, st_tree$variance,
st_rf3$bias2, st_rf3$variance,
st_rf50$bias2, st_rf50$variance)
) %>%
mutate(Model = factor(Model, levels = c("Deep Single Tree", "RF (3 trees)", "RF (50 trees)")))
p_bar <- ggplot(bv_df, aes(x = Model, y = Value, fill = Component)) +
geom_col(position = "stack", width = 0.55) +
geom_text(aes(label = sprintf("%.4f", Value)),
position = position_stack(vjust = 0.5),
color = "white", fontface = "bold", size = 3.5) +
scale_fill_manual(values = c("Bias²" = "#E15759", "Variance" = "#4E79A7")) +
labs(
title = "Bias² + Variance Decomposition",
x = NULL, y = "Expected Error", fill = NULL
) +
theme_minimal() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 15, hjust = 1))
grid.arrange(p_ribbon, p_bar, ncol = 2,
top = sprintf(
"Empirical Bias-Variance: Deep Tree (total=%.4f) | RF-3 (total=%.4f) | RF-50 (total=%.4f)",
st_tree$total, st_rf3$total, st_rf50$total))

The takeaway:

- A deep single tree shows high variance (wide ribbon — each new training set gives very different predictions) with moderate bias.
- RF with 3 trees reduces variance significantly but is still noisy.
- RF with 50 trees has the lowest variance — the ribbon nearly collapses to the mean, and that mean closely tracks the true function. This is the power of ensemble averaging.
| Parameter | Controls | Default | Tuning Tip |
|---|---|---|---|
| ntree | Number of trees in the forest | 500 | Plot OOB curve; stop when it flattens |
| mtry | Number of features sampled at each split | √p (classif.) / p/3 (regr.) | Try √p, √p/2, 2√p; tune via OOB |
| nodesize | Minimum size of terminal (leaf) nodes | 1 (classif.) / 5 (regr.) | Larger → smaller trees (regularization) |
| maxnodes | Maximum number of leaf nodes per tree | Unlimited | Equivalent to max_depth control |
| sampsize | Size of bootstrap sample | N (with replacement) | Smaller → more variance reduction (more diversity) |
mtry

set.seed(MASTER_SEED)
mtry_vals <- 1:4
oob_by_mtry <- sapply(mtry_vals, function(m) {
rf_m <- randomForest(Species ~ ., data = iris_train, ntree = 300, mtry = m)
rf_m$err.rate[300, "OOB"]
})
ggplot(data.frame(mtry = mtry_vals, oob = oob_by_mtry), aes(x = mtry, y = oob)) +
geom_line(color = "#4E79A7", linewidth = 1.2) +
geom_point(size = 3, color = "#E15759") +
geom_text(aes(label = sprintf("%.3f", oob)), vjust = -0.8, fontface = "bold") +
labs(
title = "OOB Error Rate vs. mtry",
subtitle = "Find the sweet spot — usually near √p",
x = "mtry (features per split)", y = "OOB Error Rate"
) +
scale_x_continuous(breaks = mtry_vals) +
theme_minimal()

best_mtry <- mtry_vals[which.min(oob_by_mtry)]
cat("Best mtry:", best_mtry, "| OOB error:", round(min(oob_by_mtry), 4))

## Best mtry: 2 | OOB error: 0.0286
set.seed(MASTER_SEED)
nodesizes <- c(1, 3, 5, 10, 20)
results_ns <- lapply(nodesizes, function(ns) {
rf_ns <- randomForest(Species ~ ., data = iris_train,
ntree = 300, nodesize = ns, importance = FALSE)
data.frame(
nodesize = ns,
oob_err = rf_ns$err.rate[300, "OOB"],
test_acc = mean(predict(rf_ns, iris_test) == iris_test$Species)
)
})
results_ns <- do.call(rbind, results_ns)
ggplot(results_ns %>%
pivot_longer(c(oob_err, test_acc), names_to = "metric", values_to = "value") %>%
mutate(metric = recode(metric, oob_err = "OOB Error Rate",
test_acc = "Test Accuracy")),
aes(x = nodesize, y = value, color = metric)) +
geom_line(linewidth = 1.2) +
geom_point(size = 3) +
scale_color_manual(values = c("OOB Error Rate" = "#E15759", "Test Accuracy" = "#4E79A7")) +
labs(
title = "Effect of nodesize on OOB Error and Test Accuracy",
subtitle = "Larger nodesize = shallower, more regularized trees",
x = "nodesize", y = "Value", color = NULL
) +
scale_x_continuous(breaks = nodesizes) +
theme_minimal() +
theme(legend.position = "bottom")

Extremely Randomized Trees (Extra-Trees) go one step further than Random Forests [1]:
| Property | Random Forest | Extra-Trees |
|---|---|---|
| Feature subset per split | Random subset | Random subset |
| Split threshold | Optimal (searched exhaustively) | Random (drawn uniformly) |
| Training speed | Slower (best threshold search) | Faster |
| Bias | Lower | Slightly Higher |
| Variance | Higher | Lower |
The random thresholds act as an additional regularizer, often preventing overfitting even better than Random Forests — especially on noisy datasets.
R’s rpart does not have a built-in splitter = "random" option (unlike scikit-learn). We can simulate Extra-Trees by manually building trees with random feature subsets and random thresholds at each split, using a simple recursive partitioning function.
set.seed(MASTER_SEED)
n_et <- 300
# Manual Extra-Tree builder: random feature + random threshold at each split
build_extra_tree <- function(data, target, features, max_depth = 6, min_size = 5, depth = 0) {
labels <- data[[target]]
if (depth >= max_depth || length(unique(labels)) == 1 || nrow(data) < min_size) {
return(list(type = "leaf", class = names(which.max(table(labels)))))
}
# Random feature subset (mtry = sqrt(p))
mtry <- max(1, floor(sqrt(length(features))))
feat_sub <- sample(features, mtry)
best_gini <- Inf; best_feat <- NULL; best_thresh <- NULL
for (f in feat_sub) {
vals <- data[[f]]
# RANDOM threshold from the feature's range (the Extra-Trees key difference)
thresh <- runif(1, min(vals), max(vals))
left_idx <- which(vals <= thresh)
right_idx <- which(vals > thresh)
if (length(left_idx) < 2 || length(right_idx) < 2) next
gini_l <- 1 - sum((table(labels[left_idx]) / length(left_idx))^2)
gini_r <- 1 - sum((table(labels[right_idx]) / length(right_idx))^2)
gini <- (length(left_idx) * gini_l + length(right_idx) * gini_r) / nrow(data)
if (gini < best_gini) {
best_gini <- gini; best_feat <- f; best_thresh <- thresh
}
}
if (is.null(best_feat)) {
return(list(type = "leaf", class = names(which.max(table(labels)))))
}
left_data <- data[data[[best_feat]] <= best_thresh, ]
right_data <- data[data[[best_feat]] > best_thresh, ]
list(type = "split", feature = best_feat, threshold = best_thresh,
left = build_extra_tree(left_data, target, features, max_depth, min_size, depth + 1),
right = build_extra_tree(right_data, target, features, max_depth, min_size, depth + 1))
}
predict_extra_tree <- function(tree, newdata) {
sapply(1:nrow(newdata), function(i) {
node <- tree
while (node$type == "split") {
node <- if (newdata[[node$feature]][i] <= node$threshold) node$left else node$right
}
node$class
})
}
features <- names(iris_train)[1:4]
et_preds <- matrix(NA, nrow = nrow(iris_test), ncol = n_et)
for (i in 1:n_et) {
boot_i <- sample(1:nrow(iris_train), nrow(iris_train), replace = TRUE)
et_tree <- build_extra_tree(iris_train[boot_i, ], "Species", features)
et_preds[, i] <- predict_extra_tree(et_tree, iris_test)
}
et_acc_by_n <- sapply(1:n_et, function(k) {
votes <- et_preds[, 1:k, drop = FALSE]
final <- apply(votes, 1, function(row) names(which.max(table(row))))
mean(final == as.character(iris_test$Species))
})
# RF baseline: predict with first k trees from rf_iris (must use predict.all)
all_rf_test_preds <- predict(rf_iris, iris_test, predict.all = TRUE)$individual
n_rf_trees <- ncol(all_rf_test_preds)
rf_acc_by_n <- sapply(1:n_et, function(k) {
k_use <- min(k, n_rf_trees)
votes <- all_rf_test_preds[, 1:k_use, drop = FALSE]
final <- apply(votes, 1, function(row) names(which.max(table(row))))
mean(final == as.character(iris_test$Species))
})
ggplot(data.frame(n = 1:n_et,
ExtraTrees = et_acc_by_n,
RandomForest = rf_acc_by_n) %>%
pivot_longer(-n, names_to = "model", values_to = "acc"),
aes(x = n, y = acc, color = model)) +
geom_line(linewidth = 1, alpha = 0.8) +
scale_color_manual(values = c(ExtraTrees = "#F28E2B", RandomForest = "#4E79A7")) +
labs(
title = "Extra-Trees vs. Random Forest: Test Accuracy by Number of Trees",
subtitle = "Extra-Trees train faster but require more trees to converge",
x = "Number of Trees", y = "Test Accuracy", color = NULL
) +
coord_cartesian(ylim = c(0.85, 1.0)) +
theme_minimal() +
theme(legend.position = "bottom")

Random Forests work equally well for regression tasks. Instead of voting on class labels, trees average their numerical predictions. Node impurity is measured by MSE instead of Gini/Entropy.
\[\hat{y}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)\]
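Why does averaging help? If tree predictions were independent with variance \(\sigma^2\), the mean of \(B\) of them would have variance \(\sigma^2 / B\). A sketch with synthetic noise standing in for trees (real trees are correlated, so the actual reduction is smaller — the random feature subsets exist precisely to reduce that correlation):

```r
set.seed(1)
truth <- 5; sigma <- 2; B <- 50
# One "tree" = truth + noise; a "forest" = the mean of B such trees
single <- replicate(5000, truth + rnorm(1, sd = sigma))
forest <- replicate(5000, mean(truth + rnorm(B, sd = sigma)))
c(var(single), var(forest))   # roughly sigma^2 vs sigma^2 / B
```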
data(mtcars)
set.seed(MASTER_SEED)
# Predict mpg from car attributes
tr_idx <- sample(1:nrow(mtcars), 22)
mtcars_train <- mtcars[tr_idx, ]
mtcars_test <- mtcars[-tr_idx, ]
# Compare: single tree vs random forest
single_reg <- rpart(mpg ~ ., data = mtcars_train, control = rpart.control(maxdepth = 3))
rf_reg <- randomForest(mpg ~ ., data = mtcars_train, ntree = 500, importance = TRUE)
pred_single <- predict(single_reg, mtcars_test)
pred_rf <- predict(rf_reg, mtcars_test)
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))
cat(sprintf("Single Tree RMSE : %.3f mpg\n", rmse(mtcars_test$mpg, pred_single)))
cat(sprintf("Random Forest RMSE: %.3f mpg\n", rmse(mtcars_test$mpg, pred_rf)))

## Single Tree RMSE : 3.755 mpg
## Random Forest RMSE: 2.297 mpg
# Side-by-side predicted vs actual
plot_df <- data.frame(
actual = mtcars_test$mpg,
tree_pred = pred_single,
rf_pred = pred_rf
)
p_tree <- ggplot(plot_df, aes(x = actual, y = tree_pred)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey50") +
geom_point(color = "#E15759", size = 3, alpha = 0.8) +
labs(title = "Single Regression Tree",
subtitle = sprintf("RMSE = %.2f mpg", rmse(mtcars_test$mpg, pred_single)),
x = "Actual MPG", y = "Predicted MPG") +
theme_minimal()
p_rf <- ggplot(plot_df, aes(x = actual, y = rf_pred)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey50") +
geom_point(color = "#4E79A7", size = 3, alpha = 0.8) +
labs(title = "Random Forest Regression",
subtitle = sprintf("RMSE = %.2f mpg", rmse(mtcars_test$mpg, pred_rf)),
x = "Actual MPG", y = "Predicted MPG") +
theme_minimal()
grid.arrange(p_tree, p_rf, ncol = 2,
top = "Single Tree vs. Random Forest — mtcars Regression")

imp_reg <- importance(rf_reg)
imp_reg_df <- as.data.frame(imp_reg) %>%
rownames_to_column("Feature") %>%
arrange(desc(`%IncMSE`))
ggplot(imp_reg_df, aes(x = reorder(Feature, `%IncMSE`), y = `%IncMSE`, fill = `%IncMSE`)) +
geom_col() +
coord_flip() +
scale_fill_viridis_c(option = "plasma", begin = 0.2, end = 0.85) +
labs(
title = "Feature Importance for MPG Prediction (Random Forest)",
subtitle = "% Increase in MSE when feature is randomly permuted",
x = NULL, y = "% Increase in MSE"
) +
theme_minimal() +
theme(legend.position = "none")

Bagging and Random Forests build trees in parallel on independent bootstrap samples. Boosting takes a fundamentally different approach: trees are built sequentially, each one trying to correct the errors of the previous one [1].
AdaBoost focuses subsequent learners on the hard examples (those the previous learner got wrong) by re-weighting training instances.
Algorithm:

1. Initialize every instance weight to \(w_i = 1/m\).
2. Train a weak learner and compute its weighted error rate \(r_j = \sum_{\hat{y}_i \ne y_i} w_i\).
3. Compute the predictor weight \(\alpha_j = \log\frac{1 - r_j}{r_j}\) (more accurate learners get more say).
4. Multiply the weights of misclassified instances by \(e^{\alpha_j}\), then renormalize so the weights sum to 1.
5. Repeat; the final prediction is the \(\alpha_j\)-weighted vote of all learners.
# Simulate AdaBoost weight evolution
set.seed(MASTER_SEED)
n_obs <- 20
true_labels <- rep(c(1, -1), each = 10)
weights <- rep(1/n_obs, n_obs)
weight_history <- matrix(NA, nrow = 5, ncol = n_obs)
weight_history[1, ] <- weights
for (round in 1:4) {
# Weighted sampling: higher-weight instances are more likely to be misclassified
# (weak learner trained on weighted data still gets some hard cases wrong)
n_wrong <- sample(2:5, 1)
wrong <- sample(1:n_obs, n_wrong, prob = weights, replace = FALSE)
r_j <- sum(weights[wrong])
r_j <- max(min(r_j, 0.49), 0.01)
alpha_j <- log((1 - r_j) / r_j)
weights[wrong] <- weights[wrong] * exp(alpha_j)
weights <- weights / sum(weights)
weight_history[round + 1, ] <- weights
}
weight_df <- as.data.frame(weight_history) %>%
mutate(round = 0:4) %>%
pivot_longer(-round, names_to = "obs", values_to = "weight") %>%
mutate(obs = as.integer(gsub("V", "", obs)))
ggplot(weight_df, aes(x = obs, y = weight, fill = factor(round))) +
geom_col(position = "dodge") +
facet_wrap(~paste("Round", round), nrow = 1) +
scale_fill_viridis_d(option = "plasma", begin = 0.2, end = 0.85) +
labs(
title = "AdaBoost: Instance Weights Across Rounds",
subtitle = "Misclassified instances get higher weights — harder examples dominate future learners",
x = "Training Instance", y = "Weight"
) +
theme_minimal() +
theme(legend.position = "none", axis.text.x = element_blank())

Gradient Boosting is more general than AdaBoost. Instead of re-weighting instances, it fits each new tree to the residual errors (pseudo-residuals) of the previous ensemble [1].
\[F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)\]
where \(h_m\) is a weak learner fit to the negative gradient of the loss function.
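For squared-error loss, the negative gradient is just the ordinary residual, which is why the manual implementation below simply fits each tree to \(y\) minus the current prediction:

\[L(y, F) = \tfrac{1}{2}(y - F)^2 \quad\Longrightarrow\quad -\frac{\partial L}{\partial F} = y - F\]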
The learning rate (\(\eta\)) scales each tree’s contribution: lower values require more trees but generalize better (shrinkage regularization). In R’s gbm, this parameter is called shrinkage.
set.seed(MASTER_SEED)
# Simulate a noisy quadratic function
n_pts <- 200
X_sim <- runif(n_pts, -3, 3)
y_sim <- X_sim^2 - 2 * X_sim + rnorm(n_pts, sd = 1.5)
sim_df <- data.frame(x = X_sim, y = y_sim)
train_sim <- sim_df[1:150, ]
test_sim <- sim_df[151:200, ]
# Manual gradient boosting (MSE loss, residuals = negative gradient)
lr <- 0.3
trees <- list()
F_pred_train <- rep(mean(train_sim$y), nrow(train_sim))
F_pred_test <- rep(mean(train_sim$y), nrow(test_sim))
stages <- c(1, 3, 5, 10, 20, 50)
stage_preds <- list()
for (m in 1:max(stages)) {
residuals_m <- train_sim$y - F_pred_train
tree_m <- rpart(residuals_m ~ x, data = train_sim,
control = rpart.control(maxdepth = 2, minsplit = 5))
trees[[m]] <- tree_m
F_pred_train <- F_pred_train + lr * predict(tree_m, train_sim)
F_pred_test <- F_pred_test + lr * predict(tree_m, test_sim)
if (m %in% stages) {
stage_preds[[as.character(m)]] <- F_pred_test
}
}
x_seq <- seq(-3, 3, length.out = 200)
plot_list <- lapply(names(stage_preds), function(s) {
pred_vals <- stage_preds[[s]]
rmse_s <- sqrt(mean((test_sim$y - pred_vals)^2))
F_curve <- rep(mean(train_sim$y), 200)
for (t_m in trees[1:as.integer(s)]) {
F_curve <- F_curve + lr * predict(t_m, data.frame(x = x_seq))
}
ggplot(train_sim, aes(x = x, y = y)) +
geom_point(alpha = 0.3, size = 1.5, color = "grey50") +
geom_line(data = data.frame(x = x_seq, y = F_curve),
aes(x = x, y = y), color = "#4E79A7", linewidth = 1.2) +
stat_function(fun = function(x) x^2 - 2*x, color = "#E15759",
linetype = "dashed", linewidth = 1) +
labs(
title = paste0(s, ifelse(s == 1, " tree", " trees")),
subtitle = sprintf("Test RMSE: %.2f", rmse_s),
x = "x", y = "y"
) +
coord_cartesian(ylim = c(-5, 10)) +
theme_minimal(base_size = 9)
})
do.call(grid.arrange, c(plot_list, ncol = 3,
top = "Gradient Boosting: Fitting Residuals Stage by Stage\n(Blue = Ensemble, Red Dashed = True Function)"))gbmset.seed(MASTER_SEED)
iris_gbm <- iris
iris_gbm$Species_bin <- ifelse(iris$Species == "virginica", 1, 0)
gbm_fit <- gbm(
Species_bin ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data = iris_gbm[train_idx, ],
distribution = "bernoulli",
n.trees = 200,
interaction.depth = 3,
shrinkage = 0.1,
cv.folds = 5,
verbose = FALSE
)
best_iter <- gbm.perf(gbm_fit, method = "cv", plot.it = FALSE)
cat("Best number of trees (CV):", best_iter)

## Best number of trees (CV): 43
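One detail worth flagging: `predict()` for a gbm model should always be given `n.trees` explicitly, otherwise it does not know to stop at the CV-selected iteration. A self-contained sketch (its own seed, split, and settings are illustrative, not the tutorial's `train_idx`):

```r
library(gbm)

set.seed(123)
df <- iris
df$y <- as.integer(df$Species == "virginica")
idx <- sample(nrow(df), 100)               # illustrative 100/50 split

fit <- gbm(y ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = df[idx, ], distribution = "bernoulli",
           n.trees = 200, interaction.depth = 3, shrinkage = 0.1,
           cv.folds = 3, verbose = FALSE)
best <- gbm.perf(fit, method = "cv", plot.it = FALSE)

# type = "response" maps the log-odds back to probabilities in [0, 1]
p_hat <- predict(fit, newdata = df[-idx, ], n.trees = best, type = "response")
holdout_acc <- mean((p_hat > 0.5) == df$y[-idx])
```

Passing `n.trees = best` is how the CV-based early stopping actually reaches the predictions; omitting it makes gbm fall back on all 200 trees.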
# Feature importance from GBM
imp_gbm <- summary(gbm_fit, n.trees = best_iter, plotit = FALSE)
ggplot(imp_gbm, aes(x = reorder(var, rel.inf), y = rel.inf, fill = rel.inf)) +
geom_col() +
coord_flip() +
scale_fill_viridis_c(option = "plasma", begin = 0.2, end = 0.85) +
labs(
title = "GBM Feature Importance (virginica vs. other)",
subtitle = "Relative influence = % reduction in loss from each feature",
x = NULL, y = "Relative Influence (%)"
) +
theme_minimal() +
theme(legend.position = "none")

set.seed(MASTER_SEED)
# 5-fold CV comparison on iris (versicolor vs. other, binary)
iris_bin_df <- iris
iris_bin_df$target <- factor(ifelse(iris$Species == "versicolor", "yes", "no"))
folds <- cut(sample(1:150), breaks = 5, labels = FALSE)
cv_acc <- sapply(1:5, function(fold) {
tr <- iris_bin_df[folds != fold, ]
te <- iris_bin_df[folds == fold, ]
# Single tree
t_m <- rpart(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data = tr, control = rpart.control(maxdepth = 4))
acc_tree <- mean(predict(t_m, te, type = "class") == te$target)
# Random Forest
rf_m <- randomForest(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data = tr, ntree = 200)
acc_rf <- mean(predict(rf_m, te) == te$target)
c(tree = acc_tree, rf = acc_rf)
})
cv_results <- data.frame(
Model = c("Decision Tree", "Random Forest"),
Mean_Acc = rowMeans(cv_acc),
SD_Acc = apply(cv_acc, 1, sd)
)
knitr::kable(cv_results, digits = 4,
col.names = c("Model", "Mean CV Accuracy", "SD"),
caption = "5-Fold CV Accuracy: Single Tree vs. Random Forest")

|      | Model         | Mean CV Accuracy | SD     |
|---|---|---|---|
| tree | Decision Tree | 0.9267 | 0.0149 |
| rf   | Random Forest | 0.9467 | 0.0183 |
| Property | Decision Tree | Random Forest | Extra-Trees | Gradient Boosting |
|---|---|---|---|---|
| Training speed | Fast | Moderate | Fast ✓ | Slow (sequential) |
| Prediction speed | Fast | Moderate | Moderate | Fast |
| Parallelizable | N/A | Yes ✓ | Yes ✓ | Partial |
| Handles overfitting | Poor (needs pruning) | Good ✓ | Better ✓ | Good (shrinkage) |
| Interpretable | High ✓ | Low ✗ | Low ✗ | Low ✗ |
| Feature importance | Yes (limited) | Yes ✓ | Yes ✓ | Yes ✓ |
| Works on small data | OK | OK | OK | OK |
| Handles missing values | Some (surrogates) | Some (imputation) | Some | Some |
| Hyperparameter sensitivity | Low | Low-Medium | Low | High ! |
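The "Parallelizable" row can be made concrete: because bagged trees are independent, forests grown separately can simply be merged. A sketch assuming the randomForest package, whose `combine()` function is its own forest merger (here the two halves are grown sequentially for simplicity; in real use they would run on separate cores):

```r
library(randomForest)

set.seed(123)
# Grow two half-size forests and merge them into one 200-tree forest.
rf_a <- randomForest(Species ~ ., data = iris, ntree = 100)
rf_b <- randomForest(Species ~ ., data = iris, ntree = 100)
rf_all <- randomForest::combine(rf_a, rf_b)
rf_all$ntree   # 200 trees in the merged forest
```

Boosting cannot be split this way because each tree depends on the ensemble built before it, which is why its "Parallelizable" entry is only "Partial" (parallelism happens within a tree, not across trees).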
| Scenario | Recommended Method |
|---|---|
| You need to explain the model to stakeholders | Single Decision Tree (or shallow RF) |
| You want the best out-of-the-box performance | Random Forest |
| Training time is limited | Extra-Trees (random thresholds = faster) |
| You have very noisy data | Random Forest (variance reduction handles noise) |
| You need to rank features | Random Forest (permutation importance is robust) |
| You want state-of-the-art on tabular data | Gradient Boosting / XGBoost (wins Kaggle) |
Stacking (stacked generalization) replaces the simple voting/averaging aggregation with a trained meta-learner [1]:
set.seed(MASTER_SEED)
data(iris)
iris_bin2 <- iris
iris_bin2$target <- factor(ifelse(iris$Species == "virginica", 1, 0))
n <- nrow(iris_bin2)
k <- 5
fold_ids <- sample(rep(1:k, length.out = n))
meta_features <- matrix(NA, nrow = n, ncol = 2)
colnames(meta_features) <- c("tree_prob", "rf_prob")
for (f in 1:k) {
tr <- iris_bin2[fold_ids != f, ]
te <- iris_bin2[fold_ids == f, ]
t_m <- rpart(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data = tr, control = rpart.control(maxdepth = 3))
rf_m <- randomForest(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data = tr, ntree = 100)
meta_features[fold_ids == f, "tree_prob"] <- predict(t_m, te, type = "prob")[, "1"]
meta_features[fold_ids == f, "rf_prob"] <- predict(rf_m, te, type = "prob")[, "1"]
}
meta_df <- as.data.frame(meta_features)
meta_df$target <- iris_bin2$target
ggplot(meta_df, aes(x = tree_prob, y = rf_prob, color = target)) +
geom_point(size = 3, alpha = 0.8) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey60") +
scale_color_manual(values = c("0" = "#4E79A7", "1" = "#E15759"),
labels = c("0" = "Other", "1" = "Virginica")) +
labs(
title = "Stacking: Out-of-Fold Meta-Features",
subtitle = "Each point = one training example. Meta-learner learns from these predictions.",
x = "Decision Tree: P(virginica)", y = "Random Forest: P(virginica)",
color = "True Class"
) +
theme_minimal()

Note: The meta-features from each base learner become the inputs to the meta-learner (blender). In practice, you'd train a simple logistic regression or another RF on meta_df to produce the final ensemble prediction.
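The blender step can be sketched with base R alone. The frame below is a synthetic stand-in for meta_df (the column names match; the values are simulated for illustration):

```r
set.seed(7)
# Synthetic stand-in for the out-of-fold meta-features: overlapping probability
# ranges, so the classes are separable but not perfectly so.
meta_df <- data.frame(
  tree_prob = c(runif(50, 0, 0.6), runif(50, 0.4, 1)),
  rf_prob   = c(runif(50, 0, 0.6), runif(50, 0.4, 1)),
  target    = factor(rep(c(0, 1), each = 50))
)

# Meta-learner: logistic regression on the base models' predictions
blender <- glm(target ~ tree_prob + rf_prob, family = binomial, data = meta_df)

final_prob  <- predict(blender, meta_df, type = "response")
final_class <- factor(as.integer(final_prob > 0.5))
blend_acc   <- mean(final_class == meta_df$target)
```

In the tutorial's pipeline you would fit the same `glm()` on the real meta_df built above, then run new data through both base learners and the blender to get the final prediction.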
To combat overfitting in a Random Forest: (1) increase nodesize (larger minimum leaf → shallower trees), (2) decrease maxnodes (limit leaf count), (3) increase mtry (consider more features, more competition → less overfitting in some cases), (4) reduce ntree slightly if computation allows, (5) use the OOB error curve to confirm the trees have converged [1][2].
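Points (1) and (5) can be checked directly. A sketch assuming the randomForest package; the nodesize values are illustrative:

```r
library(randomForest)

set.seed(123)
# Same model, two minimum leaf sizes: larger nodesize grows shallower trees.
rf_deep    <- randomForest(Species ~ ., data = iris, ntree = 300, nodesize = 1)
rf_shallow <- randomForest(Species ~ ., data = iris, ntree = 300, nodesize = 10)

# err.rate accumulates the OOB error as trees are added; a flat tail means the
# forest has converged and extra trees are not buying anything.
oob_final <- c(deep    = rf_deep$err.rate[300, "OOB"],
               shallow = rf_shallow$err.rate[300, "OOB"])
```

Plotting `rf_deep$err.rate[, "OOB"]` against the tree index gives the convergence curve mentioned in point (5).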
In this tutorial, we built from the instability problem of a single Decision Tree to a full ensemble framework:
| Concept | Key Idea | R Tool |
|---|---|---|
| Voting | Aggregate diverse models | Manual + randomForest |
| Bagging | Bootstrap + average | Manual loop + rpart |
| Random Forest | Bagging + random features | randomForest() |
| OOB Evaluation | Free cross-validation | rf$err.rate |
| Feature Importance | Impurity or permutation | importance(), varImpPlot() |
| Extra-Trees | Random thresholds → faster | ranger(splitrule = "extratrees") |
| Gradient Boosting | Sequential residual fitting | gbm() |
| Stacking | Meta-learner on OOF preds | Manual k-fold loop |
The Big Picture: Random Forests are one of the most reliable out-of-the-box algorithms in machine learning — they handle high dimensionality, mixed feature types, and noisy data with minimal hyperparameter tuning. Understanding them deeply (bagging, decorrelation, OOB, importance) builds the foundation for understanding gradient boosting and other advanced ensembles.
[1] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd ed. Sebastopol, CA: O’Reilly Media, 2019, ch. 7.
[2] B. C. Boehmke and B. M. Greenwell, Hands-On Machine Learning with R. Boca Raton, FL: CRC Press, 2020, ch. 10–11. [Online]. Available: https://bradleyboehmke.github.io/HOML/
[3] L. Breiman, “Bagging Predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[4] L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[5] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely Randomized Trees,” Machine Learning, vol. 63, no. 1, pp. 3–42, 2006.
[6] Y. Freund and R. E. Schapire, “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[7] J. H. Friedman, “Greedy Function Approximation: A Gradient Boosting Machine,” Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
[8] A. Liaw and M. Wiener, “Classification and Regression by randomForest,” R News, vol. 2, no. 3, pp. 18–22, 2002. [Online]. Available: https://CRAN.R-project.org/package=randomForest
[9] Z. Huang, “seedhash: Deterministic Seed Generation from String Inputs Using MD5 Hashing,” R package, 2026. [Online]. Available: https://github.com/melhzy/seedhash