Decision Trees are one of the most intuitive and powerful models in machine learning. They are versatile — capable of performing both classification and regression — and they form the fundamental building blocks for more advanced algorithms like Random Forests.
Think of a Decision Tree like a game of “20 Questions.” You start with a question about your data, and the answer leads you to the next question, until you arrive at a final prediction.
“Decision Trees are intuitive, and their decisions are easy to interpret. Such models are often called white box models. In contrast, Random Forests or neural networks are generally considered black box models — they make great predictions, but it is usually hard to explain in simple terms why a particular prediction was made.” — A. Géron [1]
“Tree-based models are a class of nonparametric algorithms that work by partitioning the feature space into a number of smaller (non-overlapping) regions with similar response values using a set of splitting rules… decision trees offer many benefits; however, they typically lack in predictive performance compared to more complex algorithms. However, future chapters will discuss powerful ensemble algorithms — like random forests and gradient boosting machines — which are constructed by combining together many decision trees in a clever way.” — B. C. Boehmke & B. M. Greenwell [2]
Learning Objectives:
By the end of this tutorial, you will be able to:
- Explain how a Decision Tree partitions data and read each node of an rpart.plot diagram
- Compute and interpret impurity measures (Gini impurity and entropy) and the CART split cost
- Build, visualize, and prune classification and regression trees with rpart
- Control overfitting with regularization hyperparameters (cp, maxdepth, minsplit)

A Decision Tree is a flowchart-like structure. Every tree has three types of nodes:
| Node Type | Role |
|---|---|
| Root Node | The very first question — applied to the entire dataset |
| Internal Node | A follow-up question applied to a subset of the data |
| Leaf Node | A terminal node — gives the final prediction (class or value) |
Each node stores four key pieces of information:

- the split rule — the question asked at this node (e.g. Petal.Length < 2.45)
- samples — how many training instances reached this node
- value — how many instances of each class are at this node
- gini — the impurity of this node (0 = perfectly pure)

Let’s build our first tree using a classic teaching dataset: predicting whether to play outside based on weather.
play_data <- data.frame(
Outlook = c("Sunny","Sunny","Overcast","Rainy","Rainy","Rainy",
"Overcast","Sunny","Sunny","Rainy","Sunny","Overcast","Overcast","Rainy"),
Temperature = c("Hot","Hot","Hot","Mild","Cool","Cool","Cool",
"Mild","Cool","Mild","Mild","Mild","Hot","Mild"),
Humidity = c("High","High","High","High","Normal","Normal","Normal",
"High","Normal","Normal","Normal","High","Normal","High"),
Windy = c(FALSE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,
FALSE,FALSE,FALSE,TRUE,TRUE,FALSE,TRUE),
Play = c("No","No","Yes","Yes","Yes","No","Yes",
"No","Yes","Yes","Yes","Yes","Yes","No")
)
knitr::kable(play_data, caption = "The 'Play Outside' Dataset (14 observations)")

| Outlook | Temperature | Humidity | Windy | Play |
|---|---|---|---|---|
| Sunny | Hot | High | FALSE | No |
| Sunny | Hot | High | TRUE | No |
| Overcast | Hot | High | FALSE | Yes |
| Rainy | Mild | High | FALSE | Yes |
| Rainy | Cool | Normal | FALSE | Yes |
| Rainy | Cool | Normal | TRUE | No |
| Overcast | Cool | Normal | TRUE | Yes |
| Sunny | Mild | High | FALSE | No |
| Sunny | Cool | Normal | FALSE | Yes |
| Rainy | Mild | Normal | FALSE | Yes |
| Sunny | Mild | Normal | TRUE | Yes |
| Overcast | Mild | High | TRUE | Yes |
| Overcast | Hot | Normal | FALSE | Yes |
| Rainy | Mild | High | TRUE | No |
# minsplit = 2 allows splits on this small 14-row teaching example
# (default minsplit = 20 would prevent any splits)
play_tree <- rpart(Play ~ ., data = play_data, method = "class",
control = rpart.control(minsplit = 2))
print(play_tree)

## n= 14
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 14 5 Yes (0.3571429 0.6428571)
## 2) Outlook=Rainy,Sunny 10 5 No (0.5000000 0.5000000)
## 4) Humidity=High 5 1 No (0.8000000 0.2000000)
## 8) Outlook=Sunny 3 0 No (1.0000000 0.0000000) *
## 9) Outlook=Rainy 2 1 No (0.5000000 0.5000000)
## 18) Windy>=0.5 1 0 No (1.0000000 0.0000000) *
## 19) Windy< 0.5 1 0 Yes (0.0000000 1.0000000) *
## 5) Humidity=Normal 5 1 Yes (0.2000000 0.8000000)
## 10) Windy>=0.5 2 1 No (0.5000000 0.5000000)
## 20) Outlook=Rainy 1 0 No (1.0000000 0.0000000) *
## 21) Outlook=Sunny 1 0 Yes (0.0000000 1.0000000) *
## 11) Windy< 0.5 3 0 Yes (0.0000000 1.0000000) *
## 3) Outlook=Overcast 4 0 Yes (0.0000000 1.0000000) *
rpart.plot(play_tree,
type = 4,
extra = 101,
box.palette = "BuGn",
           main = "Decision Tree: Should We Play Outside?")

Every node in the plot produced by rpart.plot(..., extra = 101) displays three lines:
┌─────────────┐
│ Yes │ ← Line 1: Predicted class (majority class at this node)
│ 5 9 │ ← Line 2: Class counts in alphabetical order (No=5, Yes=9)
│ 100% │ ← Line 3: % of total training data that reached this node
└─────────────┘
Let’s trace each piece for the root node
(Yes | 5 9 | 100%):
| Line | Value | Meaning |
|---|---|---|
| Predicted class | Yes | The majority class: 9 Yes vs. 5 No, so the node predicts “Yes” |
| Class counts | 5 9 | 5 observations of class “No” and 9 of class “Yes” (alphabetical order) |
| % of data | 100% | 14 out of 14 training instances pass through the root — always 100% |
Now trace a leaf node
(No | 3 0 | 21%):
| Line | Value | Meaning |
|---|---|---|
| Predicted class | No | All samples here are “No” — a perfectly pure leaf |
| Class counts | 3 0 | 3 “No” instances and 0 “Yes” instances |
| % of data | 21% | 3 out of 14 total training instances end up at this leaf (3/14 ≈ 21%) |
Branch direction: The left branch always follows the
condition printed above it (TRUE path); the right branch is FALSE. For
example, the root splits on Outlook = Rainy, Sunny —
observations where Outlook is Rainy or Sunny go left;
Overcast goes right.
Node color: In the BuGn palette,
darker green = higher confidence “Yes”; blue =
“No”. The shade intensity reflects how pure the node is — a
perfectly pure leaf is solid-colored, a mixed node is lighter.
Key insight: The percentages across all leaf nodes must add up to 100% — every training observation ends up in exactly one leaf. Checking this is a quick sanity test: 21% + 7% + 7% + 7% + 7% + 21% + 29% = 99% ≈ 100% (rounding).
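The root-node breakdown printed below can be reproduced by hand. A minimal sketch in base R (the play vector is simply the Play column of play_data, restated here so the snippet runs on its own):

```r
# Class labels at the root node (the full Play column from play_data)
play <- c("No","No","Yes","Yes","Yes","No","Yes",
          "No","Yes","Yes","Yes","Yes","Yes","No")
counts <- table(play)               # No = 5, Yes = 9
props  <- counts / sum(counts)      # class proportions at the root
gini_root <- 1 - sum(props^2)       # 1 - ((5/14)^2 + (9/14)^2)
cat("Predicted class:", names(which.max(counts)), "\n")  # Yes (majority)
cat("Gini impurity:", round(gini_root, 3), "\n")         # 0.459
```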
## Root node breakdown:
## Total observations: 14
## Class 'No' count: 5 ( 36 % )
## Class 'Yes' count: 9 ( 64 % )
## Predicted class: Yes (majority)
## Gini impurity: 1 - ((5/14)^2 + (9/14)^2) = 0.459
How to read this tree (summary):

- A leaf with counts 3 0 is perfectly pure (Gini = 0) — ideal.
- A node with counts 1 1 is maximally impure (Gini = 0.5).

The tree must decide which feature and which threshold to use at each node. It does this by measuring impurity — how mixed the classes are within each resulting group. A good split creates child nodes that are as pure (unmixed) as possible.
The algorithm used in rpart is CART
(Classification and Regression Trees). It is a greedy
algorithm: at every node, it searches through every feature and
every possible split threshold, picks the one that minimizes impurity in
the two resulting child nodes, then repeats recursively.
CART does not look ahead. It picks the locally best split at each step without checking if a suboptimal split now might lead to a better tree later. This is why we must settle for a “reasonably good” solution rather than a guaranteed optimal one.
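The greedy search is easy to sketch by hand. The helper below (function names are my own, not part of rpart) scans every midpoint between consecutive sorted Petal.Length values in iris and returns the threshold with the lowest weighted Gini cost:

```r
# Gini impurity of a set of labels
gini <- function(y) { p <- table(y) / length(y); 1 - sum(p^2) }

# Weighted Gini cost of the split `x < thr`
split_cost <- function(x, y, thr) {
  left  <- y[x < thr]
  right <- y[x >= thr]
  (length(left) * gini(left) + length(right) * gini(right)) / length(y)
}

# Candidate thresholds: midpoints between consecutive unique values
xs  <- sort(unique(iris$Petal.Length))
thr <- (head(xs, -1) + tail(xs, -1)) / 2
costs <- sapply(thr, split_cost, x = iris$Petal.Length, y = iris$Species)
thr[which.min(costs)]  # 2.45 -- exactly the root split rpart finds
```

This is the whole idea of CART in miniature: the real algorithm repeats this scan for every feature at every node.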
The default impurity measure in CART is Gini Impurity:
\[G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2\]
Where \(p_{i,k}\) is the proportion of class \(k\) instances at node \(i\).
calculate_gini <- function(p1) 1 - (p1^2 + (1 - p1)^2)
gini_df <- data.frame(p = seq(0, 1, 0.01)) |>
mutate(gini = calculate_gini(p))
ggplot(gini_df, aes(x = p, y = gini)) +
geom_line(color = "darkcyan", linewidth = 1.5) +
geom_point(aes(x = 0.5, y = 0.5), color = "red", size = 4) +
annotate("text", x = 0.5, y = 0.46, label = "Max Impurity (0.5)", color = "red", size = 4) +
annotate("text", x = 0.05, y = 0.04, label = "Pure\n(Gini = 0)", color = "darkgreen", size = 3.5) +
annotate("text", x = 0.95, y = 0.04, label = "Pure\n(Gini = 0)", color = "darkgreen", size = 3.5) +
labs(title = "Gini Impurity for a Two-Class Problem",
x = "Proportion of Class 'Yes'", y = "Gini Impurity") +
theme_minimal(base_size = 13)

Entropy is a second common impurity measure, borrowed from information theory. It measures the average surprise or uncertainty in a node:
\[H_i = -\sum_{k=1}^{n} p_{i,k} \log_2(p_{i,k}) \quad (\text{terms where } p_{i,k} = 0 \text{ are set to 0})\]
entropy_fn <- function(p) {
if (p == 0 || p == 1) return(0)
-(p * log2(p) + (1 - p) * log2(1 - p))
}
compare_df <- data.frame(p = seq(0.001, 0.999, 0.001)) |>
mutate(
Gini = calculate_gini(p),
Entropy = sapply(p, entropy_fn) / 2 # scaled to [0,0.5] for visual comparison
) |>
pivot_longer(c(Gini, Entropy), names_to = "Measure", values_to = "Value")
ggplot(compare_df, aes(x = p, y = Value, color = Measure)) +
geom_line(linewidth = 1.4) +
scale_color_manual(values = c("darkcyan", "coral")) +
labs(title = "Gini Impurity vs. Entropy (Entropy scaled by ½ for comparison)",
x = "Proportion of Positive Class", y = "Impurity Value",
caption = "Both measures peak at 0.5 (maximum disorder) and reach 0 at purity.") +
theme_minimal(base_size = 13) +
theme(legend.position = "top")

Which should you use?
| Property | Gini | Entropy |
|---|---|---|
| Computation | Faster (no logarithm) | Slightly slower |
| Behavior | Tends to isolate the most frequent class in its own branch | Tends to produce slightly more balanced trees |
| Default in rpart | Yes (parms = list(split = "gini")) | No (parms = list(split = "information")) |
In practice, they produce very similar trees. Gini is the standard default.
par(mfrow = c(1, 2))
tree_gini <- rpart(Species ~ Petal.Length + Petal.Width, data = iris,
method = "class", parms = list(split = "gini"))
tree_entropy <- rpart(Species ~ Petal.Length + Petal.Width, data = iris,
method = "class", parms = list(split = "information"))
rpart.plot(tree_gini, main = "Gini Impurity", box.palette = "auto", extra = 101)
rpart.plot(tree_entropy, main = "Entropy (Info Gain)", box.palette = "auto", extra = 101)

At every node, CART evaluates every possible split and picks the one minimizing this weighted cost:
\[J(k, t_k) = \frac{m_{\text{left}}}{m} G_{\text{left}} + \frac{m_{\text{right}}}{m} G_{\text{right}}\]
Let’s calculate this manually for the first split in the iris tree
(Petal.Length < 2.45):
# Reproduce the iris root split: Petal.Length < 2.45
left <- iris[iris$Petal.Length < 2.45, ] # setosa only
right <- iris[iris$Petal.Length >= 2.45, ] # versicolor + virginica
gini_node <- function(df) {
props <- table(df$Species) / nrow(df)
1 - sum(props^2)
}
m <- nrow(iris)
m_left <- nrow(left)
m_right <- nrow(right)
G_left <- gini_node(left)
G_right <- gini_node(right)
J <- (m_left / m) * G_left + (m_right / m) * G_right
cat("--- Root split: Petal.Length < 2.45 ---\n")

## --- Root split: Petal.Length < 2.45 ---
## Left node: 50 samples, Gini = 0.0000
## Right node: 100 samples, Gini = 0.5000
cat(sprintf("Weighted cost J = (%.0f/%.0f)*%.4f + (%.0f/%.0f)*%.4f = %.4f\n",
            m_left, m, G_left, m_right, m, G_right, J))

## Weighted cost J = (50/150)*0.0000 + (100/150)*0.5000 = 0.3333
##
## CART selects the (feature, threshold) pair that minimizes J.
The left node is pure (Gini = 0): all 50 setosa fall there. The right node still has mixed versicolor and virginica, so CART will continue splitting it.
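Running the same exhaustive midpoint search on the impure right node recovers the tree’s next split (a sketch with my own helper names, restricted to Petal.Width):

```r
# Gini impurity of a set of labels
gini <- function(y) { p <- table(y) / length(y); 1 - sum(p^2) }

# The impure right child: all flowers with Petal.Length >= 2.45
right <- iris[iris$Petal.Length >= 2.45, ]

# Scan midpoints between consecutive unique Petal.Width values
xs  <- sort(unique(right$Petal.Width))
thr <- (head(xs, -1) + tail(xs, -1)) / 2
costs <- sapply(thr, function(t) {
  l <- right$Species[right$Petal.Width < t]
  r <- right$Species[right$Petal.Width >= t]
  (length(l) * gini(l) + length(r) * gini(r)) / nrow(right)
})
thr[which.min(costs)]  # 1.75 -- the Petal.Width split at the next level
```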
A Decision Tree does not just predict a class — it also returns the probability that an instance belongs to each class. This probability is simply the proportion of training instances of that class in the leaf node where the prediction lands.
# Build a classification tree on iris using Petal features only
iris_tree <- rpart(Species ~ Petal.Length + Petal.Width,
data = iris, method = "class")
# --- Hard class prediction ---
new_flower <- data.frame(Petal.Length = 5.0, Petal.Width = 1.5)
pred_class <- predict(iris_tree, new_flower, type = "class")
pred_prob <- predict(iris_tree, new_flower, type = "prob")
cat("New flower: Petal.Length = 5.0 cm, Petal.Width = 1.5 cm\n\n")

## New flower: Petal.Length = 5.0 cm, Petal.Width = 1.5 cm
## Predicted class: versicolor
## Class probabilities:
## setosa versicolor virginica
## 1 0 0.9074 0.0926
# Create a prediction grid over the petal feature space
pl_range <- seq(0.5, 7.5, length.out = 200)
pw_range <- seq(0.0, 2.8, length.out = 200)
grid <- expand.grid(Petal.Length = pl_range, Petal.Width = pw_range)
# Probability of being virginica
grid$prob_virginica <- predict(iris_tree, grid, type = "prob")[, "virginica"]
ggplot(grid, aes(x = Petal.Length, y = Petal.Width, fill = prob_virginica)) +
geom_tile() +
geom_point(data = iris, aes(x = Petal.Length, y = Petal.Width,
shape = Species, fill = NULL), size = 2, alpha = 0.8) +
scale_fill_gradient2(low = "steelblue", mid = "white", high = "tomato",
midpoint = 0.5, name = "P(virginica)") +
scale_shape_manual(values = c(21, 22, 24)) +
labs(title = "Class Probability Map: P(Iris virginica)",
subtitle = "Color shows probability of virginica at each point in feature space",
x = "Petal Length (cm)", y = "Petal Width (cm)") +
theme_minimal(base_size = 13)

Notice that the probability is constant within each rectangular region — it jumps only at the split boundaries. This is fundamental to how trees make predictions: every point in the same leaf gets exactly the same probability vector.
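A quick check of this claim, refitting the same petal-only tree: two distinct flowers that land in the same leaf receive identical probability vectors.

```r
library(rpart)

# Same petal-only tree as above
iris_tree <- rpart(Species ~ Petal.Length + Petal.Width,
                   data = iris, method = "class")

# Two different flowers that fall inside the same rectangular leaf region
two_flowers <- data.frame(Petal.Length = c(4.8, 5.2),
                          Petal.Width  = c(1.4, 1.6))
probs <- predict(iris_tree, two_flowers, type = "prob")
identical(probs[1, ], probs[2, ])  # TRUE: same leaf, same probabilities
```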
One of the most revealing ways to understand what a decision tree has learned is to visualize its decision boundaries — the lines that separate predicted classes in feature space.
Because CART splits on a single feature at a time, all boundaries are axis-aligned rectangles. This is an important limitation: if the true boundary is diagonal, the tree approximates it with a “staircase” of right-angle cuts.
# Grid prediction for all three species
grid$pred_class <- predict(iris_tree, grid, type = "class")
ggplot() +
geom_tile(data = grid,
aes(x = Petal.Length, y = Petal.Width, fill = pred_class),
alpha = 0.3) +
geom_point(data = iris,
aes(x = Petal.Length, y = Petal.Width,
color = Species, shape = Species),
size = 2.5, alpha = 0.9) +
scale_fill_manual(values = c("setosa" = "#4E79A7",
"versicolor" = "#F28E2B",
"virginica" = "#E15759"),
name = "Predicted") +
scale_color_manual(values = c("setosa" = "#4E79A7",
"versicolor" = "#F28E2B",
"virginica" = "#E15759"),
name = "Actual") +
labs(title = "Decision Boundaries of the Iris Tree",
subtitle = "Background color = predicted class | Points = actual class",
x = "Petal Length (cm)", y = "Petal Width (cm)") +
theme_minimal(base_size = 13)

rpart.plot(iris_tree, type = 4, extra = 101, box.palette = "auto",
           main = "The Iris Tree (Petal Features Only)")

Tracing a prediction through the tree:

1. Petal.Length < 2.45? → If YES → predict setosa (pure leaf, Gini = 0)
2. If NO → Petal.Width < 1.75? → If YES → predict versicolor; if NO → predict virginica

The rectangular region in the top-left of the boundary plot (all blue, all setosa) corresponds exactly to the left branch of the root node.
If a tree is allowed to grow without any restriction, it will keep splitting until every leaf contains exactly one class — or even exactly one training instance. This is called overfitting: the tree memorizes the training data so precisely (including its noise) that it fails to generalize to new data.
par(mfrow = c(1, 2))
# Fully grown tree (default rpart already applies minimal stopping, but cp=0 grows more)
full_tree <- rpart(Species ~ ., data = iris, method = "class",
control = rpart.control(cp = 0, minsplit = 2))
pruned_tree <- rpart(Species ~ ., data = iris, method = "class")
rpart.plot(full_tree, main = "Fully Grown Tree (cp = 0)\nOverfits — too many splits", box.palette = "auto")
rpart.plot(pruned_tree, main = "Default Tree (cp = 0.01)\nMore generalizable", box.palette = "auto")

rpart uses a complexity parameter (cp) to penalize tree size. A split is only accepted if it improves the overall fit by at least cp. Larger cp → smaller, simpler tree.
full_iris_tree <- rpart(Species ~ ., data = iris, method = "class",
control = rpart.control(cp = 0))
plotcp(full_iris_tree)
printcp(full_iris_tree)

##
## Classification tree:
## rpart(formula = Species ~ ., data = iris, method = "class", control = rpart.control(cp = 0))
##
## Variables actually used in tree construction:
## [1] Petal.Length Petal.Width
##
## Root node error: 100/150 = 0.66667
##
## n= 150
##
## CP nsplit rel error xerror xstd
## 1 0.50 0 1.00 1.24 0.046361
## 2 0.44 1 0.50 0.80 0.061101
## 3 0.00 2 0.06 0.10 0.030551
# Select cp at the row with minimum cross-validated error
best_cp <- full_iris_tree$cptable[which.min(full_iris_tree$cptable[, "xerror"]), "CP"]
cat("Optimal cp selected:", round(best_cp, 4), "\n")

## Optimal cp selected: 0
pruned_iris_tree <- prune(full_iris_tree, cp = best_cp)
rpart.plot(pruned_iris_tree,
main = "Pruned Iris Tree (Optimal cp)",
           box.palette = "auto", extra = 101)

Beyond cp, rpart provides hyperparameters that directly constrain the tree's structure. These are often more interpretable than cp:
| Parameter | What it controls | Effect of increasing |
|---|---|---|
| maxdepth | Maximum levels from root to leaf | Smaller tree |
| minsplit | Min samples needed to attempt a split | Fewer splits |
| minbucket | Min samples required in any leaf | Fewer, larger leaves |

(Note: rpart has no max_leaf_nodes parameter; the total number of leaves is capped indirectly through maxdepth, minbucket, and cp.)
# Create a noisier, harder dataset using the moons-like concept with iris
set.seed(42)
n <- 200
x1 <- c(rnorm(n/2, 2, 1), rnorm(n/2, 4, 1))
x2 <- c(rnorm(n/2, 2, 1), rnorm(n/2, 4, 1))
y <- factor(ifelse(x1 + x2 + rnorm(n, 0, 0.5) < 6, "A", "B"))
noisy_data <- data.frame(x1, x2, y)
configs <- list(
list(label = "No restriction (cp=0)", cp = 0, minsplit = 2, maxdepth = 30),
list(label = "max_depth = 2", cp = 0, minsplit = 2, maxdepth = 2),
list(label = "minsplit = 30", cp = 0, minsplit = 30, maxdepth = 30),
list(label = "Balanced (cp=0.01, min=10)", cp = 0.01, minsplit = 10, maxdepth = 30)
)
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
for (cfg in configs) {
t <- rpart(y ~ x1 + x2, data = noisy_data, method = "class",
control = rpart.control(cp = cfg$cp, minsplit = cfg$minsplit,
maxdepth = cfg$maxdepth))
grid2 <- expand.grid(x1 = seq(min(x1)-0.5, max(x1)+0.5, length.out = 150),
x2 = seq(min(x2)-0.5, max(x2)+0.5, length.out = 150))
grid2$pred <- predict(t, grid2, type = "class")
plot(grid2$x1, grid2$x2,
col = ifelse(grid2$pred == "A", "#AED6F1", "#FADBD8"),
pch = 15, cex = 0.4, xlab = "x1", ylab = "x2",
main = sprintf("%s\n(leaves: %d)", cfg$label, sum(t$frame$var == "<leaf>")))
points(noisy_data$x1, noisy_data$x2,
col = ifelse(noisy_data$y == "A", "#1A5276", "#922B21"),
pch = ifelse(noisy_data$y == "A", 16, 17),
cex = 0.8)
legend("topright", legend = c("A","B"),
col = c("#1A5276","#922B21"), pch = c(16,17), cex = 0.7)
}

Reading the comparison:

- No restriction (cp = 0): the boundary wraps around individual noisy points, a clear sign of overfitting.
- maxdepth = 2: only a few coarse cuts, which risks underfitting.
- minsplit = 30: larger leaves yield a smoother, more stable boundary.
- Balanced (cp = 0.01, minsplit = 10): combines the cp penalty with a minimum split count. Usually the best practical approach.

Decision trees are highly sensitive to small changes in the training data. Remove a single outlier, and you may get a completely different tree structure. This is called instability, and it is one of the most important limitations to understand.
par(mfrow = c(1, 2))
# Original iris tree
tree_original <- rpart(Species ~ Petal.Length + Petal.Width,
data = iris, method = "class")
rpart.plot(tree_original, main = "Original Iris Tree\n(all 150 flowers)",
box.palette = "auto", extra = 101)
# Remove just ONE borderline versicolor flower
borderline_idx <- which(iris$Petal.Length == max(iris$Petal.Length[iris$Species == "versicolor"]) &
iris$Species == "versicolor")[1]
iris_modified <- iris[-borderline_idx, ]
tree_modified <- rpart(Species ~ Petal.Length + Petal.Width,
data = iris_modified, method = "class")
rpart.plot(tree_modified,
main = sprintf("Modified Tree\n(removed 1 borderline versicolor, n=%d)", nrow(iris_modified)),
           box.palette = "auto", extra = 101)

grid_compare <- expand.grid(
Petal.Length = seq(0.5, 7.5, length.out = 300),
Petal.Width = seq(0.0, 2.8, length.out = 300)
)
grid_compare$original <- predict(tree_original, grid_compare, type = "class")
grid_compare$modified <- predict(tree_modified, grid_compare, type = "class")
p1 <- ggplot() +
geom_tile(data = grid_compare,
aes(x = Petal.Length, y = Petal.Width, fill = original), alpha = 0.35) +
geom_point(data = iris, aes(x = Petal.Length, y = Petal.Width, color = Species, shape = Species),
size = 2) +
scale_fill_manual(values = c("setosa"="#4E79A7","versicolor"="#F28E2B","virginica"="#E15759"),
name="Predicted") +
scale_color_manual(values = c("setosa"="#4E79A7","versicolor"="#F28E2B","virginica"="#E15759")) +
labs(title = "Original (n=150)", x = "Petal Length", y = "Petal Width") +
theme_minimal() + theme(legend.position = "bottom")
p2 <- ggplot() +
geom_tile(data = grid_compare,
aes(x = Petal.Length, y = Petal.Width, fill = modified), alpha = 0.35) +
geom_point(data = iris_modified,
aes(x = Petal.Length, y = Petal.Width, color = Species, shape = Species), size = 2) +
scale_fill_manual(values = c("setosa"="#4E79A7","versicolor"="#F28E2B","virginica"="#E15759"),
name="Predicted") +
scale_color_manual(values = c("setosa"="#4E79A7","versicolor"="#F28E2B","virginica"="#E15759")) +
labs(title = "After Removing 1 Flower (n=149)", x = "Petal Length", y = "Petal Width") +
theme_minimal() + theme(legend.position = "bottom")
gridExtra::grid.arrange(p1, p2, ncol = 2)

Even removing a single borderline observation can shift split thresholds and reorganize entire branches. This is why single decision trees are rarely used in production — Random Forests overcome this by averaging predictions across hundreds of trees built on random subsets of the data, making the ensemble much more stable.
Regression trees follow the same CART logic, but with two key changes:

- Each leaf predicts a numeric value (the mean of the training targets that reached it) instead of a class.
- The split cost minimizes the children's mean squared error (MSE) instead of their Gini impurity:
\[J(k, t_k) = \frac{m_{\text{left}}}{m} \text{MSE}_{\text{left}} + \frac{m_{\text{right}}}{m} \text{MSE}_{\text{right}}\]
x <- seq(0, 10, by = 0.1)
y <- sin(x) + rnorm(length(x), sd = 0.3)
regression_data <- data.frame(x, y)
ggplot(regression_data, aes(x = x, y = y)) +
geom_point(alpha = 0.6, color = "steelblue") +
geom_line(aes(y = sin(x)), color = "black", linewidth = 0.8, linetype = "dashed") +
labs(title = "Noisy Sine Wave — Our Regression Target",
subtitle = "Dashed line = true underlying sine function",
x = "x", y = "y") +
theme_minimal(base_size = 13)

regression_tree <- rpart(y ~ x, data = regression_data, method = "anova")
regression_data$predicted_y <- predict(regression_tree, regression_data)
ggplot(regression_data, aes(x = x, y = y)) +
geom_point(alpha = 0.4, color = "gray50") +
geom_step(aes(y = predicted_y), color = "red", linewidth = 1.2) +
geom_line(aes(y = sin(x)), color = "black", linewidth = 0.8, linetype = "dashed") +
labs(title = "Default Regression Tree Predictions",
subtitle = "Red steps = tree prediction | Dashed = true sine function",
x = "x", y = "y") +
theme_minimal(base_size = 13)

The tree approximates the smooth curve with step-wise constant values — one constant per leaf region.
depths <- c(1, 3, 6, 15)
plots <- list()
for (d in depths) {
t <- rpart(y ~ x, data = regression_data, method = "anova",
control = rpart.control(maxdepth = d, cp = 0))
regression_data[[paste0("pred_d", d)]] <- predict(t, regression_data)
n_leaves <- sum(t$frame$var == "<leaf>")
plots[[as.character(d)]] <- ggplot(regression_data, aes(x = x)) +
geom_point(aes(y = y), alpha = 0.3, color = "gray50") +
geom_step(aes(y = .data[[paste0("pred_d", d)]]), color = "red", linewidth = 1) +
geom_line(aes(y = sin(x)), color = "black", linewidth = 0.7, linetype = "dashed") +
labs(title = sprintf("max_depth = %d (%d leaves)", d, n_leaves),
subtitle = if (d <= 2) "Underfitting" else if (d >= 10) "Overfitting" else "Good fit",
x = "x", y = "y") +
theme_minimal(base_size = 11)
}
gridExtra::grid.arrange(grobs = plots, ncol = 2)

| max_depth | Behavior |
|---|---|
| 1 | Extreme underfitting — only one split, two constants |
| 3 | Reasonable approximation — captures the main waves |
| 6 | Good fit — tracks the curve without chasing noise |
| 15 | Severe overfitting — interpolates individual noisy points |
# Show MSE calculation for a candidate split in the regression tree
split_point <- 5.0
left_y <- regression_data$y[regression_data$x < split_point]
right_y <- regression_data$y[regression_data$x >= split_point]
mse_node <- function(vals) mean((vals - mean(vals))^2)
m <- nrow(regression_data)
J_split <- (length(left_y)/m) * mse_node(left_y) + (length(right_y)/m) * mse_node(right_y)
cat("--- Candidate split: x < 5.0 ---\n")

## --- Candidate split: x < 5.0 ---
cat(sprintf("Left region: %d points, mean=%.3f, MSE=%.4f\n",
            length(left_y), mean(left_y), mse_node(left_y)))

## Left region: 50 points, mean=0.146, MSE=0.5168
cat(sprintf("Right region: %d points, mean=%.3f, MSE=%.4f\n",
            length(right_y), mean(right_y), mse_node(right_y)))

## Right region: 51 points, mean=0.178, MSE=0.4427
## Weighted cost J = 0.4794
##
## CART evaluates this cost for every possible x threshold and picks the minimum.
| Advantage | Why it matters |
|---|---|
| White box model | Every decision is traceable — you can follow the exact path from input to prediction. Unlike neural networks, you can explain a tree’s decision to a non-technical stakeholder. |
| No feature scaling needed | Trees split on individual feature thresholds, so standardization or normalization is irrelevant. |
| Handles mixed data types | Categorical and numerical features can live in the same tree without preprocessing. |
| Non-linear patterns | Trees are not limited to linear boundaries — they can model complex, non-monotonic relationships. |
| Fast prediction | Prediction is O(log₂ m) — just traversing from root to leaf. Very efficient even for large datasets. |
| Built-in feature selection | Features that don’t improve purity are simply never used. The tree implicitly selects the most informative features. |
| Disadvantage | Details |
|---|---|
| Overfitting | Unconstrained trees memorize training noise. Requires careful regularization. |
| Instability | Small data changes can completely restructure the tree. Random Forests solve this by averaging many trees. |
| Orthogonal boundaries only | All splits are axis-aligned. Diagonal or curved true boundaries are approximated inefficiently with many steps. |
| Greedy algorithm | CART picks the locally best split at each node, which is not guaranteed to produce the globally optimal tree (this is NP-Complete). |
| Biased toward high-cardinality features | Features with many possible split points get more chances to be selected. Use cp and minsplit to mitigate this. |
| Operation | Complexity |
|---|---|
| Prediction | O(log₂ m) — traverse from root to leaf |
| Training | O(n × m log₂ m) — compare n features across m samples at each split |
Decision trees are fast to predict and reasonably fast to train. The main cost is finding the best split at each node, which requires checking every feature at every possible threshold.
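Tree depth is what drives the prediction cost. rpart numbers nodes level by level (children of node i are 2i and 2i + 1), so the depth actually reached can be read off the node ids stored in the fitted frame; a small sketch:

```r
library(rpart)

tree <- rpart(Species ~ ., data = iris, method = "class")
node_ids <- as.numeric(rownames(tree$frame))  # 1 = root; 2i / 2i+1 = children
depth <- max(floor(log2(node_ids)))           # deepest level in the tree
c(tree_depth = depth, log2_upper_bound = ceiling(log2(nrow(iris))))
```

A prediction traverses at most `depth` comparisons (here just 2), no matter how many of the 150 training rows exist.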
The instability and overfitting tendency of single decision trees motivated the development of Random Forests: an ensemble of many trees, each built on a random subset of the data and a random subset of the features. The final prediction is the majority vote (classification) or average (regression) across all trees.
# Simulate what happens when we average 5 differently-sampled regression trees
set.seed(42)
n_trees <- 5
all_preds <- matrix(NA, nrow = nrow(regression_data), ncol = n_trees)
for (i in seq_len(n_trees)) {
boot_idx <- sample(nrow(regression_data), replace = TRUE)
boot_dat <- regression_data[boot_idx, c("x", "y")]
t_i <- rpart(y ~ x, data = boot_dat, method = "anova",
control = rpart.control(maxdepth = 4, cp = 0))
all_preds[, i] <- predict(t_i, regression_data)
}
regression_data$forest_avg <- rowMeans(all_preds)
ggplot(regression_data, aes(x = x)) +
geom_point(aes(y = y), alpha = 0.3, color = "gray60") +
geom_line(aes(y = sin(x)), color = "black", linewidth = 0.8, linetype = "dashed") +
geom_step(aes(y = predicted_y), color = "red", linewidth = 0.8, alpha = 0.6) +
geom_line(aes(y = forest_avg), color = "darkgreen", linewidth = 1.4) +
annotate("text", x = 8.5, y = 1.1, label = "Single tree", color = "red", size = 4) +
annotate("text", x = 8.5, y = 0.85, label = "Forest avg (5 trees)", color = "darkgreen", size = 4) +
annotate("text", x = 8.5, y = 0.6, label = "True sine", color = "black", size = 4) +
labs(title = "Single Tree vs. Forest Average",
subtitle = "Averaging across trees smooths out the step-function artifact",
x = "x", y = "y") +
theme_minimal(base_size = 13)

The ensemble average is visibly smoother and closer to the true function than any single tree. This motivates Week 3: Random Forests.
When a tree splits on a feature, it reduces the impurity (or error) at that node. Summing this reduction across every split where a feature is used gives its total contribution to the model’s predictive power. Features that are never used in any split have zero importance.
As described by Boehmke & Greenwell [2]:
“To measure feature importance, the reduction in the loss function (e.g., SSE) attributed to each variable at each split is tabulated. In some instances, a single variable could be used multiple times in a tree; consequently, the total reduction in the loss function across all splits by a variable are summed up and used as the total feature importance.”
This means a single feature can appear in multiple splits — and its importance accumulates across all of them.
# Build a full classification tree on all iris features
iris_full_tree <- rpart(Species ~ ., data = iris, method = "class",
control = rpart.control(cp = 0.001))
# rpart stores total impurity reduction per feature in $variable.importance
imp <- iris_full_tree$variable.importance
imp_df <- data.frame(
Feature = names(imp),
Importance = as.numeric(imp)
) |> arrange(Importance)
ggplot(imp_df, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Feature Importance — Iris Classification Tree",
subtitle = "Total reduction in Gini impurity attributed to each feature",
x = "Feature", y = "Importance (Total Gini Reduction)") +
theme_minimal(base_size = 13)

# Access the raw importance values
imp <- iris_full_tree$variable.importance
imp_df <- data.frame(
Feature = names(imp),
Importance = round(imp, 2),
Pct = round(100 * imp / sum(imp), 1)
) |> arrange(desc(Importance))
knitr::kable(imp_df, col.names = c("Feature", "Importance (Gini Reduction)", "% of Total"),
             caption = "Feature importance values from the iris classification tree")

| Feature | Importance (Gini Reduction) | % of Total |
|---|---|---|
| Petal.Width | 88.97 | 34.2 |
| Petal.Length | 81.34 | 31.2 |
| Sepal.Length | 54.10 | 20.8 |
| Sepal.Width | 36.01 | 13.8 |
Interpretation: Petal.Length and
Petal.Width dominate — which matches what we saw in the
decision boundaries plot, where petal measurements drew clear, clean
boundaries between species. Sepal.Length and
Sepal.Width contribute very little.
# Regression tree on mtcars: predict fuel efficiency (mpg) from car attributes
reg_tree_cars <- rpart(mpg ~ ., data = mtcars, method = "anova",
control = rpart.control(cp = 0.01))
imp_reg <- reg_tree_cars$variable.importance
imp_reg_df <- data.frame(
Feature = names(imp_reg),
Importance = as.numeric(imp_reg)
) |> arrange(Importance)
ggplot(imp_reg_df, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_col(fill = "coral") +
coord_flip() +
labs(title = "Feature Importance — MPG Regression Tree (mtcars)",
subtitle = "Total reduction in SSE attributed to each feature",
x = "Feature", y = "Importance (Total SSE Reduction)") +
theme_minimal(base_size = 13)

rpart.plot(reg_tree_cars, type = 4, extra = 101, box.palette = "RdYlGn",
           main = "Regression Tree: Predicting MPG (mtcars)")

Key insight from the tree: cyl (number
of cylinders) is the dominant first split, and disp (engine
displacement) refines predictions further — consistent with automotive
intuition that engine size is the primary driver of fuel efficiency.
Note: Decision trees perform automatic feature selection — features that provide no impurity reduction simply never appear in the tree, and their importance is zero. Unlike regression models, you do not need to manually remove uninformative features. [2]
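To see this automatic feature selection in action, here is a minimal sketch (the Noise column and the random seed are illustrative assumptions added for this demo, not part of the tutorial's datasets):

```r
# Sketch: append a pure-noise feature to iris and check how the tree treats it.
library(rpart)

set.seed(42)
iris_noisy <- iris
iris_noisy$Noise <- rnorm(nrow(iris))  # carries no signal about Species

noise_tree <- rpart(Species ~ ., data = iris_noisy, method = "class")

# The noise feature should never appear as a primary split...
print(as.character(noise_tree$frame$var))

# ...and its importance, if listed at all, should be negligible
print(noise_tree$variable.importance)
```

One caveat: rpart's `$variable.importance` also credits surrogate splits, so a noise feature may show a small nonzero value; features with no primary or surrogate contribution are omitted from the importance table entirely.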
Feature importance tells you which features matter. Partial Dependence Plots (PDPs) tell you how a feature affects the prediction — its direction and shape of influence — while marginalizing (averaging) over all other features.
For a feature \(x_j\), the partial dependence function is:
\[\hat{f}_{x_j}(x_j) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}(x_j, \mathbf{x}_{i,-j})\]
Where \(\mathbf{x}_{i,-j}\) means “all features except \(x_j\), held at their observed values.”
In plain terms: for each possible value of \(x_j\), we predict on the entire dataset with that value forced, and average the predictions. This isolates the marginal effect of \(x_j\).
# Manual PDP: for each unique value of a feature, force that value across
# all observations and average the predictions
manual_pdp <- function(model, data, feature, grid_n = 50) {
vals <- seq(min(data[[feature]]), max(data[[feature]]), length.out = grid_n)
yhat <- sapply(vals, function(v) {
d_temp <- data
d_temp[[feature]] <- v
mean(predict(model, d_temp))
})
data.frame(x = vals, yhat = yhat)
}
pd_cyl <- manual_pdp(reg_tree_cars, mtcars, "cyl", grid_n = 30)
pd_disp <- manual_pdp(reg_tree_cars, mtcars, "disp", grid_n = 60)
p_cyl <- ggplot(pd_cyl, aes(x = x, y = yhat)) +
geom_step(color = "coral", linewidth = 1.4) +
geom_point(color = "coral", size = 2) +
labs(title = "PDP: Number of Cylinders",
subtitle = "Avg predicted MPG with cyl forced to each value",
x = "cyl (cylinders)", y = "Predicted MPG") +
theme_minimal(base_size = 12)
p_disp <- ggplot(pd_disp, aes(x = x, y = yhat)) +
geom_step(color = "steelblue", linewidth = 1.4) +
labs(title = "PDP: Engine Displacement",
subtitle = "Step-wise pattern = tree's rigid rectangular regions",
x = "disp (cubic inches)", y = "Predicted MPG") +
theme_minimal(base_size = 12)
grid.arrange(p_cyl, p_disp, ncol = 2)

Reading the PDPs:

- cyl (cylinders): Clear step-function — 4-cylinder cars average ~26 MPG, 6-cylinder ~20 MPG, 8-cylinder ~15 MPG. The tree learned three distinct prediction regions.
- disp (displacement): As displacement increases, predicted MPG drops sharply up to ~200 cu.in., then levels off. The step-function shape is the signature of a tree model — compare to a smooth LOESS curve that a regression model would produce.

# Manual PDP for classification: average P(virginica) with one feature forced
manual_pdp_class <- function(model, data, feature, class_name, grid_n = 60) {
vals <- seq(min(data[[feature]]), max(data[[feature]]), length.out = grid_n)
yhat <- sapply(vals, function(v) {
d_temp <- data
d_temp[[feature]] <- v
mean(predict(model, d_temp, type = "prob")[, class_name])
})
data.frame(x = vals, yhat = yhat)
}
pd_petal_l <- manual_pdp_class(iris_full_tree, iris, "Petal.Length", "virginica")
pd_petal_w <- manual_pdp_class(iris_full_tree, iris, "Petal.Width", "virginica")
p1 <- ggplot(pd_petal_l, aes(x = x, y = yhat)) +
geom_step(color = "#E15759", linewidth = 1.4) +
geom_hline(yintercept = 0.5, linetype = "dashed", color = "gray50") +
labs(title = "PDP: Petal Length → P(virginica)",
subtitle = "Probability flips from ~0 to ~1 near 5 cm",
x = "Petal Length (cm)", y = "P(Species = virginica)") +
ylim(0, 1) + theme_minimal(base_size = 12)
p2 <- ggplot(pd_petal_w, aes(x = x, y = yhat)) +
geom_step(color = "#4E79A7", linewidth = 1.4) +
geom_hline(yintercept = 0.5, linetype = "dashed", color = "gray50") +
labs(title = "PDP: Petal Width → P(virginica)",
subtitle = "Probability rises sharply above 1.75 cm",
x = "Petal Width (cm)", y = "P(Species = virginica)") +
ylim(0, 1) + theme_minimal(base_size = 12)
grid.arrange(p1, p2, ncol = 2)

PDPs can also show the joint effect of two features simultaneously — useful for detecting interaction effects:
# Manual 2D PDP: grid over both features, average P(virginica) across all obs
pl_grid <- seq(min(iris$Petal.Length), max(iris$Petal.Length), length.out = 40)
pw_grid <- seq(min(iris$Petal.Width), max(iris$Petal.Width), length.out = 40)
grid_2d <- expand.grid(Petal.Length = pl_grid, Petal.Width = pw_grid)
grid_2d$yhat <- apply(grid_2d, 1, function(row) {
d_temp <- iris
d_temp$Petal.Length <- row["Petal.Length"]
d_temp$Petal.Width <- row["Petal.Width"]
mean(predict(iris_full_tree, d_temp, type = "prob")[, "virginica"])
})
ggplot(grid_2d, aes(x = Petal.Length, y = Petal.Width, fill = yhat)) +
geom_tile() +
scale_fill_gradient2(low = "steelblue", mid = "white", high = "tomato",
midpoint = 0.5, name = "P(virginica)") +
labs(title = "2D Partial Dependence: Petal Length × Petal Width",
subtitle = "Joint marginal effect on P(Species = virginica)",
x = "Petal Length (cm)", y = "Petal Width (cm)") +
theme_minimal(base_size = 13)

Key limitation highlighted by Boehmke & Greenwell [2]: Decision trees produce rigid, non-smooth prediction surfaces — the step-function nature is clearly visible in the 2D PDP as hard rectangular boundaries. MARS and neural networks produce smoother surfaces. This is a fundamental characteristic of tree-based models.
| Tool | Answers | Shows Direction? | Shows Shape? |
|---|---|---|---|
| Feature Importance | Which features matter? | No | No |
| PDP (1D) | How does feature X affect predictions? | Yes | Yes |
| PDP (2D) | How do X and Y interact? | Yes | Yes |
Always use both together: importance tells you which features to investigate; PDPs tell you how they behave. [2]
Test your understanding with these questions from the textbook:
1. What is the approximate depth of a Decision Tree trained without restrictions on a training set with 1 million instances?
Hint: Trees are approximately balanced, so depth ≈ log₂(1,000,000) ≈ ?
2. Is a node’s Gini impurity generally lower or greater than its parent’s Gini impurity? Is it always the case, or just generally?
3. Your Decision Tree is overfitting the training
set. Should you try increasing or
decreasing max_depth?
4. Your Decision Tree is underfitting. Would scaling the input features help? Why or why not?
5. If training takes 1 hour on 1 million instances, roughly how long would it take on 10 million instances?
Hint: Training complexity is O(n × m log₂ m). How does this scale?
## Answers (unhide to reveal):
## 1. ~20 levels (log2(1,000,000) ≈ 20)
## 2. Generally LOWER — CART minimizes the weighted average of the children's impurities.
##    A single child CAN have higher impurity than its parent, as long as the other child
##    compensates so that the weighted average still decreases.
## 3. DECREASE max_depth — a shallower tree has less flexibility to overfit.
## 4. NO — Decision Trees do not use distances or gradients, so feature scale is irrelevant.
## 5. ~11.7 hours — 10x more data → 10 * log2(10M)/log2(1M) ≈ 11.7x longer.
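The arithmetic behind answers 1 and 5 can be checked directly in R (a quick sanity-check sketch, not part of the textbook):

```r
# Answer 1: depth of a roughly balanced binary tree on 1 million instances
depth_1m <- log2(1e6)
print(depth_1m)   # ~19.93, i.e. about 20 levels

# Answer 5: scaling factor for O(n * m * log2(m)) training cost
# when m grows from 1e6 to 1e7 instances (number of features n fixed)
scale_factor <- (1e7 * log2(1e7)) / (1e6 * log2(1e6))
print(scale_factor)   # ~11.67, so roughly 11.7 hours
```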
6. You build a decision tree on a dataset and notice
that Age is used in 5 different splits while
Income is used in only 1. Which feature will have higher
importance, and why?
7. A PDP for a regression tree on feature
x shows a flat line at ŷ = 42 from x = 10 to x = 50, then a
sharp jump to ŷ = 65 at x = 51, then flat again. What does this tell you
about the tree’s split structure?
The following references were used in the development of this tutorial. Citations are formatted in IEEE style.
[1] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd ed. Sebastopol, CA, USA: O’Reilly Media, 2019.
[2] B. C. Boehmke and B. M. Greenwell, Hands-On Machine Learning with R. Boca Raton, FL, USA: CRC Press, 2020. [Online]. Available: https://bradleyboehmke.github.io/HOML/DT.html
[3] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Boca Raton, FL, USA: Chapman & Hall/CRC, 1984.
[4] T. M. Therneau and E. J. Atkinson, “An Introduction to Recursive Partitioning Using the RPART Routines,” Mayo Foundation, Rochester, MN, USA, Tech. Rep., 1997.
[5] B. M. Greenwell, “pdp: An R Package for Constructing Partial Dependence Plots,” The R Journal, vol. 9, no. 1, pp. 421–436, 2017. (Partial dependence methodology implemented manually in this tutorial via grid prediction.)
[6] B. M. Greenwell and B. C. Boehmke, “Variable
Importance Plots — An Introduction to the vip Package,” The R
Journal, vol. 12, no. 1, pp. 343–366, 2020. (Feature importance
extracted directly from rpart’s
$variable.importance field.)
[7] R. A. Fisher, “The Use of Multiple Measurements in Taxonomic Problems,” Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.