Random Forest Algorithm - Pseudocode

Pseudocode Algorithm Collection

Last Updated: December 09, 2025

Reference Material

Note: This pseudocode is an original educational implementation based on Leo Breiman's seminal 2001 paper introducing Random Forests, along with general algorithmic principles from the computer science literature.

Random Forest Algorithm - Overview

Random Forest (Breiman, 2001) is an ensemble learning method that constructs multiple decision trees during training and outputs the majority vote of the individual trees (classification) or the mean of their predictions (regression).

The algorithm was introduced by Leo Breiman in his landmark 2001 paper published in Machine Learning journal, revolutionizing ensemble methods in machine learning.

Key Concepts

Random Forest relies on three fundamental concepts from Breiman (2001):

  1. Bootstrap Aggregating (Bagging) - Introduced by Breiman (1996), each tree trains on a bootstrap sample drawn with replacement
  2. Random Feature Selection - At each node split, only a random subset of features is considered, introducing diversity
  3. Ensemble Voting - Final prediction aggregates all tree predictions through majority vote (classification) or averaging (regression)

These innovations reduce variance without increasing bias, creating a powerful ensemble method.

Algorithm: Random Forest Training

INPUT:

  - D: training dataset of n labeled samples with m features
  - T: number of trees in the forest
  - k: number of features considered at each split (commonly √m for classification)
  - max_depth: maximum depth of each tree
  - min_samples: minimum number of samples required to split a node

OUTPUT:

  - Forest: an ensemble of T trained decision trees

Pseudocode: Random Forest Training

PROCEDURE RandomForestTrain(D, T, k, max_depth, min_samples):
  Forest ← empty list
  n ← |D|  // number of training instances
  
  FOR i = 1 TO T DO
    // Bootstrap sampling: sample n instances with replacement
    D_bootstrap ← BootstrapSample(D, n)
    
    // Build decision tree with random feature selection
    tree ← BuildDecisionTree(D_bootstrap, k, max_depth, min_samples)
    
    // Add tree to forest
    Forest.append(tree)
  END FOR
  
  RETURN Forest
END PROCEDURE
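As a quick illustration, the training loop above can be sketched in Python. The tree builder here is deliberately a stub (a majority-class stump) so that the bagging structure stays visible; `bootstrap_sample`, `train_stump`, and `random_forest_train` are illustrative names, not part of any library.

```python
import random

def bootstrap_sample(data, n):
    # Sample n instances with replacement
    return [random.choice(data) for _ in range(n)]

def train_stump(sample):
    # Stand-in for BuildDecisionTree: just remember the majority class
    labels = [y for _, y in sample]
    return max(set(labels), key=labels.count)

def random_forest_train(data, T):
    forest = []
    for _ in range(T):
        sample = bootstrap_sample(data, len(data))
        forest.append(train_stump(sample))
    return forest

random.seed(0)
data = [([0.1], "a"), ([0.2], "a"), ([0.9], "b")]
forest = random_forest_train(data, T=5)
print(len(forest))  # 5
```

A real implementation would replace `train_stump` with a recursive tree builder like the one in the next section.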

Algorithm: Build Decision Tree

The decision tree building process is recursive:

PROCEDURE BuildDecisionTree(D, k, max_depth, min_samples):
  RETURN BuildNode(D, 0, k, max_depth, min_samples)
END PROCEDURE

Pseudocode: BuildNode (Part 1)

PROCEDURE BuildNode(D, depth, k, max_depth, min_samples):
  // Create a new node
  node ← new TreeNode()
  
  // Base cases - create leaf node
  IF depth ≥ max_depth OR |D| < min_samples OR AllSameLabel(D) THEN
    node.is_leaf ← TRUE
    node.prediction ← MajorityClass(D)  // for classification
    RETURN node
  END IF
  
  // Randomly select k features from m total features
  features_subset ← RandomlySelectKFeatures(D, k)
  
  // Find best split among the k features
  best_feature, best_threshold ← FindBestSplit(D, features_subset)

Pseudocode: BuildNode (Part 2)

  // If no good split found, make leaf
  IF best_feature = NULL THEN
    node.is_leaf ← TRUE
    node.prediction ← MajorityClass(D)
    RETURN node
  END IF
  
  // Set node split criteria
  node.is_leaf ← FALSE
  node.feature ← best_feature
  node.threshold ← best_threshold
  
  // Split data
  D_left ← {samples in D where feature ≤ threshold}
  D_right ← {samples in D where feature > threshold}
  
  // Recursively build child nodes
  node.left ← BuildNode(D_left, depth + 1, k, max_depth, min_samples)
  node.right ← BuildNode(D_right, depth + 1, k, max_depth, min_samples)
  
  RETURN node
END PROCEDURE
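The recursion in BuildNode can be sketched compactly in Python using dicts as nodes. To keep the base cases and recursion in focus, the split search below is stubbed to a fixed midpoint threshold on feature 0; a faithful version would call a FindBestSplit equivalent (covered later). All names here are illustrative.

```python
def majority_class(labels):
    return max(set(labels), key=labels.count)

def build_node(samples, depth, max_depth, min_samples):
    labels = [y for _, y in samples]
    # Base cases: depth limit, too few samples, or pure node
    if depth >= max_depth or len(samples) < min_samples or len(set(labels)) == 1:
        return {"leaf": True, "prediction": majority_class(labels)}
    values = [x[0] for x, _ in samples]
    threshold = (min(values) + max(values)) / 2  # stand-in for FindBestSplit
    left = [(x, y) for x, y in samples if x[0] <= threshold]
    right = [(x, y) for x, y in samples if x[0] > threshold]
    if not left or not right:  # no usable split: make a leaf
        return {"leaf": True, "prediction": majority_class(labels)}
    return {"leaf": False, "feature": 0, "threshold": threshold,
            "left": build_node(left, depth + 1, max_depth, min_samples),
            "right": build_node(right, depth + 1, max_depth, min_samples)}

samples = [([1.0], "a"), ([2.0], "a"), ([8.0], "b")]
tree = build_node(samples, depth=0, max_depth=3, min_samples=2)
print(tree["leaf"])  # False: the root splits at threshold 4.5
```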

Algorithm: Random Forest Prediction

PROCEDURE RandomForestPredict(Forest, x):
  predictions ← empty list
  
  // Get prediction from each tree
  FOR EACH tree IN Forest DO
    prediction ← PredictTree(tree.root, x)
    predictions.append(prediction)
  END FOR
  
  // For classification: return majority vote
  final_prediction ← MajorityVote(predictions)
  
  // For regression: return mean
  // final_prediction ← Mean(predictions)
  
  RETURN final_prediction
END PROCEDURE

Pseudocode: Tree Prediction

PROCEDURE PredictTree(node, x):
  IF node.is_leaf THEN
    RETURN node.prediction
  END IF
  
  // Navigate tree based on feature value
  IF x[node.feature] ≤ node.threshold THEN
    RETURN PredictTree(node.left, x)
  ELSE
    RETURN PredictTree(node.right, x)
  END IF
END PROCEDURE
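PredictTree translates almost line for line into Python. The `TreeNode` class and the hand-built depth-1 tree below are illustrative assumptions, mirroring the node fields used in the pseudocode.

```python
class TreeNode:
    def __init__(self, is_leaf=False, prediction=None,
                 feature=None, threshold=None, left=None, right=None):
        self.is_leaf, self.prediction = is_leaf, prediction
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right

def predict_tree(node, x):
    # Recurse left on feature <= threshold, right otherwise
    if node.is_leaf:
        return node.prediction
    if x[node.feature] <= node.threshold:
        return predict_tree(node.left, x)
    return predict_tree(node.right, x)

# A depth-1 tree splitting on feature 0 at threshold 0.5
root = TreeNode(feature=0, threshold=0.5,
                left=TreeNode(is_leaf=True, prediction="a"),
                right=TreeNode(is_leaf=True, prediction="b"))
print(predict_tree(root, [0.3]))  # a
print(predict_tree(root, [0.7]))  # b
```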

Helper Procedure: Bootstrap Sampling

Bootstrap sampling creates a training set by sampling with replacement:

PROCEDURE BootstrapSample(D, n):
  D_bootstrap ← empty dataset
  FOR i = 1 TO n DO
    // Sample with replacement
    random_index ← RandomInt(0, |D| - 1)
    D_bootstrap.append(D[random_index])
  END FOR
  RETURN D_bootstrap
END PROCEDURE
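A Python sketch of BootstrapSample, which also checks the classic property that a bootstrap sample contains about 63.2% (1 − 1/e) of the distinct original instances; the remainder is the "out-of-bag" set discussed later.

```python
import random

def bootstrap_sample(data, n):
    # Sample n instances with replacement, as in the pseudocode
    return [data[random.randint(0, len(data) - 1)] for _ in range(n)]

random.seed(42)
data = list(range(10_000))
sample = bootstrap_sample(data, len(data))
unique_fraction = len(set(sample)) / len(data)
print(round(unique_fraction, 3))  # close to 0.632
```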

Helper Procedure: Random Feature Selection

PROCEDURE RandomlySelectKFeatures(D, k):
  all_features ← GetAllFeatures(D)
  m ← |all_features|
  selected_features ← RandomSample(all_features, min(k, m))
  RETURN selected_features
END PROCEDURE
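In Python, random feature selection is a one-liner with `random.sample`, which draws without replacement; the `min(k, m)` guard prevents asking for more features than exist. The feature names below are illustrative.

```python
import random

def randomly_select_k_features(all_features, k):
    # Draw min(k, m) distinct features uniformly at random
    return random.sample(all_features, min(k, len(all_features)))

random.seed(1)
features = ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]
subset = randomly_select_k_features(features, k=2)
print(len(subset))  # 2
```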

Helper Procedure: Find Best Split (Part 1)

PROCEDURE FindBestSplit(D, features_subset):
  best_gain ← 0        // only accept splits that reduce impurity
  best_feature ← NULL
  best_threshold ← NULL
  
  FOR EACH feature IN features_subset DO
    // Try different threshold values
    thresholds ← GetUniqueValues(D, feature)
    
    FOR EACH threshold IN thresholds DO
      D_left ← {samples where feature ≤ threshold}
      D_right ← {samples where feature > threshold}
      
      // Calculate information gain (or Gini impurity reduction)
      gain ← CalculateInformationGain(D, D_left, D_right)

Helper Procedure: Find Best Split (Part 2)

      IF gain > best_gain THEN
        best_gain ← gain
        best_feature ← feature
        best_threshold ← threshold
      END IF
    END FOR
  END FOR
  
  RETURN best_feature, best_threshold
END PROCEDURE
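The exhaustive split search can be sketched as follows, using Gini gain as the criterion. Samples are assumed to be (feature_vector, label) pairs, consistent with the earlier sketches; all function names are illustrative.

```python
def gini(labels):
    # Gini impurity of a list of class labels
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def find_best_split(samples, feature_indices):
    labels = [y for _, y in samples]
    best_gain, best_feature, best_threshold = 0.0, None, None
    for f in feature_indices:
        # Try every unique value of the feature as a threshold
        for t in sorted({x[f] for x, _ in samples}):
            left = [y for x, y in samples if x[f] <= t]
            right = [y for x, y in samples if x[f] > t]
            if not left or not right:
                continue  # degenerate split
            children = (len(left) / len(labels)) * gini(left) \
                     + (len(right) / len(labels)) * gini(right)
            gain = gini(labels) - children
            if gain > best_gain:
                best_gain, best_feature, best_threshold = gain, f, t
    return best_feature, best_threshold

samples = [([1.0], "a"), ([2.0], "a"), ([8.0], "b"), ([9.0], "b")]
print(find_best_split(samples, [0]))  # (0, 2.0): a perfect split
```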

Helper Procedure: Calculate Information Gain

PROCEDURE CalculateInformationGain(D, D_left, D_right):
  n ← |D|
  n_left ← |D_left|
  n_right ← |D_right|
  
  // Calculate Gini impurity
  gini_parent ← GiniImpurity(D)
  gini_children ← (n_left/n) × GiniImpurity(D_left) + 
                   (n_right/n) × GiniImpurity(D_right)
  
  gain ← gini_parent - gini_children
  RETURN gain
END PROCEDURE

Helper Procedure: Gini Impurity

PROCEDURE GiniImpurity(D):
  class_counts ← CountClasses(D)
  n ← |D|
  gini ← 1.0
  
  FOR EACH class_count IN class_counts DO
    probability ← class_count / n
    gini ← gini - probability²
  END FOR
  
  RETURN gini
END PROCEDURE
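A direct Python transcription of the GiniImpurity procedure, using `collections.Counter` for the class counts. For a 50/50 two-class node the impurity is 0.5; for a pure node it is 0.0.

```python
from collections import Counter

def gini_impurity(labels):
    n = len(labels)
    gini = 1.0
    for count in Counter(labels).values():
        gini -= (count / n) ** 2  # subtract squared class probability
    return gini

print(gini_impurity(["a", "a", "b", "b"]))  # 0.5
print(gini_impurity(["a", "a", "a"]))       # 0.0
```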

Helper Procedure: Majority Vote

PROCEDURE MajorityVote(predictions):
  class_counts ← CountOccurrences(predictions)
  majority_class ← ClassWithMaxCount(class_counts)
  RETURN majority_class
END PROCEDURE
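In Python, MajorityVote reduces to `Counter.most_common` (ties are resolved arbitrarily, which matches the pseudocode's unspecified tie-breaking):

```python
from collections import Counter

def majority_vote(predictions):
    # Return the most frequent predicted class
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(["setosa", "virginica", "setosa"]))  # setosa
```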

Complexity Analysis: Time Complexity

Training Complexity:

\[O(T \times n \times \log(n) \times k \times d)\]

Where:

  - T = number of trees
  - n = number of training samples
  - k = number of features considered at each split
  - d = maximum tree depth

Prediction Complexity:

\[O(T \times d)\]

Where:

  - T = number of trees
  - d = maximum tree depth (nodes visited per tree)

Complexity Analysis: Space Complexity

Space Complexity: \(O(T \times d)\)

Where:

  - T = number of trees
  - d = maximum tree depth (a proxy for the number of stored nodes per tree)

Advantages of Random Forest

Random Forest offers several key advantages:

  - Reduces overfitting relative to a single decision tree (low variance)
  - Handles both classification and regression tasks
  - Works with numerical and categorical features with little preprocessing
  - Provides built-in feature-importance measures
  - Offers out-of-bag (OOB) error as a free estimate of generalization performance
  - Trees are independent, so training parallelizes easily

Disadvantages of Random Forest

Some limitations to consider:

  - Less interpretable than a single decision tree
  - Slower prediction than a single tree, since all T trees must be evaluated
  - Larger memory footprint, since the whole forest is stored
  - Cannot extrapolate beyond the range of the training targets in regression

Example in R: Loading Data

# Load required libraries
library(randomForest)
library(caret)

# Load iris dataset
data(iris)
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Example in R: Training Random Forest

# Set seed for reproducibility
set.seed(123)

# Split data
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]

# Train Random Forest
rf_model <- randomForest(
  Species ~ ., 
  data = trainData,
  ntree = 100,        # Number of trees
  mtry = 2,           # Features to consider at each split
  importance = TRUE
)

print(rf_model)
## 
## Call:
##  randomForest(formula = Species ~ ., data = trainData, ntree = 100,      mtry = 2, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 3.81%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         35          0         0  0.00000000
## versicolor      0         34         1  0.02857143
## virginica       0          3        32  0.08571429

Example in R: Model Performance

# Make predictions
predictions <- predict(rf_model, testData)

# Confusion Matrix
confusionMatrix(predictions, testData$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         14         2
##   virginica       0          1        13
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9333         
##                  95% CI : (0.8173, 0.986)
##     No Information Rate : 0.3333         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9            
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9333           0.8667
## Specificity                 1.0000            0.9333           0.9667
## Pos Pred Value              1.0000            0.8750           0.9286
## Neg Pred Value              1.0000            0.9655           0.9355
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3111           0.2889
## Detection Prevalence        0.3333            0.3556           0.3111
## Balanced Accuracy           1.0000            0.9333           0.9167

Example in R: Feature Importance

# Plot variable importance
varImpPlot(rf_model, main = "Random Forest - Feature Importance")

Example in R: Visualize Trees

# Plot error vs number of trees
plot(rf_model, main = "Error Rate vs Number of Trees")
legend("topright", colnames(rf_model$err.rate), 
       col = 1:4, cex = 0.8, fill = 1:4)

Practical Considerations: Hyperparameters

Key hyperparameters to tune:

  - T (ntree in R): number of trees - more trees stabilize predictions at higher cost
  - k (mtry in R): features tried at each split - common defaults are √m for classification and m/3 for regression
  - max_depth: maximum depth of each tree
  - min_samples: minimum number of samples required to split a node

Practical Considerations: Out-of-Bag (OOB) Error

Random Forest has a built-in validation mechanism:

  - Each bootstrap sample leaves out roughly 37% (about 1/e) of the training instances
  - These out-of-bag instances serve as a validation set for the trees that never saw them
  - Aggregating OOB predictions yields a reliable estimate of generalization error without a separate hold-out set
  - The R output above reports this as the "OOB estimate of error rate"
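A quick Python sketch of how the out-of-bag set arises: the indices never drawn into a tree's bootstrap sample form a free validation set for that tree (variable names here are illustrative).

```python
import random

random.seed(7)
n = 20
# Indices drawn into one tree's bootstrap sample (with replacement)
in_bag = {random.randint(0, n - 1) for _ in range(n)}
# Everything else is out-of-bag for that tree
oob = set(range(n)) - in_bag
print(sorted(oob))  # indices this tree never saw during training
```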

Practical Considerations: When to Use

Best suited for:

  - Tabular data with mixed numerical and categorical features
  - Nonlinear relationships and feature interactions
  - Problems needing a strong baseline with little tuning

Not ideal for:

  - Applications that demand a fully interpretable model
  - Very high-dimensional sparse data (e.g., raw text), where linear models often perform better
  - Regression tasks that require extrapolation beyond the training range

Applications of Random Forest

Random Forest is widely used in:

  1. Classification: Image classification, spam detection, disease diagnosis
  2. Regression: Price prediction, demand forecasting
  3. Feature Selection: Identifying important variables
  4. Outlier Detection: Using proximity measures
  5. Missing Value Imputation: Using iterative methods
  6. Imbalanced Datasets: With class weights

Variations and Extensions

Several variants exist:

  - Extremely Randomized Trees (Extra-Trees): thresholds are drawn at random rather than optimized
  - Isolation Forest: uses random trees for anomaly detection
  - Quantile Regression Forests: estimate conditional quantiles, not just the mean
  - Rotation Forest: applies PCA-based feature rotations before building each tree

Comparison with Other Algorithms

Algorithm            Interpretability   Speed      Accuracy     Overfitting Risk
Single Tree          High               Fast       Low-Medium   High
Random Forest        Low                Moderate   High         Low
Gradient Boosting    Low                Slow       Very High    Medium
Neural Network       Very Low           Slow       Very High    Medium-High

Summary

In this presentation, we covered:

  - The core ideas behind Random Forest: bagging, random feature selection, and ensemble voting
  - Pseudocode for training, prediction, and the main helper procedures
  - Time and space complexity
  - A worked example in R on the iris dataset
  - Practical considerations: hyperparameters, OOB error, and when (not) to use the method

References

  1. Breiman, L. (2001). “Random Forests.” Machine Learning, 45(1), 5-32.
  2. Heineman, G.T., Pollice, G., & Selkow, S. (2009). “Algorithms in a Nutshell.” O’Reilly Media, Inc.
  3. Hastie, T., Tibshirani, R., & Friedman, J. (2009). “The Elements of Statistical Learning.” Springer.
  4. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). “An Introduction to Statistical Learning.” Springer.
  5. Liaw, A., & Wiener, M. (2002). “Classification and Regression by randomForest.” R News, 2(3), 18-22.

Questions?

Thank you for your attention!

For questions or discussions about this pseudocode implementation, please refer to the original text file or the referenced materials.