Pseudocode Algorithm Collection
Last Updated: December 09, 2025
This pseudocode implementation references concepts from:
D:\Github\pseudocode_algorithm\papers\randomforest2001.pdf
Note: This pseudocode is an original educational implementation based on Leo Breiman's seminal 2001 paper introducing Random Forests and general algorithmic principles from computer science literature.
Random Forest (Breiman, 2001) is an ensemble learning method that constructs multiple decision trees during training and outputs the majority vote of the individual trees for classification, or their average prediction for regression.
The algorithm was introduced by Leo Breiman in his landmark 2001 paper published in Machine Learning journal, revolutionizing ensemble methods in machine learning.
Random Forest relies on three fundamental concepts from Breiman (2001):
- Bootstrap aggregating (bagging): each tree trains on a sample of the data drawn with replacement.
- Random feature selection: only a random subset of k features is considered at each split, decorrelating the trees.
- Ensemble aggregation: tree outputs are combined by majority vote (classification) or averaging (regression).
These innovations reduce variance without increasing bias, creating a powerful ensemble method.
INPUT:
D: training dataset with n samples, m features
T: number of trees to build
k: number of features to consider at each split
max_depth: maximum depth of each tree
min_samples: minimum samples required to split
OUTPUT:
Forest: collection of T trained decision trees
PROCEDURE RandomForestTrain(D, T, k, max_depth, min_samples):
Forest ← empty list
FOR i = 1 TO T DO
// Bootstrap sampling: sample n instances with replacement
D_bootstrap ← BootstrapSample(D, n)
// Build decision tree with random feature selection
tree ← BuildDecisionTree(D_bootstrap, k, max_depth, min_samples)
// Add tree to forest
Forest.append(tree)
END FOR
RETURN Forest
END PROCEDURE
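The training procedure above maps directly onto a few lines of Python. The sketch below is illustrative only: `build_tree` stands in for the BuildDecisionTree procedure and is supplied by the caller, so any tree learner with that signature can be plugged in.

```python
import random

def random_forest_train(data, n_trees, k, max_depth, min_samples, build_tree):
    """Train n_trees decision trees, each on its own bootstrap sample.

    build_tree is a caller-supplied stand-in for BuildDecisionTree
    with the same signature as in the pseudocode.
    """
    forest = []
    for _ in range(n_trees):
        # Bootstrap sampling: draw len(data) rows with replacement
        sample = random.choices(data, k=len(data))
        forest.append(build_tree(sample, k, max_depth, min_samples))
    return forest
```

Keeping the tree learner as a parameter makes the ensemble logic independent of how each tree is built, which mirrors the separation of procedures in the pseudocode.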
The decision tree building process is recursive:
PROCEDURE BuildDecisionTree(D, k, max_depth, min_samples):
RETURN BuildNode(D, 0, k, max_depth, min_samples)
END PROCEDURE
PROCEDURE BuildNode(D, depth, k, max_depth, min_samples):
// Create a new node
node ← new TreeNode()
// Base cases - create leaf node
IF depth ≥ max_depth OR |D| < min_samples OR AllSameLabel(D) THEN
node.is_leaf ← TRUE
node.prediction ← MajorityClass(D) // for classification
RETURN node
END IF
// Randomly select k features from m total features
features_subset ← RandomlySelectKFeatures(D, k)
// Find best split among the k features
best_feature, best_threshold ← FindBestSplit(D, features_subset)
// If no good split found, make leaf
IF best_feature = NULL THEN
node.is_leaf ← TRUE
node.prediction ← MajorityClass(D)
RETURN node
END IF
// Set node split criteria
node.is_leaf ← FALSE
node.feature ← best_feature
node.threshold ← best_threshold
// Split data
D_left ← {samples in D where feature ≤ threshold}
D_right ← {samples in D where feature > threshold}
// Recursively build child nodes
node.left ← BuildNode(D_left, depth + 1, k, max_depth, min_samples)
node.right ← BuildNode(D_right, depth + 1, k, max_depth, min_samples)
RETURN node
END PROCEDURE
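BuildNode can be sketched in Python with dict-based nodes. This is a sketch, not a full implementation: the `splitter` argument stands in for FindBestSplit (defined later in the pseudocode) and is passed in so the snippet runs on its own; a degenerate split that leaves one side empty is turned into a leaf.

```python
from collections import Counter

def majority_class(labels):
    """Most frequent label among the samples reaching this node."""
    return Counter(labels).most_common(1)[0][0]

def build_node(rows, labels, depth, k, max_depth, min_samples, splitter):
    # Base cases: depth limit reached, too few samples, or a pure node
    if depth >= max_depth or len(rows) < min_samples or len(set(labels)) == 1:
        return {"leaf": True, "prediction": majority_class(labels)}
    feature, threshold = splitter(rows, labels, k)
    if feature is None:  # no useful split found
        return {"leaf": True, "prediction": majority_class(labels)}
    left_idx = [i for i, r in enumerate(rows) if r[feature] <= threshold]
    right_idx = [i for i, r in enumerate(rows) if r[feature] > threshold]
    if not left_idx or not right_idx:  # degenerate split, stop here
        return {"leaf": True, "prediction": majority_class(labels)}
    return {
        "leaf": False, "feature": feature, "threshold": threshold,
        "left": build_node([rows[i] for i in left_idx], [labels[i] for i in left_idx],
                           depth + 1, k, max_depth, min_samples, splitter),
        "right": build_node([rows[i] for i in right_idx], [labels[i] for i in right_idx],
                            depth + 1, k, max_depth, min_samples, splitter),
    }
```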
PROCEDURE RandomForestPredict(Forest, x):
predictions ← empty list
// Get prediction from each tree
FOR EACH tree IN Forest DO
prediction ← PredictTree(tree.root, x)
predictions.append(prediction)
END FOR
// For classification: return majority vote
final_prediction ← MajorityVote(predictions)
// For regression: return mean
// final_prediction ← Mean(predictions)
RETURN final_prediction
END PROCEDURE
PROCEDURE PredictTree(node, x):
IF node.is_leaf THEN
RETURN node.prediction
END IF
// Navigate tree based on feature value
IF x[node.feature] ≤ node.threshold THEN
RETURN PredictTree(node.left, x)
ELSE
RETURN PredictTree(node.right, x)
END IF
END PROCEDURE
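The two prediction procedures are a short Python sketch, assuming each tree node is a dict with keys `leaf`, `prediction`, `feature`, `threshold`, `left`, and `right` (an illustrative layout, not a fixed format):

```python
from collections import Counter

def predict_tree(node, x):
    """Follow the split tests from the root down to a leaf."""
    while not node["leaf"]:
        if x[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["prediction"]

def random_forest_predict(forest, x):
    """Classification: majority vote over the per-tree predictions."""
    votes = [predict_tree(tree, x) for tree in forest]
    return Counter(votes).most_common(1)[0][0]
```

For regression, the last line would return the mean of `votes` instead of the majority vote, matching the commented-out line in the pseudocode.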
Bootstrap sampling creates a training set by sampling with replacement:
PROCEDURE BootstrapSample(D, n):
D_bootstrap ← empty dataset
FOR i = 1 TO n DO
// Sample with replacement
random_index ← RandomInt(0, n-1)
D_bootstrap.append(D[random_index])
END FOR
RETURN D_bootstrap
END PROCEDURE
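In Python, sampling with replacement is one line. The snippet also checks a well-known consequence: a bootstrap sample contains only about 63.2% of the distinct rows, because the probability a given row is never drawn is (1 - 1/n)^n ≈ e^(-1) ≈ 0.368; the left-out rows form the out-of-bag set used later for validation.

```python
import random

def bootstrap_sample(data):
    """Draw len(data) rows uniformly with replacement."""
    return [random.choice(data) for _ in data]

random.seed(0)  # fixed seed so the fraction below is reproducible
data = list(range(10_000))
sample = bootstrap_sample(data)
unique_fraction = len(set(sample)) / len(data)
print(round(unique_fraction, 3))  # close to 1 - 1/e ≈ 0.632
```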
PROCEDURE RandomlySelectKFeatures(D, k):
all_features ← GetAllFeatures(D)
m ← |all_features|
selected_features ← RandomSample(all_features, min(k, m))
RETURN selected_features
END PROCEDURE
PROCEDURE FindBestSplit(D, features_subset):
best_gain ← -∞
best_feature ← NULL
best_threshold ← NULL
FOR EACH feature IN features_subset DO
// Try different threshold values
thresholds ← GetUniqueValues(D, feature)
FOR EACH threshold IN thresholds DO
D_left ← {samples where feature ≤ threshold}
D_right ← {samples where feature > threshold}
// Calculate information gain (or Gini impurity reduction)
gain ← CalculateInformationGain(D, D_left, D_right)
IF gain > best_gain THEN
best_gain ← gain
best_feature ← feature
best_threshold ← threshold
END IF
END FOR
END FOR
RETURN best_feature, best_threshold
END PROCEDURE
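FindBestSplit is a double loop over candidate features and thresholds. In the sketch below, Gini impurity (defined in the next section of the pseudocode) is inlined so the snippet runs on its own, and splits that leave one side empty are skipped; a strictly positive gain is required, so a pure node returns (None, None).

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def find_best_split(rows, labels, features):
    best_feature, best_threshold = None, None
    best_gain = 0.0  # demand a strictly positive impurity reduction
    parent_gini = gini(labels)
    n = len(labels)
    for feature in features:
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:  # degenerate split, skip
                continue
            children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
            if parent_gini - children > best_gain:
                best_gain = parent_gini - children
                best_feature, best_threshold = feature, threshold
    return best_feature, best_threshold
```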
PROCEDURE CalculateInformationGain(D, D_left, D_right):
n ← |D|
n_left ← |D_left|
n_right ← |D_right|
// Calculate Gini impurity
gini_parent ← GiniImpurity(D)
gini_children ← (n_left/n) × GiniImpurity(D_left) +
(n_right/n) × GiniImpurity(D_right)
gain ← gini_parent - gini_children
RETURN gain
END PROCEDURE
PROCEDURE GiniImpurity(D):
class_counts ← CountClasses(D)
n ← |D|
gini ← 1.0
FOR EACH class_count IN class_counts DO
probability ← class_count / n
gini ← gini - probability²
END FOR
RETURN gini
END PROCEDURE
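A quick worked check of GiniImpurity in Python: a pure node scores 0, and a 50/50 two-class node scores the two-class maximum of 0.5.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum over classes of the squared class probability."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["a", "a", "a", "a"]))  # 0.0: pure node
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5: maximally mixed, 2 classes
```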
PROCEDURE MajorityVote(predictions):
class_counts ← CountOccurrences(predictions)
majority_class ← ClassWithMaxCount(class_counts)
RETURN majority_class
END PROCEDURE
Training Complexity:
\[O(T \times n \times \log(n) \times k \times d)\]
Where:
- T: number of trees
- n: number of training samples
- k: number of features considered at each split
- d: maximum depth of a tree
Prediction Complexity:
\[O(T \times d)\]
Where:
- T: number of trees
- d: maximum tree depth (each query follows one root-to-leaf path per tree)
Space Complexity: \(O(T \times d)\)
Where:
- T: number of trees
- d: maximum tree depth, taken here as a proxy for per-tree size
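Plugging illustrative numbers into the bounds makes the scale concrete. The values below are chosen arbitrarily for the example, and big-O hides constant factors, so these are relative operation counts, not runtimes:

```python
import math

T, n, k, d = 100, 10_000, 4, 10  # trees, samples, features per split, depth

train_ops = T * n * math.log2(n) * k * d  # O(T*n*log n*k*d): about 5.3e8
predict_ops = T * d                       # O(T*d): 1,000 node tests per query
print(f"{train_ops:.2e}", predict_ops)
```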
Random Forest offers several key advantages:
- High accuracy with low overfitting risk, since averaging many decorrelated trees reduces variance.
- Handles high-dimensional data without extensive preprocessing or feature scaling.
- Provides built-in estimates of feature importance and of generalization error (out-of-bag error).
Some limitations to consider:
- Far less interpretable than a single decision tree.
- Slower to train and predict than a single tree, and the stored forest can be large in memory.
- Can be outperformed by gradient boosting on some tasks.
# Load required libraries
library(randomForest)
library(caret)
# Load iris dataset
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Set seed for reproducibility
set.seed(123)
# Split data
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]
# Train Random Forest
rf_model <- randomForest(
Species ~ .,
data = trainData,
ntree = 100, # Number of trees
mtry = 2, # Features to consider at each split
importance = TRUE
)
print(rf_model)
##
## Call:
## randomForest(formula = Species ~ ., data = trainData, ntree = 100, mtry = 2, importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 3.81%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 35 0 0 0.00000000
## versicolor 0 34 1 0.02857143
## virginica 0 3 32 0.08571429
# Make predictions
predictions <- predict(rf_model, testData)
# Confusion Matrix
confusionMatrix(predictions, testData$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 2
## virginica 0 1 13
##
## Overall Statistics
##
## Accuracy : 0.9333
## 95% CI : (0.8173, 0.986)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9333 0.8667
## Specificity 1.0000 0.9333 0.9667
## Pos Pred Value 1.0000 0.8750 0.9286
## Neg Pred Value 1.0000 0.9655 0.9355
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3111 0.2889
## Detection Prevalence 0.3333 0.3556 0.3111
## Balanced Accuracy 1.0000 0.9333 0.9167
# Plot error vs number of trees
plot(rf_model, main = "Error Rate vs Number of Trees")
legend("topright", colnames(rf_model$err.rate),
col = 1:4, cex = 0.8, fill = 1:4)

Key hyperparameters to tune:
- ntree: number of trees; the error estimate stabilizes as trees are added (see the plot above).
- mtry: number of features tried at each split (default sqrt(m) for classification).
- nodesize: minimum size of terminal nodes, which indirectly controls tree depth.
Random Forest has a built-in validation mechanism: the out-of-bag (OOB) error. Each tree is trained on a bootstrap sample that leaves out roughly one third of the data, and those left-out samples are used to estimate generalization error without a separate validation set. This is the "OOB estimate of error rate" reported in the model output above.
Best suited for:
- Tabular data with many, possibly noisy or correlated, features.
- Problems where predictive accuracy matters more than interpretability.
Not ideal for:
- Settings that require a simple, explainable model.
- Very low-latency prediction, since every query must traverse all T trees.
Random Forest is widely used in domains such as bioinformatics, finance (e.g., credit risk scoring), remote sensing, and medical diagnosis.
Several variants exist, including Extremely Randomized Trees (Extra Trees), which also randomize the split thresholds, and Isolation Forest, which adapts the idea to anomaly detection.
| Algorithm | Interpretability | Speed | Accuracy | Overfitting Risk |
|---|---|---|---|---|
| Single Tree | High | Fast | Low-Medium | High |
| Random Forest | Low | Moderate | High | Low |
| Gradient Boosting | Low | Slow | Very High | Medium |
| Neural Network | Very Low | Slow | Very High | Medium-High |
In this presentation, we covered:
- The core ideas behind Random Forest: bagging, random feature selection, and ensemble voting.
- Complete pseudocode for training and prediction.
- Complexity analysis of training, prediction, and storage.
- A worked R implementation on the iris dataset, with OOB and test-set evaluation.
- A comparison with single trees, gradient boosting, and neural networks.
References:
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
Thank you for your attention!
For questions or discussions about this pseudocode implementation, please refer to the original text file or the referenced materials.