Pseudocode Algorithm Examples
December 9, 2025
This presentation is based on:
Chen, T., & Guestrin, C. (2016). “XGBoost: A Scalable Tree
Boosting System.” In Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD ’16), 785-794.
Heineman, G.T., Pollice, G., & Selkow, S. (2009). “Algorithms in a Nutshell.” O’Reilly Media, Inc.
The pseudocode implementations presented are original educational materials inspired by these foundational works.
XGBoost (Extreme Gradient Boosting) is an optimized distributed gradient boosting library designed for efficiency, flexibility, and portability. It was introduced by Tianqi Chen and Carlos Guestrin in their award-winning 2016 KDD paper, becoming one of the most influential machine learning algorithms in data science competitions and industry applications.
The library implements machine learning algorithms under the gradient boosting framework, providing parallel tree boosting that solves many data science problems quickly and accurately.
XGBoost relies on several fundamental innovations from Chen & Guestrin (2016):
These innovations create a highly efficient and accurate gradient boosting system.
XGBoost consists of three main components:
INPUT:
D: training dataset {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}
L: loss function (squared error, log loss, etc.)
T: number of boosting rounds (trees)
η: learning rate (shrinkage parameter)
λ: L2 regularization term
γ: minimum loss reduction for split
max_depth: maximum tree depth
OUTPUT:
Model: ensemble of T trees with predictions
PROCEDURE XGBoostTrain(D, L, T, η, λ, γ, max_depth):
// Initialize predictions with constant value
F₀(x) ← argmin_c ∑ᵢ L(yᵢ, c)
// Initialize prediction for each sample
FOR i = 1 TO n DO
pred[i] ← F₀(xᵢ)
END FOR
trees ← empty list
The initialization step finds the optimal constant prediction that minimizes the loss function.
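For common losses this argmin has a closed form: the mean for squared error and the log-odds of the base rate for logistic loss. A minimal sketch (the `initial_prediction` helper name is illustrative, not from the paper):

```python
import math

def initial_prediction(y, loss="squared_error"):
    """Optimal constant F0 = argmin_c sum_i L(y_i, c) for two common losses."""
    if loss == "squared_error":
        return sum(y) / len(y)           # the mean minimizes the sum of squares
    if loss == "logistic":
        p = sum(y) / len(y)              # base rate of the positive class
        return math.log(p / (1 - p))     # log-odds minimizes the log loss
    raise ValueError(f"unsupported loss: {loss}")

print(initial_prediction([1.0, 2.0, 6.0]))           # 3.0
print(initial_prediction([1, 1, 0, 0], "logistic"))  # 0.0
```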
// Build T trees sequentially
FOR t = 1 TO T DO
// Compute first and second order gradients
g ← empty array[n]
h ← empty array[n]
FOR i = 1 TO n DO
// First-order gradient
g[i] ← ∂L(yᵢ, pred[i])/∂pred[i]
// Second order gradient (Hessian)
h[i] ← ∂²L(yᵢ, pred[i])/∂pred[i]²
END FOR
Using both first- and second-order derivatives gives a Newton-style step, which optimizes the objective more accurately than gradient-only boosting.
// Build tree using gradient statistics
tree_t ← BuildTree(D, g, h, max_depth, λ, γ)
// Update predictions with learning rate
FOR i = 1 TO n DO
pred[i] ← pred[i] + η × tree_t.Predict(xᵢ)
END FOR
trees.append(tree_t)
END FOR
RETURN trees
END PROCEDURE
The learning rate η controls the contribution of each tree, preventing overfitting.
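To see the boosting loop and shrinkage update in isolation, here is a deliberately degenerate sketch: squared-error loss with depth-0 "trees", so each round fits a single leaf weight −G/(H+λ). The `boost_constant` name and toy data are illustrative:

```python
def boost_constant(y, T=50, eta=0.3, lam=1.0):
    """Boosting reduced to its skeleton: each round fits one leaf weight."""
    n = len(y)
    pred = [0.0] * n                               # F0 = 0 for simplicity
    for _ in range(T):
        g = [p - yi for p, yi in zip(pred, y)]     # first-order gradients
        h = [1.0] * n                              # Hessians are 1 for squared error
        w = -sum(g) / (sum(h) + lam)               # leaf weight -G/(H+λ)
        pred = [p + eta * w for p in pred]         # shrinkage update
    return pred

# Predictions approach the mean of y (2.0) as rounds accumulate
print(boost_constant([1.0, 2.0, 3.0])[0])
```

Because the fixed point of the update requires G = 0, the predictions converge to the loss-minimizing constant; the learning rate η only controls how fast.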
The tree building procedure uses gradient statistics to find optimal splits:
PROCEDURE BuildTree(D, g, h, max_depth, λ, γ):
root ← BuildNode(D, g, h, 0, max_depth, λ, γ)
RETURN root
END PROCEDURE
Each node is built recursively using gradient and Hessian information.
PROCEDURE BuildNode(D, g, h, depth, max_depth, λ, γ):
node ← new TreeNode()
// Calculate node weight using gradient statistics
G ← ∑ᵢ g[i] for all i in D
H ← ∑ᵢ h[i] for all i in D
node.weight ← -G / (H + λ)
// Calculate current node score
current_score ← -0.5 × G² / (H + λ)
The node weight is computed using the regularized gradient statistics.
// Base cases - create leaf
IF depth ≥ max_depth OR |D| < min_samples THEN
node.is_leaf ← TRUE
RETURN node
END IF
Stopping criteria prevent trees from growing too deep.
// Find best split
best_gain ← -∞
best_feature ← NULL
best_threshold ← NULL
best_D_left ← NULL
best_D_right ← NULL
FOR EACH feature IN features DO
// Sort samples by feature value
sorted_indices ← SortByFeature(D, feature)
G_L ← 0 // Left gradient sum
H_L ← 0 // Left hessian sum
We evaluate all features to find the one that maximizes gain.
FOR EACH split_point IN sorted_indices DO
// Add current sample to left partition
i ← split_point
G_L ← G_L + g[i]
H_L ← H_L + h[i]
G_R ← G - G_L // Right gradient sum
H_R ← H - H_L // Right hessian sum
// Calculate gain using regularized objective
score_left ← -0.5 × G_L² / (H_L + λ)
score_right ← -0.5 × G_R² / (H_R + λ)
gain ← current_score - score_left - score_right - γ
The gain measures how much the split lowers the regularized objective: the parent's score minus the children's scores, less the complexity penalty γ (λ enters through each score term).
// Update best split if gain is improved
IF gain > best_gain AND H_L ≥ 1 AND H_R ≥ 1 THEN
best_gain ← gain
best_feature ← feature
best_threshold ← (D[split_point][feature] +
D[split_point+1][feature]) / 2
best_D_left ← left partition samples
best_D_right ← right partition samples
END IF
END FOR
END FOR
We ensure both child nodes have sufficient samples (Hessian weight ≥ 1).
// If no good split found (gain ≤ 0), create leaf
IF best_gain ≤ 0 OR best_feature = NULL THEN
node.is_leaf ← TRUE
RETURN node
END IF
// Set split criteria
node.is_leaf ← FALSE
node.feature ← best_feature
node.threshold ← best_threshold
If no split improves the objective, we create a leaf node.
// Create gradient/hessian arrays for child nodes
g_left, h_left ← ExtractGradients(g, h, best_D_left)
g_right, h_right ← ExtractGradients(g, h, best_D_right)
// Recursively build child nodes
node.left ← BuildNode(best_D_left, g_left, h_left,
depth + 1, max_depth, λ, γ)
node.right ← BuildNode(best_D_right, g_right, h_right,
depth + 1, max_depth, λ, γ)
RETURN node
END PROCEDURE
Child nodes are built recursively with their respective gradient statistics.
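The split search inside BuildNode can be sketched for a single numeric feature using the paper's regularized gain, gain = ½[G_L²/(H_L+λ) + G_R²/(H_R+λ) − G²/(H+λ)] − γ. The `best_split` helper is illustrative:

```python
def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy search over one feature: scan samples in sorted order,
    maintain left/right gradient sums, return the best gain and threshold."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    G, H = sum(g), sum(h)
    GL = HL = 0.0
    best_gain, best_thr = 0.0, None
    for a, b in zip(order, order[1:]):             # candidate cuts between neighbors
        GL += g[a]; HL += h[a]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, (x[a] + x[b]) / 2
    return best_gain, best_thr

# Squared-error gradients for y = [0, 0, 10, 10] at pred = 0 are g = pred - y,
# so the best cut falls between the two groups, at threshold 2.5
print(best_split([1, 2, 3, 4], [0, 0, -10, -10], [1, 1, 1, 1]))
```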
PROCEDURE XGBoostPredict(trees, x, η, F₀):
// Start from the initial constant prediction learned during training
prediction ← F₀(x)
// Add contribution from each tree
FOR EACH tree IN trees DO
prediction ← prediction + η × TreePredict(tree, x)
END FOR
RETURN prediction
END PROCEDURE
Final prediction is the sum of all tree predictions scaled by learning rate.
PROCEDURE TreePredict(node, x):
IF node.is_leaf THEN
RETURN node.weight
END IF
// Navigate tree based on feature value
IF x[node.feature] ≤ node.threshold THEN
RETURN TreePredict(node.left, x)
ELSE
RETURN TreePredict(node.right, x)
END IF
END PROCEDURE
Tree traversal follows the learned split conditions.
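TreePredict maps directly onto a recursive function. The nested-dict node layout below is a hypothetical stand-in for the TreeNode structure:

```python
def tree_predict(node, x):
    """Descend the tree following the learned split conditions."""
    if node["is_leaf"]:
        return node["weight"]
    child = "left" if x[node["feature"]] <= node["threshold"] else "right"
    return tree_predict(node[child], x)

# A single-split tree (a "stump") on feature 0 with threshold 2.5
stump = {"is_leaf": False, "feature": 0, "threshold": 2.5,
         "left":  {"is_leaf": True, "weight": -0.4},
         "right": {"is_leaf": True, "weight": 0.7}}
print(tree_predict(stump, [1.0]))  # -0.4
print(tree_predict(stump, [3.0]))  # 0.7
```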
PROCEDURE CalculateGradient(loss_function, y_true, y_pred):
CASE loss_function OF
"squared_error":
RETURN -(y_true - y_pred)
"logistic":
prob ← 1 / (1 + exp(-y_pred))
RETURN -(y_true - prob)
"softmax":
probs ← Softmax(y_pred)
RETURN -(y_true - probs)
END CASE
END PROCEDURE
Different loss functions have different gradient formulas.
PROCEDURE CalculateHessian(loss_function, y_true, y_pred):
CASE loss_function OF
"squared_error":
RETURN 1
"logistic":
prob ← 1 / (1 + exp(-y_pred))
RETURN prob × (1 - prob)
"softmax":
probs ← Softmax(y_pred)
RETURN probs × (1 - probs)
END CASE
END PROCEDURE
The Hessian captures the curvature of the loss function.
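The two procedures above can be combined into one sketch for the scalar cases (the `grad_hess` name is illustrative; the softmax case is omitted here because it is vector-valued):

```python
import math

def grad_hess(loss, y_true, y_pred):
    """First- and second-order derivatives of the loss w.r.t. the raw score."""
    if loss == "squared_error":
        return y_pred - y_true, 1.0          # constant curvature
    if loss == "logistic":
        p = 1.0 / (1.0 + math.exp(-y_pred))  # sigmoid of the raw score
        return p - y_true, p * (1.0 - p)     # curvature peaks at p = 0.5
    raise ValueError(f"unsupported loss: {loss}")

print(grad_hess("logistic", 1.0, 0.0))       # (-0.5, 0.25)
```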
PROCEDURE ColumnSubsample(features, colsample_rate):
n_features ← ⌊|features| × colsample_rate⌋
selected_features ← RandomSample(features, n_features)
RETURN selected_features
END PROCEDURE
Column subsampling introduces randomness and prevents overfitting, similar to Random Forests.
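A minimal sketch using standard-library sampling (function name and feature names are illustrative):

```python
import random

def column_subsample(features, colsample_rate, rng=random):
    """Pick ⌊|features| × rate⌋ features uniformly without replacement."""
    k = max(1, int(len(features) * colsample_rate))
    return rng.sample(features, k)

# Each tree sees only a random subset of columns
print(column_subsample(["sepal_len", "sepal_wid", "petal_len", "petal_wid"],
                       0.5, random.Random(42)))
```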
PROCEDURE FindBestSplitWithMissing(D, g, h, feature):
// Try assigning missing values to left
gain_left, threshold_left ←
CalculateSplitGain(D, g, h, feature, missing_to_left=TRUE)
// Try assigning missing values to right
gain_right, threshold_right ←
CalculateSplitGain(D, g, h, feature, missing_to_left=FALSE)
// Choose direction that maximizes gain
IF gain_left > gain_right THEN
RETURN gain_left, threshold_left, direction="left"
ELSE
RETURN gain_right, threshold_right, direction="right"
END IF
END PROCEDURE
XGBoost learns the optimal default direction for missing values.
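The core of the sparsity-aware idea is to add the missing rows' gradient statistics to each side in turn and keep whichever direction scores higher. A condensed sketch on pre-aggregated sums (all names hypothetical):

```python
def missing_direction(G_miss, H_miss, GL, HL, GR, HR, lam=1.0):
    """Pick the default direction for missing values by trying both sides."""
    def score(G, H):                 # leaf score G²/(H+λ); higher is better
        return G * G / (H + lam)
    gain_left = score(GL + G_miss, HL + H_miss) + score(GR, HR)
    gain_right = score(GL, HL) + score(GR + G_miss, HR + H_miss)
    return "left" if gain_left >= gain_right else "right"

# Missing rows whose gradients resemble the left child get routed left
print(missing_direction(-5, 2, -10, 3, 2, 3))  # left
```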
PROCEDURE PruneTree(node, γ):
IF node.is_leaf THEN
RETURN node
END IF
// Recursively prune children
node.left ← PruneTree(node.left, γ)
node.right ← PruneTree(node.right, γ)
// Calculate gain of this split
gain ← node.gain
// If gain is less than γ, convert to leaf
IF gain < γ THEN
node.is_leaf ← TRUE
node.left ← NULL
node.right ← NULL
END IF
RETURN node
END PROCEDURE
Post-pruning removes splits that don’t provide sufficient gain.
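A bottom-up pruning pass over a nested-dict tree might look like this (the layout is hypothetical; it assumes each internal node recorded the gain of its split and its own leaf weight):

```python
def prune(node, gamma):
    """Recursively prune children, then collapse splits whose gain < γ."""
    if node["is_leaf"]:
        return node
    node["left"] = prune(node["left"], gamma)
    node["right"] = prune(node["right"], gamma)
    if (node["gain"] < gamma
            and node["left"]["is_leaf"] and node["right"]["is_leaf"]):
        return {"is_leaf": True, "weight": node["weight"]}  # collapse to a leaf
    return node

tree = {"is_leaf": False, "gain": 0.05, "weight": 0.1,
        "left":  {"is_leaf": True, "weight": -0.2},
        "right": {"is_leaf": True, "weight": 0.4}}
print(prune(tree, gamma=0.1)["is_leaf"])  # True
```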
# Load library and data
library(xgboost)
data(iris)
X <- as.matrix(iris[, 1:4])
y <- as.numeric(iris$Species) - 1
# Create DMatrix (XGBoost's internal data structure)
dtrain <- xgb.DMatrix(data = X, label = y)
# Train XGBoost model
model <- xgboost(
data = dtrain,
max_depth = 3,
eta = 0.3, # learning rate
nrounds = 50, # number of trees
objective = "multi:softprob",
num_class = 3,
verbose = 0
)
# Make predictions
predictions <- predict(model, X)

# Cross-validation to find optimal number of rounds
cv_model <- xgb.cv(
data = dtrain,
max_depth = 3,
eta = 0.3,
nrounds = 100,
nfold = 5,
objective = "multi:softprob",
num_class = 3,
early_stopping_rounds = 10,
verbose = 0
)
# Get best iteration
best_iter <- cv_model$best_iteration
print(paste("Best iteration:", best_iter))
# Train final model
final_model <- xgboost(
data = dtrain,
max_depth = 3,
eta = 0.3,
nrounds = best_iter,
objective = "multi:softprob",
num_class = 3,
verbose = 0
)

# Get feature importance
importance_matrix <- xgb.importance(
feature_names = colnames(X),
model = final_model
)
# Print importance
print(importance_matrix)
# Plot feature importance
xgb.plot.importance(importance_matrix, top_n = 10)

Importance metrics include Gain (the improvement in the objective contributed by a feature's splits), Cover (the relative number of observations affected by those splits), and Frequency (how often the feature is used in splits).
# Define parameter grid (caret's "xgbTree" method expects exactly these
# seven tuning parameters; lambda/alpha are not part of its grid)
library(caret)
param_grid <- expand.grid(
  nrounds = c(50, 100),
  max_depth = c(3, 5, 7),
  eta = c(0.01, 0.1, 0.3),
  gamma = c(0, 0.1, 0.5),              # minimum split loss
  colsample_bytree = c(0.7, 0.8, 1.0),
  min_child_weight = c(1, 5),
  subsample = c(0.7, 0.8, 1.0)
)
# Train control
train_control <- trainControl(
method = "cv",
number = 5,
verboseIter = FALSE
)
# Tune using caret
tuned_model <- train(
x = X,
y = factor(y),
method = "xgbTree",
trControl = train_control,
tuneGrid = param_grid,
objective = "multi:softprob"
)
# Best parameters
print(tuned_model$bestTune)

Time Complexity: O(T × d × ‖x‖₀ × log n) for exact greedy split finding, where d is the maximum tree depth and ‖x‖₀ the number of non-missing entries; pre-sorted column blocks reduce this to O(T × d × ‖x‖₀ + ‖x‖₀ × log n) (Chen & Guestrin, 2016)
Space Complexity: O(T × d × max_leaves)
Tree Parameters:
- max_depth: Maximum tree depth (3-10)
- min_child_weight: Minimum sum of Hessian (1-10)
- gamma: Minimum loss reduction for split (0-1)

Boosting Parameters:
- learning_rate (eta): Step size shrinkage (0.01-0.3)
- n_estimators: Number of boosting rounds (50-1000)

Regularization:
- lambda: L2 regularization term (0-10)
- alpha: L1 regularization term (0-10)

Sampling:
- subsample: Row sampling ratio (0.5-1.0)
- colsample_bytree: Column sampling ratio (0.5-1.0)
When to use XGBoost:
When to avoid XGBoost:
XGBoost vs Random Forest:
XGBoost vs Gradient Boosting:
Common use cases:
Industry applications:
Chen & Guestrin (2016) designed XGBoost with several systems optimizations:
These optimizations make XGBoost more than ten times faster than previously popular gradient boosting implementations on a single machine.
Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189-1232.
Heineman, G.T., Pollice, G., & Selkow, S. (2009). Algorithms in a Nutshell. O’Reilly Media, Inc.
XGBoost Documentation: https://xgboost.readthedocs.io/
XGBoost is a powerful gradient boosting algorithm that:
It remains one of the most successful algorithms for structured data, winning numerous machine learning competitions and powering production systems worldwide.
Key takeaway: XGBoost combines statistical learning theory with systems engineering to create a highly efficient and accurate gradient boosting implementation.