§ 01The ensemble idea
A single decision tree is a flawed oracle: deep enough to memorize the training set, shallow enough to miss the pattern. Ensemble learning works around that flaw by combining many imperfect learners into one strong learner. The key insight — sometimes called the wisdom-of-crowds theorem — is that if individual errors are not perfectly correlated, averaging cancels them out.
There are two dominant philosophies for building the crowd:
- Bagging parallel — train many models independently on different views of the data, then average. The goal is to reduce variance.
- Boosting sequential — train models one after another, each one focusing on the mistakes of the previous. The goal is to reduce bias.
§ 02Bias-variance recap
For a model f̂(x) trained on a dataset D, the expected squared error at a point x decomposes as:
- Bias how far the average prediction is from the truth. High when the model is too simple (underfitting).
- Variance how much predictions wobble across different training sets. High when the model is too flexible (overfitting).
- σ² irreducible noise — labelling noise in y. Nothing fixes this.
Slide the model complexity. Watch bias² fall and variance rise. The total error is the sum — and the sweet spot is rarely at either extreme. Bagging shifts the variance curve down; boosting shifts the bias curve down.
§ 03Bagging vs Boosting at a glance
| Bagging | Boosting | |
|---|---|---|
| Training order | Parallel — trees independent | Sequential — each tree depends on the last |
| Data per learner | Bootstrap sample (with replacement) | Full data, re-weighted by errors |
| Base learners | Low-bias, high-variance (deep trees) | High-bias, low-variance (stumps / shallow trees) |
| Aggregation | Simple average / majority vote | Weighted sum (each learner has a weight) |
| Primary effect | ↓ Variance | ↓ Bias (and some variance) |
| Risk if overdone | Cannot overfit much (asymptotes) | Can overfit if iterations too high |
| Canonical examples | Random Forest, Extra Trees | AdaBoost, Gradient Boosting, XGBoost, LightGBM |
§ 04The bagging procedure
Bagging — short for Bootstrap Aggregating — was introduced by Breiman in 1996. The recipe is three steps:
step 2 · fit train model f̂_b on D_b, independently, for b = 1..B
step 3 · aggregate regression: f̂(x) = (1/B) · Σ f̂_b(x)
classification: f̂(x) = majority vote { f̂_b(x) }
The bootstrap is the heart of the method. Sampling with replacement means each D_b contains some duplicates and is missing roughly 37% of the original points (the famous 1/e ≈ 0.368 probability that a given point is never picked in n draws from n items). Those out-of-bag points become a free validation set.
§ 05Bootstrap sampling — animated
Watch how a bootstrap sample is drawn from an original dataset of 10 points. Some points are picked multiple times; others are never picked (the out-of-bag set, shown faded).
Press Animate to begin. The arrow shows which original point is being drawn into the bootstrap sample.
For n = 10, the probability a particular point is excluded from one bag is (1 − 1/n)ⁿ ≈ 0.349. For large n it approaches 1/e ≈ 0.368. Those out-of-bag (OOB) points let you estimate test error with no separate validation set.
§ 06Bagging for regression — worked example
Suppose we want to predict house price from square footage. We bootstrap 5 samples from a training set of 8 houses and fit a small regression tree on each. For a new house at x = 1800 sq ft, the five trees give:
| Tree b | f̂₁ | f̂₂ | f̂₃ | f̂₄ | f̂₅ |
|---|---|---|---|---|---|
| Prediction (in $k) | 295 | 320 | 305 | 340 | 310 |
The bagged regression prediction is the simple average:
Notice the individual trees disagree by up to $45k. The average is more stable than any single tree. To quantify the gain, suppose each tree's error has variance σ² = 400 and the pairwise correlation between trees is ρ = 0.3. Then:
bagged (B=5) Var[f̂(x)] = ρ·σ² + (1−ρ)/B · σ² = 0.3·400 + 0.7/5 · 400
= 120 + 56 = 176 (56% reduction)
As B → ∞, the variance floors at ρσ². The correlation between trees is the bottleneck — which is precisely the problem Random Forest fixes (§08).
A noisy sinusoid. Each thin colored line is a decision tree trained on its own bootstrap of the data. The thick green line is their average — far smoother than any individual.
§ 07Bagging for classification — worked example
For classification we replace averaging with majority vote. Consider spam detection on one email x. We train B = 7 trees on bootstrap samples:
| Tree | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Vote | spam | ham | spam | spam | ham | spam | spam |
Votes: 5 spam vs 2 ham → bagged classifier predicts spam.
If we also want a probability, use soft voting — average the predicted class probabilities from each tree:
| Tree | 1 | 2 | 3 | 4 | 5 | 6 | 7 | average |
|---|---|---|---|---|---|---|---|---|
| P(spam | x) | 0.78 | 0.42 | 0.81 | 0.66 | 0.39 | 0.74 | 0.69 | 0.64 |
P(spam) = 0.64 → still spam, with calibrated confidence. Soft voting is generally preferred to hard voting because it preserves uncertainty information.
§ 08Random Forest — fixing the correlation problem
Plain bagged trees share most of their structure: every tree gets to pick from all features at every split, so they all latch onto the same dominant predictor. Their errors stay correlated, and the variance reduction stalls early.
Random Forest (Breiman, 2001) adds a second source of randomness:
then choose the best split among that subset only
This forces trees to use different features, decorrelating their predictions. Recall the bagging variance formula ρσ² + (1−ρ)σ²/B. Random Forest lowers ρ, so the asymptotic floor drops too.
Random Forest = Bagging + per-split feature subsampling. It is one of the most reliably strong baselines in tabular ML.
§ 09Why bagging reduces variance — the math
Let f̂_b(x) denote the prediction of the b-th bagged learner. The bagged predictor is f̂(x) = (1/B) Σ f̂_b(x). Its variance:
= (1/B²) · [ Σ Var[f̂_b] + Σ_{i≠j} Cov[f̂_i, f̂_j] ]
= (1/B²) · [ B·σ² + B(B−1)·ρσ² ]
= ρσ² + (1 − ρ)σ² / B
Two observations:
- If ρ = 0 (independent learners), variance falls as σ²/B — pure averaging.
- If ρ = 1 (identical learners), variance stays at σ² — bagging does nothing.
What about bias? Each bagged tree has roughly the same bias as a single tree trained on the full data (bootstraps are statistically similar to the original). So:
Bagging trades variance for nothing — it's free, provided you can afford the compute and your base learner has variance to spare. That's why deep, unpruned trees are the perfect base learner.
§ 10Bagging — advantages, disadvantages, when to use
- Reduces variance dramatically without inflating bias.
- Embarrassingly parallel — every tree trains independently. Scales linearly with cores.
- Out-of-bag error gives a free, nearly-unbiased estimate of test performance.
- Robust to noise and outliers — a single noisy point lands in only ~63% of bags.
- No hyperparameter tuning crisis — performance is monotone in B (more trees never hurt much).
- Handles missing values and mixed feature types natively (via base trees).
- No bias reduction — if your base learner is itself biased (a stump, a linear model), bagging won't save you.
- Higher prediction latency — must evaluate all B trees at inference.
- Larger model size — B times the memory of a single tree.
- Interpretability suffers — a forest is a black box compared to a single tree.
- Diminishing returns past correlation floor — the ρσ² term doesn't shrink with B.
- You suspect overfitting — high variance is your problem.
- You want a strong, low-effort baseline on tabular data.
- You have noisy labels or outliers.
- You need OOB estimates instead of holding out data.
- You can parallelize training and inference.
§ 11The boosting procedure
Boosting also builds an ensemble, but the philosophy is the opposite of bagging: instead of training independent learners on different views of the data, boosting trains learners sequentially, each one trying to fix the previous one's mistakes.
The general recipe (additive modelling):
for m = 1..M fit a weak learner h_m to the residuals / errors of F_{m−1}
update F_m(x) = F_{m−1}(x) + η · α_m · h_m(x)
return F_M(x)
The base learners h_m are usually weak — typically decision stumps or depth-2/3 trees. Each is barely better than random; the magic is in their sum. η is a small learning rate (typically 0.01–0.1) that slows the descent and improves generalization.
Two flavours dominate:
- AdaBoost The original (Freund & Schapire, 1995). Re-weights data: misclassified points become heavier. Works for classification.
- Gradient Boosting Friedman (1999). Each new tree fits the negative gradient of the loss — i.e. the residuals. Works for any differentiable loss: regression, classification, ranking.
§ 12AdaBoost — animated
AdaBoost on a tiny 2-D dataset. At each round we (1) fit a stump that minimizes weighted error, (2) compute its weight α, (3) up-weight the misclassified points so the next stump pays them more attention.
Each circle is a data point; its size shows its weight. Watch misclassified points grow after each round — that's how the next stump gets nudged to fix them.
§ 13AdaBoost classification — full numeric walkthrough
Let's do every number. Dataset: 5 points, labels in {−1, +1}:
| i | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| x | 1 | 2 | 3 | 4 | 5 |
| y | +1 | +1 | −1 | −1 | +1 |
Initialize uniform weights:
Round 1
Try every possible stump h(x) = sign(x − θ). The stump h₁(x) = +1 if x < 2.5 else −1 classifies points 1,2,3,4 correctly and misclassifies only point 5.
learner weight α₁ = ½ · ln((1 − ε₁) / ε₁) = ½ · ln(0.80/0.20) = ½ · ln(4) = 0.693
re-weight w_i⁽²⁾ ∝ w_i⁽¹⁾ · exp(−α₁ · y_i · h₁(x_i))
correct: multiply by e^{−0.693} = 0.500
wrong: multiply by e^{+0.693} = 2.000
| i | 1 | 2 | 3 | 4 | 5 | Σ |
|---|---|---|---|---|---|---|
| before norm | 0.100 | 0.100 | 0.100 | 0.100 | 0.400 | 0.800 |
| w⁽²⁾ | 0.125 | 0.125 | 0.125 | 0.125 | 0.500 | 1.000 |
Point 5 — the one we got wrong — now carries half of the total weight. The next stump must deal with it.
Round 2
With these new weights, the best stump is h₂(x) = +1 if x > 4.5 else −1 (correctly classifies point 5 and points 3,4; misclassifies points 1,2).
learner weight α₂ = ½ · ln(0.75/0.25) = ½ · ln(3) = 0.549
re-weight correct: ×e^{−0.549}=0.577 | wrong: ×e^{+0.549}=1.732
| i | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| before norm | 0.217 | 0.217 | 0.072 | 0.072 | 0.289 |
| w⁽³⁾ | 0.250 | 0.250 | 0.083 | 0.083 | 0.333 |
Round 3
The best stump now is h₃(x) = +1 if 2.5 < x < 4.5 else −1… wait, that's not a single threshold. A proper stump tries every split. With these weights the best simple split is h₃(x) = −1 if x < 2.5 else +1 (correctly handles 3,4,5; wrong on 1,2 — but those have small weight). Let's pick an even better one: h₃(x) = −1 if 2.5 ≤ x ≤ 4.5 else +1 correctly classifies all 5. (For pure stumps we accept that exact zero error is rare — let's use the split that gives ε = 0.166.)
Final classifier
= sign( 0.693·h₁(x) + 0.549·h₂(x) + 0.809·h₃(x) )
Each weak stump alone had ~20–25% error. Their weighted vote separates the data perfectly. That's boosting in action: a sum of weak learners is a strong learner.
§ 14Gradient Boosting for regression — full numeric walkthrough
Gradient boosting generalizes AdaBoost: instead of re-weighting points, we fit each new learner to the negative gradient of the loss with respect to the current prediction. For squared loss, the negative gradient is exactly the residual — making the procedure especially intuitive.
Dataset: predict y from x.
| i | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| x | 1 | 2 | 3 | 4 | 5 |
| y | 1.0 | 3.0 | 2.0 | 5.0 | 4.0 |
Use squared loss L = ½ (y − F)², learning rate η = 0.5, and stumps of depth 1.
Step 0 — initialize
The optimal constant under squared loss is the mean:
Step 1 — compute residuals, fit tree 1
Residuals r_i = y_i − F₀(x_i):
| i | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| r⁽¹⁾ | −2.0 | 0.0 | −1.0 | +2.0 | +1.0 |
Fit a stump on (x, r⁽¹⁾). The best split (minimizing SSE) is at x < 3.5:
right leaf (x≥4) mean(+2, +1) = +1.5
Update predictions
| i | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| h₁ | −1.0 | −1.0 | −1.0 | +1.5 | +1.5 |
| F₁ | 2.50 | 2.50 | 2.50 | 3.75 | 3.75 |
| y | 1.0 | 3.0 | 2.0 | 5.0 | 4.0 |
| SSE | 2.25 | 0.25 | 0.25 | 1.56 | 0.06 |
Total SSE: 4.37 (down from 10.0 at step 0).
Step 2 — new residuals, fit tree 2
| i | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| r⁽²⁾ = y − F₁ | −1.50 | +0.50 | −0.50 | +1.25 | +0.25 |
Best split for these residuals is at x < 1.5:
right leaf (x≥2) mean(+0.50, −0.50, +1.25, +0.25) = +0.375
Update F₂(x) = F₁(x) + 0.5 · h₂(x):
| i | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| h₂ | −1.500 | +0.375 | +0.375 | +0.375 | +0.375 |
| F₂ | 1.75 | 2.69 | 2.69 | 3.94 | 3.94 |
| SSE | 0.56 | 0.10 | 0.47 | 1.13 | 0.00 |
Total SSE: 2.26.
After 10 such rounds
Each tree shaves a bit more residual. The total prediction is the sum:
Each round adds a small tree fit to the previous residuals. Watch the prediction (green) creep toward the true function (blue) as rounds accumulate.
Round 0: F₀ = mean(y).
§ 15Gradient Boosting for classification
For binary classification with y ∈ {0, 1} we use the log-loss (cross-entropy):
We don't predict probabilities directly; we predict log-odds F(x) and convert with the sigmoid p = σ(F) = 1/(1+e^{−F}). The negative gradient is beautifully simple:
That's just the prediction error in probability space. So gradient-boosted classification is identical to regression — fit trees to y − p — but in log-odds space.
Numeric example
Dataset: 4 points, 2 positives, 2 negatives.
| i | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| x | 1 | 2 | 3 | 4 |
| y | 0 | 0 | 1 | 1 |
Step 0: initialize with log-odds of the empirical positive rate:
Step 1: residuals are r⁽¹⁾_i = y_i − p₀ = y_i − 0.5:
| i | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| r⁽¹⁾ | −0.5 | −0.5 | +0.5 | +0.5 |
Fit a stump. Best split at x < 2.5: left leaf = −0.5, right leaf = +0.5.
With η = 1.0:
| i | F₁ = F₀ + h₁ | p₁ = σ(F₁) |
|---|---|---|
| 1, 2 | −0.5 | 0.378 |
| 3, 4 | +0.5 | 0.622 |
Step 2: new residuals = y − p₁:
| i | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| r⁽²⁾ | −0.378 | −0.378 | +0.378 | +0.378 |
Same split direction, smaller leaf values. After many rounds, F_M grows large positive on the right, large negative on the left, and p approaches 0 or 1 as appropriate.
The exact same algorithm — fit trees to residuals — works for classification because of the elegant fact that cross-entropy's gradient is just the probability error.
§ 16XGBoost / LightGBM / CatBoost — in a paragraph
Modern gradient boosting libraries are all gradient boosting at heart, with engineering refinements:
| Library | Key innovation | Best at |
|---|---|---|
| XGBoost | Second-order (Newton) gradient updates · L1/L2 regularization on leaves · efficient histogram splits | Tabular data, Kaggle, robust default |
| LightGBM | Leaf-wise growth (vs level-wise) · histogram binning · GOSS sampling · EFB feature bundling | Very large datasets, speed |
| CatBoost | Ordered boosting (prevents target leakage) · native categorical encoding | Categorical-heavy data, less tuning |
All three use the residual-fitting framework from §14 — they just compute splits faster, regularize cleverly, and handle edge cases the textbook version doesn't.
§ 17Why boosting reduces bias — the math
Each round of boosting solves:
By Taylor expansion around F_{m−1}:
where g = ∂L/∂F, H = ∂²L/∂F²
Minimizing in h for fixed leaf values gives h* ≈ −g/H. The new tree fits the negative gradient — that's literal gradient descent in function space.
Each iteration takes the loss strictly downward (with small enough η). The training-set bias drops monotonically. If we ran for infinite rounds, training error would hit zero. But at some point, additional rounds start fitting noise — and test error rises. So boosting needs early stopping; bagging doesn't.
§ 18Boosting — advantages, disadvantages, when to use
- Often the most accurate model on tabular data — dominates Kaggle and tabular benchmarks.
- Reduces bias — works with weak (high-bias) base learners and turns them strong.
- Flexible loss functions — squared, logistic, Poisson, Cox, quantile, ranking, custom. Anything differentiable.
- Built-in feature selection — splits ignore irrelevant features automatically.
- Strong with small/medium data — needs less data than deep nets for similar accuracy.
- Modern libraries (XGBoost, LightGBM, CatBoost) are extremely fast and well-engineered.
- Will overfit if iterations or tree depth are too high — needs early stopping and regularization.
- Sequential training — cannot trivially parallelize across trees (within-tree parallelism only).
- Sensitive to noisy labels and outliers — they keep getting re-emphasized (especially AdaBoost).
- Hyperparameter heavy — learning rate, # trees, depth, regularization, subsample, colsample…
- Harder to interpret than a single tree; comparable to bagging in opacity.
- Slower inference than bagging when trees are deeper or more numerous.
- You're on tabular data and want the highest possible accuracy.
- Your base model underfits — bias is your enemy.
- You can afford a validation set for early stopping.
- Labels are clean; outliers are rare.
- You can spend time tuning hyperparameters (or use Bayesian search).
§ 19Bias-variance side by side
| Bias | Variance | Mechanism | |
|---|---|---|---|
| Single deep tree | low | very high | Fully grown tree memorizes training set |
| Single stump | very high | low | One split can't capture patterns |
| Bagging (deep trees) | low | low | Averaging cancels independent errors |
| Random Forest | low | lower still | Decorrelates trees with random feature subsets |
| Boosting (stumps) | low | moderate | Each stage chips away at residual bias |
| Boosting (overfit) | low | high | Too many rounds memorize noise |
A small noisy regression problem. The green band is the bagged forest; the orange line is the gradient boost. Toggle to compare how each one handles bias and variance over iterations.
§ 20When to use which
- You have noisy labels or many outliers.
- You need a quick, robust baseline with minimal tuning.
- You want OOB error (no need for separate validation).
- Training must be parallel.
- Interpretability of variable importance matters more than peak accuracy.
- You suspect your base model is high-variance.
- You want maximum accuracy on tabular data and can afford tuning.
- Labels are clean; data is reasonably well-curated.
- You can run with early stopping on a validation set.
- Loss is unusual: quantile, ranking, survival — boosting handles these gracefully.
- You suspect your base model is high-bias (under-fitting).
- Competition / production-grade tabular benchmark.
§ 21Who wins? — the empirical verdict
If you only get one answer, here it is:
On noisy/quick-baseline scenarios: random forest wins on robustness and ease of use.
Some numbers from the literature:
- Kaggle competitions: gradient boosting (XGBoost/LightGBM) wins 60–70% of tabular contests; random forests rarely top leaderboards but make solid stacking components.
- Grinsztajn et al. (NeurIPS 2022) — across 45 tabular benchmarks, GBDT outperformed neural nets and held a small but consistent edge over random forests on accuracy.
- Real-world deployment: random forests dominate where ML teams want a model that "just works" with minimal tuning — credit scoring, fraud baselines, healthcare risk models.
- Start with random forest. Get a baseline in 5 lines of code.
- Move to LightGBM/XGBoost if you need to squeeze more accuracy.
- Stack them if you really want every last drop — a forest and a boosted model often make complementary errors.
§ 22Reference — at a glance
| Property | Bagging (RF) | Boosting (GBDT) |
|---|---|---|
| Combining rule | f̂ = (1/B)Σ f̂_b | F_M = F₀ + ηΣ h_m |
| Training | Parallel | Sequential |
| Data per learner | Bootstrap sample | Full data, weighted (or fit to residuals) |
| Base learner | Deep tree (low-bias, high-var) | Shallow tree / stump (high-bias) |
| Reduces | Variance | Bias (+ some var) |
| Overfit risk | Low — asymptotes | High — needs early stopping |
| Noise robustness | High | Lower (esp. AdaBoost) |
| Tuning effort | Low | High |
| Out-of-bag estimate | Yes (free) | No |
| Best for | Quick baselines, noisy data | Peak accuracy on tabular |
| Famous examples | Random Forest, Extra Trees | AdaBoost, GBM, XGBoost, LightGBM, CatBoost |
Companion 06 of 06 · ensemble learning · bagging & bootstrap aggregating · boosting & gradient descent in function space · trade-offs & practical use.