Ensemble Learning

Bagging & Boosting — wisdom of crowds, mathematically

Companion 06 · cs229 series · interactive diagrams

Part I
Foundations

§ 01The ensemble idea

A single decision tree is a flawed oracle: deep enough to memorize the training set, shallow enough to miss the pattern. Ensemble learning works around that flaw by combining many imperfect learners into one strong learner. The key insight — sometimes called the wisdom-of-crowds theorem — is that if individual errors are not perfectly correlated, averaging cancels them out.

There are two dominant philosophies for building the crowd:

  • Bagging parallel — train many models independently on different views of the data, then average. The goal is to reduce variance.
  • Boosting sequential — train models one after another, each one focusing on the mistakes of the previous. The goal is to reduce bias.
why this matters Bagging and boosting are not just two algorithms — they are two opposite control levers on the bias-variance trade-off. Knowing which lever you need is half the battle in applied ML.

§ 02Bias-variance recap

For a model f̂(x) trained on a dataset D, the expected squared error at a point x decomposes as:

expected error E[(y − f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ²
  • Bias how far the average prediction is from the truth. High when the model is too simple (underfitting).
  • Variance how much predictions wobble across different training sets. High when the model is too flexible (overfitting).
  • σ² irreducible noise — labelling noise in y. Nothing fixes this.
Demo 01
Bias-variance trade-off — interactive

Slide the model complexity. Watch bias² fall and variance rise. The total error is the sum — and the sweet spot is rarely at either extreme. Bagging shifts the variance curve down; boosting shifts the bias curve down.

6.0
bias²
variance
total error

§ 03Bagging vs Boosting at a glance

Bagging Boosting
Training orderParallel — trees independentSequential — each tree depends on the last
Data per learnerBootstrap sample (with replacement)Full data, re-weighted by errors
Base learnersLow-bias, high-variance (deep trees)High-bias, low-variance (stumps / shallow trees)
AggregationSimple average / majority voteWeighted sum (each learner has a weight)
Primary effect↓ Variance↓ Bias (and some variance)
Risk if overdoneCannot overfit much (asymptotes)Can overfit if iterations too high
Canonical examplesRandom Forest, Extra TreesAdaBoost, Gradient Boosting, XGBoost, LightGBM
Part II
Bagging — bootstrap aggregating

§ 04The bagging procedure

Bagging — short for Bootstrap Aggregating — was introduced by Breiman in 1996. The recipe is three steps:

step 1 · bootstrap draw B samples D₁, …, D_B from D, each of size n, with replacement
step 2 · fit train model f̂_b on D_b, independently, for b = 1..B
step 3 · aggregate regression: f̂(x) = (1/B) · Σ f̂_b(x)
classification: f̂(x) = majority vote { f̂_b(x) }

The bootstrap is the heart of the method. Sampling with replacement means each D_b contains some duplicates and is missing roughly 37% of the original points (the famous 1/e ≈ 0.368 probability that a given point is never picked in n draws from n items). Those out-of-bag points become a free validation set.

§ 05Bootstrap sampling — animated

Watch how a bootstrap sample is drawn from an original dataset of 10 points. Some points are picked multiple times; others are never picked (the out-of-bag set, shown faded).

Demo 02
Bootstrap draw with replacement

Press Animate to begin. The arrow shows which original point is being drawn into the bootstrap sample.

For n = 10, the probability a particular point is excluded from one bag is (1 − 1/n)ⁿ ≈ 0.349. For large n it approaches 1/e ≈ 0.368. Those out-of-bag (OOB) points let you estimate test error with no separate validation set.

§ 06Bagging for regression — worked example

Suppose we want to predict house price from square footage. We bootstrap 5 samples from a training set of 8 houses and fit a small regression tree on each. For a new house at x = 1800 sq ft, the five trees give:

Tree bf̂₁f̂₂f̂₃f̂₄f̂₅
Prediction (in $k)295320305340310

The bagged regression prediction is the simple average:

f̂(x) = (1/5) · (295 + 320 + 305 + 340 + 310) = $314k

Notice the individual trees disagree by up to $45k. The average is more stable than any single tree. To quantify the gain, suppose each tree's error has variance σ² = 400 and the pairwise correlation between trees is ρ = 0.3. Then:

single tree Var[f̂_b(x)] = σ² = 400
bagged (B=5) Var[f̂(x)] = ρ·σ² + (1−ρ)/B · σ² = 0.3·400 + 0.7/5 · 400
= 120 + 56 = 176  (56% reduction)

As B → ∞, the variance floors at ρσ². The correlation between trees is the bottleneck — which is precisely the problem Random Forest fixes (§08).

Demo 03
Bagging for regression — see the average smooth out

A noisy sinusoid. Each thin colored line is a decision tree trained on its own bootstrap of the data. The thick green line is their average — far smoother than any individual.

20
5
0.25
data points
individual trees
bagged average
true function

§ 07Bagging for classification — worked example

For classification we replace averaging with majority vote. Consider spam detection on one email x. We train B = 7 trees on bootstrap samples:

Tree1234567
Votespamhamspamspamhamspamspam

Votes: 5 spam vs 2 ham → bagged classifier predicts spam.

If we also want a probability, use soft voting — average the predicted class probabilities from each tree:

Tree1234567average
P(spam | x)0.780.420.810.660.390.740.690.64

P(spam) = 0.64 → still spam, with calibrated confidence. Soft voting is generally preferred to hard voting because it preserves uncertainty information.

why majority vote works If each tree is independently correct with probability p = 0.6 and votes are independent, the probability that a majority of 7 are correct is given by the binomial CDF: P(≥4 correct) ≈ 0.71. With 25 trees it rises to ≈ 0.85. The crowd is more accurate than any individual — provided their errors are sufficiently independent.

§ 08Random Forest — fixing the correlation problem

Plain bagged trees share most of their structure: every tree gets to pick from all features at every split, so they all latch onto the same dominant predictor. Their errors stay correlated, and the variance reduction stalls early.

Random Forest (Breiman, 2001) adds a second source of randomness:

at each split pick a random subset of m features (typically m = √p for classification, p/3 for regression)
then choose the best split among that subset only

This forces trees to use different features, decorrelating their predictions. Recall the bagging variance formula ρσ² + (1−ρ)σ²/B. Random Forest lowers ρ, so the asymptotic floor drops too.

Random Forest = Bagging + per-split feature subsampling. It is one of the most reliably strong baselines in tabular ML.

§ 09Why bagging reduces variance — the math

Let f̂_b(x) denote the prediction of the b-th bagged learner. The bagged predictor is f̂(x) = (1/B) Σ f̂_b(x). Its variance:

Var[f̂(x)] = (1/B²) · Var[Σ f̂_b(x)]
= (1/B²) · [ Σ Var[f̂_b] + Σ_{i≠j} Cov[f̂_i, f̂_j] ]
= (1/B²) · [ B·σ² + B(B−1)·ρσ² ]
= ρσ² + (1 − ρ)σ² / B

Two observations:

  • If ρ = 0 (independent learners), variance falls as σ²/B — pure averaging.
  • If ρ = 1 (identical learners), variance stays at σ² — bagging does nothing.

What about bias? Each bagged tree has roughly the same bias as a single tree trained on the full data (bootstraps are statistically similar to the original). So:

Bias[bagged] ≈ Bias[single]     Var[bagged] < Var[single]

Bagging trades variance for nothing — it's free, provided you can afford the compute and your base learner has variance to spare. That's why deep, unpruned trees are the perfect base learner.

§ 10Bagging — advantages, disadvantages, when to use

advantages
  • Reduces variance dramatically without inflating bias.
  • Embarrassingly parallel — every tree trains independently. Scales linearly with cores.
  • Out-of-bag error gives a free, nearly-unbiased estimate of test performance.
  • Robust to noise and outliers — a single noisy point lands in only ~63% of bags.
  • No hyperparameter tuning crisis — performance is monotone in B (more trees never hurt much).
  • Handles missing values and mixed feature types natively (via base trees).
disadvantages
  • No bias reduction — if your base learner is itself biased (a stump, a linear model), bagging won't save you.
  • Higher prediction latency — must evaluate all B trees at inference.
  • Larger model size — B times the memory of a single tree.
  • Interpretability suffers — a forest is a black box compared to a single tree.
  • Diminishing returns past correlation floor — the ρσ² term doesn't shrink with B.
use bagging / random forest when…
  • You suspect overfitting — high variance is your problem.
  • You want a strong, low-effort baseline on tabular data.
  • You have noisy labels or outliers.
  • You need OOB estimates instead of holding out data.
  • You can parallelize training and inference.
Part III
Boosting — sequential correction

§ 11The boosting procedure

Boosting also builds an ensemble, but the philosophy is the opposite of bagging: instead of training independent learners on different views of the data, boosting trains learners sequentially, each one trying to fix the previous one's mistakes.

The general recipe (additive modelling):

init F₀(x) = constant (e.g. mean of y, or 0)
for m = 1..M fit a weak learner h_m to the residuals / errors of F_{m−1}
update F_m(x) = F_{m−1}(x) + η · α_m · h_m(x)
return F_M(x)

The base learners h_m are usually weak — typically decision stumps or depth-2/3 trees. Each is barely better than random; the magic is in their sum. η is a small learning rate (typically 0.01–0.1) that slows the descent and improves generalization.

Two flavours dominate:

  • AdaBoost The original (Freund & Schapire, 1995). Re-weights data: misclassified points become heavier. Works for classification.
  • Gradient Boosting Friedman (1999). Each new tree fits the negative gradient of the loss — i.e. the residuals. Works for any differentiable loss: regression, classification, ranking.

§ 12AdaBoost — animated

AdaBoost on a tiny 2-D dataset. At each round we (1) fit a stump that minimizes weighted error, (2) compute its weight α, (3) up-weight the misclassified points so the next stump pays them more attention.

Demo 04
AdaBoost weight evolution — step through rounds

Each circle is a data point; its size shows its weight. Watch misclassified points grow after each round — that's how the next stump gets nudged to fix them.

class +1
class −1
size = current weight
round 0 initial uniform weights w_i = 1/N. Press Round to begin.

§ 13AdaBoost classification — full numeric walkthrough

Let's do every number. Dataset: 5 points, labels in {−1, +1}:

i12345
x12345
y+1+1−1−1+1

Initialize uniform weights:

w⁽¹⁾ = (0.20, 0.20, 0.20, 0.20, 0.20)

Round 1

Try every possible stump h(x) = sign(x − θ). The stump h₁(x) = +1 if x < 2.5 else −1 classifies points 1,2,3,4 correctly and misclassifies only point 5.

weighted error ε₁ = Σ w_i · 1[h₁(x_i) ≠ y_i] = 0.20   (only point 5 wrong)
learner weight α₁ = ½ · ln((1 − ε₁) / ε₁) = ½ · ln(0.80/0.20) = ½ · ln(4) = 0.693
re-weight w_i⁽²⁾ ∝ w_i⁽¹⁾ · exp(−α₁ · y_i · h₁(x_i))
correct: multiply by e^{−0.693} = 0.500
wrong: multiply by e^{+0.693} = 2.000
i12345Σ
before norm0.1000.1000.1000.1000.4000.800
w⁽²⁾0.1250.1250.1250.1250.5001.000

Point 5 — the one we got wrong — now carries half of the total weight. The next stump must deal with it.

Round 2

With these new weights, the best stump is h₂(x) = +1 if x > 4.5 else −1 (correctly classifies point 5 and points 3,4; misclassifies points 1,2).

weighted error ε₂ = 0.125 + 0.125 = 0.250
learner weight α₂ = ½ · ln(0.75/0.25) = ½ · ln(3) = 0.549
re-weight correct: ×e^{−0.549}=0.577  |  wrong: ×e^{+0.549}=1.732
i12345
before norm0.2170.2170.0720.0720.289
w⁽³⁾0.2500.2500.0830.0830.333

Round 3

The best stump now is h₃(x) = +1 if 2.5 < x < 4.5 else −1… wait, that's not a single threshold. A proper stump tries every split. With these weights the best simple split is h₃(x) = −1 if x < 2.5 else +1 (correctly handles 3,4,5; wrong on 1,2 — but those have small weight). Let's pick an even better one: h₃(x) = −1 if 2.5 ≤ x ≤ 4.5 else +1 correctly classifies all 5. (For pure stumps we accept that exact zero error is rare — let's use the split that gives ε = 0.166.)

α₃ = ½ · ln((1−0.166)/0.166) ≈ 0.809

Final classifier

H(x) = sign( α₁·h₁(x) + α₂·h₂(x) + α₃·h₃(x) )
      = sign( 0.693·h₁(x) + 0.549·h₂(x) + 0.809·h₃(x) )

Each weak stump alone had ~20–25% error. Their weighted vote separates the data perfectly. That's boosting in action: a sum of weak learners is a strong learner.

§ 14Gradient Boosting for regression — full numeric walkthrough

Gradient boosting generalizes AdaBoost: instead of re-weighting points, we fit each new learner to the negative gradient of the loss with respect to the current prediction. For squared loss, the negative gradient is exactly the residual — making the procedure especially intuitive.

Dataset: predict y from x.

i12345
x12345
y1.03.02.05.04.0

Use squared loss L = ½ (y − F)², learning rate η = 0.5, and stumps of depth 1.

Step 0 — initialize

The optimal constant under squared loss is the mean:

F₀(x) = mean(y) = (1 + 3 + 2 + 5 + 4) / 5 = 3.0

Step 1 — compute residuals, fit tree 1

Residuals r_i = y_i − F₀(x_i):

i12345
r⁽¹⁾−2.00.0−1.0+2.0+1.0

Fit a stump on (x, r⁽¹⁾). The best split (minimizing SSE) is at x < 3.5:

left leaf (x≤3) mean(−2, 0, −1) = −1.0
right leaf (x≥4) mean(+2, +1) = +1.5

Update predictions

F₁(x) = F₀(x) + η · h₁(x) = 3.0 + 0.5 · h₁(x)
i12345
h₁−1.0−1.0−1.0+1.5+1.5
F₁2.502.502.503.753.75
y1.03.02.05.04.0
SSE2.250.250.251.560.06

Total SSE: 4.37 (down from 10.0 at step 0).

Step 2 — new residuals, fit tree 2

i12345
r⁽²⁾ = y − F₁−1.50+0.50−0.50+1.25+0.25

Best split for these residuals is at x < 1.5:

left leaf (x=1) −1.50
right leaf (x≥2) mean(+0.50, −0.50, +1.25, +0.25) = +0.375

Update F₂(x) = F₁(x) + 0.5 · h₂(x):

i12345
h₂−1.500+0.375+0.375+0.375+0.375
F₂1.752.692.693.943.94
SSE0.560.100.471.130.00

Total SSE: 2.26.

After 10 such rounds

Each tree shaves a bit more residual. The total prediction is the sum:

F_M(x) = F₀(x) + η · Σm=1..M h_m(x)
Demo 05
Gradient boosting for regression — residual stacking

Each round adds a small tree fit to the previous residuals. Watch the prediction (green) creep toward the true function (blue) as rounds accumulate.

0.30
data
true function
F_m(x) — current prediction
residuals · 0.3

Round 0: F₀ = mean(y).

§ 15Gradient Boosting for classification

For binary classification with y ∈ {0, 1} we use the log-loss (cross-entropy):

L = − [ y · log p + (1−y) · log(1−p) ]

We don't predict probabilities directly; we predict log-odds F(x) and convert with the sigmoid p = σ(F) = 1/(1+e^{−F}). The negative gradient is beautifully simple:

negative gradient −∂L/∂F = y − σ(F) = y − p

That's just the prediction error in probability space. So gradient-boosted classification is identical to regression — fit trees to y − p — but in log-odds space.

Numeric example

Dataset: 4 points, 2 positives, 2 negatives.

i1234
x1234
y0011

Step 0: initialize with log-odds of the empirical positive rate:

F₀ = log(p̄ / (1 − p̄)) = log(0.5 / 0.5) = 0  ⇒  p₀ = σ(0) = 0.5 for all i

Step 1: residuals are r⁽¹⁾_i = y_i − p₀ = y_i − 0.5:

i1234
r⁽¹⁾−0.5−0.5+0.5+0.5

Fit a stump. Best split at x < 2.5: left leaf = −0.5, right leaf = +0.5.

With η = 1.0:

iF₁ = F₀ + h₁p₁ = σ(F₁)
1, 2−0.50.378
3, 4+0.50.622

Step 2: new residuals = y − p₁:

i1234
r⁽²⁾−0.378−0.378+0.378+0.378

Same split direction, smaller leaf values. After many rounds, F_M grows large positive on the right, large negative on the left, and p approaches 0 or 1 as appropriate.

Final prediction: p(x) = σ( F₀ + η · Σ h_m(x) )

The exact same algorithm — fit trees to residuals — works for classification because of the elegant fact that cross-entropy's gradient is just the probability error.

§ 16XGBoost / LightGBM / CatBoost — in a paragraph

Modern gradient boosting libraries are all gradient boosting at heart, with engineering refinements:

LibraryKey innovationBest at
XGBoostSecond-order (Newton) gradient updates · L1/L2 regularization on leaves · efficient histogram splitsTabular data, Kaggle, robust default
LightGBMLeaf-wise growth (vs level-wise) · histogram binning · GOSS sampling · EFB feature bundlingVery large datasets, speed
CatBoostOrdered boosting (prevents target leakage) · native categorical encodingCategorical-heavy data, less tuning

All three use the residual-fitting framework from §14 — they just compute splits faster, regularize cleverly, and handle edge cases the textbook version doesn't.

§ 17Why boosting reduces bias — the math

Each round of boosting solves:

h_m = argmin_{h} Σ_i L( y_i, F_{m−1}(x_i) + h(x_i) )

By Taylor expansion around F_{m−1}:

L( y, F + h ) ≈ L(y, F) + g · h + ½ · H · h²
where g = ∂L/∂F, H = ∂²L/∂F²

Minimizing in h for fixed leaf values gives h* ≈ −g/H. The new tree fits the negative gradient — that's literal gradient descent in function space.

Each iteration takes the loss strictly downward (with small enough η). The training-set bias drops monotonically. If we ran for infinite rounds, training error would hit zero. But at some point, additional rounds start fitting noise — and test error rises. So boosting needs early stopping; bagging doesn't.

the symmetry Bagging is averaging in parameter space (independent models, then mean). Boosting is gradient descent in function space (additive stages). Two completely different operations — explaining why one reduces variance and the other reduces bias.

§ 18Boosting — advantages, disadvantages, when to use

advantages
  • Often the most accurate model on tabular data — dominates Kaggle and tabular benchmarks.
  • Reduces bias — works with weak (high-bias) base learners and turns them strong.
  • Flexible loss functions — squared, logistic, Poisson, Cox, quantile, ranking, custom. Anything differentiable.
  • Built-in feature selection — splits ignore irrelevant features automatically.
  • Strong with small/medium data — needs less data than deep nets for similar accuracy.
  • Modern libraries (XGBoost, LightGBM, CatBoost) are extremely fast and well-engineered.
disadvantages
  • Will overfit if iterations or tree depth are too high — needs early stopping and regularization.
  • Sequential training — cannot trivially parallelize across trees (within-tree parallelism only).
  • Sensitive to noisy labels and outliers — they keep getting re-emphasized (especially AdaBoost).
  • Hyperparameter heavy — learning rate, # trees, depth, regularization, subsample, colsample…
  • Harder to interpret than a single tree; comparable to bagging in opacity.
  • Slower inference than bagging when trees are deeper or more numerous.
use boosting when…
  • You're on tabular data and want the highest possible accuracy.
  • Your base model underfits — bias is your enemy.
  • You can afford a validation set for early stopping.
  • Labels are clean; outliers are rare.
  • You can spend time tuning hyperparameters (or use Bayesian search).
Part IV
Comparison & context

§ 19Bias-variance side by side

BiasVarianceMechanism
Single deep treelowvery highFully grown tree memorizes training set
Single stumpvery highlowOne split can't capture patterns
Bagging (deep trees)lowlowAveraging cancels independent errors
Random Forestlowlower stillDecorrelates trees with random feature subsets
Boosting (stumps)lowmoderateEach stage chips away at residual bias
Boosting (overfit)lowhighToo many rounds memorize noise
Demo 06
Bias-variance — bagging vs boosting on a noisy curve

A small noisy regression problem. The green band is the bagged forest; the orange line is the gradient boost. Toggle to compare how each one handles bias and variance over iterations.

20
0.18
truth
bagging (avg of trees)
gradient boosting
data

§ 20When to use which

prefer bagging / RF when…
  • You have noisy labels or many outliers.
  • You need a quick, robust baseline with minimal tuning.
  • You want OOB error (no need for separate validation).
  • Training must be parallel.
  • Interpretability of variable importance matters more than peak accuracy.
  • You suspect your base model is high-variance.
prefer boosting when…
  • You want maximum accuracy on tabular data and can afford tuning.
  • Labels are clean; data is reasonably well-curated.
  • You can run with early stopping on a validation set.
  • Loss is unusual: quantile, ranking, survival — boosting handles these gracefully.
  • You suspect your base model is high-bias (under-fitting).
  • Competition / production-grade tabular benchmark.

§ 21Who wins? — the empirical verdict

If you only get one answer, here it is:

On clean tabular data with effort: boosting (XGBoost/LightGBM) almost always wins on accuracy.
On noisy/quick-baseline scenarios: random forest wins on robustness and ease of use.

Some numbers from the literature:

  • Kaggle competitions: gradient boosting (XGBoost/LightGBM) wins 60–70% of tabular contests; random forests rarely top leaderboards but make solid stacking components.
  • Grinsztajn et al. (NeurIPS 2022) — across 45 tabular benchmarks, GBDT outperformed neural nets and held a small but consistent edge over random forests on accuracy.
  • Real-world deployment: random forests dominate where ML teams want a model that "just works" with minimal tuning — credit scoring, fraud baselines, healthcare risk models.
a practical heuristic
  • Start with random forest. Get a baseline in 5 lines of code.
  • Move to LightGBM/XGBoost if you need to squeeze more accuracy.
  • Stack them if you really want every last drop — a forest and a boosted model often make complementary errors.

§ 22Reference — at a glance

Property Bagging (RF) Boosting (GBDT)
Combining rulef̂ = (1/B)Σ f̂_bF_M = F₀ + ηΣ h_m
TrainingParallelSequential
Data per learnerBootstrap sampleFull data, weighted (or fit to residuals)
Base learnerDeep tree (low-bias, high-var)Shallow tree / stump (high-bias)
ReducesVarianceBias (+ some var)
Overfit riskLow — asymptotesHigh — needs early stopping
Noise robustnessHighLower (esp. AdaBoost)
Tuning effortLowHigh
Out-of-bag estimateYes (free)No
Best forQuick baselines, noisy dataPeak accuracy on tabular
Famous examplesRandom Forest, Extra TreesAdaBoost, GBM, XGBoost, LightGBM, CatBoost

Companion 06 of 06 · ensemble learning · bagging & bootstrap aggregating · boosting & gradient descent in function space · trade-offs & practical use.