Notebook excerpts
A plain-text scan of every section in this note — the interactive, fully-styled version is in the reader above. Use whichever helps.
01
§ 01 The ensemble idea
A single decision tree is a flawed oracle: grown deep, it memorizes the training set; kept shallow, it misses the pattern. Ensemble learning works around that flaw by combining many imperfect learners into one strong learner. The key insight, often described as a wisdom-of-crowds effect, is that if individual errors are not perfectly correlated, averaging cancels part of them.
02
§ 02 Bias-variance recap
For a model f̂(x) trained on a dataset D, the expected squared error at a point x decomposes as:
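The decomposition itself is not reproduced in this excerpt. The standard form is sketched below, assuming the data follow y = f(x) + ε with zero-mean noise of variance σ_ε². The symbols f, ε, and σ_ε² are assumptions introduced here and do not appear in the excerpt.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat f(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}_D[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat f(x) - \mathbb{E}_D[\hat f(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma_\varepsilon^2}_{\text{irreducible noise}}
```

Bagging mainly attacks the variance term, while boosting mainly attacks the bias term.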
03
§ 04 The bagging procedure
Bagging, short for Bootstrap Aggregating, was introduced by Breiman in 1996. The recipe is three steps:
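The three steps are not reproduced in this excerpt. A minimal sketch of the usual reading (draw bootstrap samples, fit one learner per sample, average the predictions) might look like the following. The one-split regression stump, the function names, and the toy data are all assumptions made for illustration, not the notebook's own code.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a one-split regression stump on a 1-D feature (illustrative base learner)."""
    best = None
    for threshold in np.unique(x)[:-1]:          # skip the max so the right side is non-empty
        left, right = y[x <= threshold], y[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    if best is None:                             # sample held a single distinct x value
        return x[0], y.mean(), y.mean()
    return best[1:]                              # (threshold, left_value, right_value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def bagging_fit(x, y, n_learners=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    learners = []
    for _ in range(n_learners):
        idx = rng.integers(0, n, size=n)             # step 1: bootstrap sample (with replacement)
        learners.append(fit_stump(x[idx], y[idx]))   # step 2: fit one learner on it
    return learners

def bagging_predict(learners, x):
    # step 3: aggregate by averaging the individual predictions
    return np.mean([predict_stump(s, x) for s in learners], axis=0)

# Toy data, made up for this sketch.
x_train = np.arange(1.0, 9.0)
y_train = np.array([1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0])
learners = bagging_fit(x_train, y_train)
print(bagging_predict(learners, np.array([1.0, 8.0])))
```

For classification, step 3 would use a majority vote instead of the mean.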
04
§ 05 Bootstrap sampling — animated
Watch how a bootstrap sample is drawn from an original dataset of 10 points. Some points are picked multiple times; others are never picked (the out-of-bag set, shown faded).
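Without the animation, the same draw can be reproduced in a few lines. This is a sketch only: the function name `bootstrap_indices` and the seed are assumptions, and the indices 0 to 9 stand in for the 10 points.

```python
import numpy as np

def bootstrap_indices(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = rng.integers(0, n, size=n)               # repeats are allowed
    out_of_bag = np.setdiff1d(np.arange(n), in_bag)   # indices never drawn
    return in_bag, out_of_bag

rng = np.random.default_rng(42)   # arbitrary seed, chosen only for repeatability
in_bag, out_of_bag = bootstrap_indices(10, rng)
print("drawn:     ", in_bag)
print("out-of-bag:", out_of_bag)
```

With n = 10, each point is left out of a given sample with probability (1 − 1/10)^10 ≈ 0.35, so roughly a third of the points are expected to land in the out-of-bag set.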
05
§ 06 Bagging for regression — worked example
Suppose we want to predict house price from square footage. We bootstrap 5 samples from a training set of 8 houses and fit a small regression tree on each. For a new house at x = 1800 sq ft, the five trees give:
06
§ 07 Bagging for classification — worked example
For classification we replace averaging with majority vote. Consider spam detection on one email x. We train B = 7 trees on bootstrap samples:
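The individual votes are not included in this excerpt. The sketch below shows the aggregation step on hypothetical votes; the labels and the 5-to-2 split are made up for illustration and are not the walkthrough's figures.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common label among the individual predictions."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical predictions from B = 7 trees for one email.
votes = ["spam", "spam", "ham", "spam", "ham", "spam", "spam"]
print(majority_vote(votes))   # 5 of 7 trees say "spam", so this prints: spam
```

Choosing an odd B, as here, rules out ties in a two-class problem.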
07
§ 08 Random Forest — fixing the correlation problem
Plain bagged trees share most of their structure: every tree gets to pick from all features at every split, so they all latch onto the same dominant predictor. Their errors stay correlated, and the variance reduction stalls early.
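The usual Random Forest remedy, which this excerpt stops short of spelling out, is to let each split consider only a random subset of the features. A minimal sketch of that one ingredient follows. The helper name `candidate_features` and the floor(sqrt(p)) default are assumptions, the latter being a common convention for classification.

```python
import math
import numpy as np

def candidate_features(n_features, rng, max_features=None):
    """Pick the random subset of feature indices a single split may consider."""
    if max_features is None:
        max_features = max(1, math.isqrt(n_features))   # floor(sqrt(p))
    return rng.choice(n_features, size=max_features, replace=False)

rng = np.random.default_rng(0)   # arbitrary seed
for split in range(3):
    # Each split sees a different subset, so a dominant predictor is often unavailable.
    print(f"split {split}: features {candidate_features(16, rng)}")
```

Because the dominant predictor is missing from many splits, the trees diverge in structure and their errors become less correlated.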
08
§ 09 Why bagging reduces variance — the math
Let f̂_b(x) denote the prediction of the b-th bagged learner. The bagged predictor is f̂(x) = (1/B) Σ_b f̂_b(x). Its variance:
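The formula is cut off in this excerpt. Under the common simplifying assumption that every learner has the same variance σ² and every pair of learners has the same correlation ρ (both symbols are assumptions introduced here), the standard result is:

```latex
\operatorname{Var}\big[\hat f(x)\big]
  = \frac{1}{B^2}\Big[B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\Big]
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

The second term vanishes as B grows, but the first does not. Correlation between learners therefore sets a floor of ρσ² on the variance, which is why decorrelating the trees matters.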
09
§ 11 The boosting procedure
Boosting also builds an ensemble, but the philosophy is the opposite of bagging: instead of training independent learners on different views of the data, boosting trains learners sequentially, each one trying to fix the previous one's mistakes.
10
§ 12 AdaBoost — animated
AdaBoost on a tiny 2-D dataset. At each round we (1) fit a stump that minimizes weighted error, (2) compute its weight α, (3) up-weight the misclassified points so the next stump pays them more attention.
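Steps (2) and (3) can be sketched as below, taking the stump's predictions from step (1) as given. The formula α = ½ ln((1 − ε)/ε) is the standard discrete AdaBoost choice and is assumed here, since the excerpt does not state it. The function name and the five toy labels are also made up for illustration.

```python
import numpy as np

def adaboost_round(y_true, y_pred, w):
    """One AdaBoost update for labels in {-1, +1}; returns (alpha, new_weights)."""
    miss = y_true != y_pred
    eps = w[miss].sum() / w.sum()                # weighted error of the stump
    eps = np.clip(eps, 1e-10, 1 - 1e-10)         # guard against log(0) and division by zero
    alpha = 0.5 * np.log((1 - eps) / eps)        # step (2): the stump's weight
    w_new = w * np.exp(-alpha * y_true * y_pred) # step (3): misses grow, hits shrink
    return alpha, w_new / w_new.sum()            # renormalize to sum to 1

# Hypothetical round: the stump gets the third of five equally weighted points wrong.
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, 1, -1, 1])
w = np.full(5, 0.2)
alpha, w_next = adaboost_round(y_true, y_pred, w)
print(alpha, w_next)
```

Here ε = 0.2, so α = ½ ln 4 ≈ 0.693. After renormalizing, the misclassified point carries weight 0.5 and each of the other four carries 0.125.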
11
§ 13 AdaBoost classification — full numeric walkthrough
Let's do every number. Dataset: 5 points, labels in {−1, +1}:
12
§ 14 Gradient Boosting for regression — full numeric walkthrough
Gradient boosting generalizes AdaBoost: instead of re-weighting points, we fit each new learner to the negative gradient of the loss with respect to the current prediction. For squared loss ½(y − F(x))², where F(x) is the current prediction, the negative gradient is exactly the residual y − F(x), which makes the procedure especially intuitive.
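The walkthrough's numbers are not part of this excerpt. A minimal sketch of the residual-fitting loop for squared loss is given below. The one-split stump, the learning rate of 0.1, the function names, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def fit_stump(x, target):
    """Least-squares one-split stump on a 1-D feature (illustrative weak learner)."""
    best = None
    for threshold in np.unique(x)[:-1]:          # skip the max so the right side is non-empty
        left, right = target[x <= threshold], target[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    if best is None:                             # only one distinct x value
        return x[0], target.mean(), target.mean()
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def gradient_boost_fit(x, y, n_rounds=50, learning_rate=0.1):
    base = y.mean()                              # initial constant prediction
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred                      # negative gradient of 0.5 * (y - pred)**2
        stump = fit_stump(x, residual)           # new learner targets the residual
        pred = pred + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, learning_rate, stumps

def gradient_boost_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Toy data, made up for this sketch.
x_train = np.arange(1.0, 9.0)
y_train = np.array([1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1])
model = gradient_boost_fit(x_train, y_train)
mse = np.mean((y_train - gradient_boost_predict(model, x_train)) ** 2)
print("training MSE:", mse)
```

The learning rate shrinks each correction, so the training error falls gradually rather than in one jump.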
13
§ 15 Gradient Boosting for classification
For binary classification with y ∈ {0, 1} we use the log-loss (cross-entropy):
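The formula is cut off in this excerpt. The standard form is sketched below, assuming the model outputs a raw score F(x) that is mapped to a probability p through the sigmoid. F and p are symbols introduced here.

```latex
L(y, p) = -\big[\, y \log p + (1 - y) \log (1 - p) \,\big],
\qquad p = \frac{1}{1 + e^{-F(x)}},
\qquad -\frac{\partial L}{\partial F} = y - p
```

The negative gradient y − p plays the role that the residual plays under squared loss: each new tree is fit to the gap between the label and the current predicted probability.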
14
§ 16 XGBoost / LightGBM / CatBoost — in a paragraph
XGBoost, LightGBM, and CatBoost are all gradient boosting at heart, with engineering refinements:
15
§ 17 Why boosting reduces bias — the math
Minimizing in h for fixed leaf values gives h* ≈ −g/H. The new tree fits the negative gradient: that's literal gradient descent in function space.
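The expression being minimized is not part of this excerpt. Presumably it is a second-order Taylor expansion of the loss around the current prediction F, with g and H the first and second derivatives of the loss with respect to F. That reading is an assumption, sketched as:

```latex
L(y, F + h) \approx L(y, F) + g\,h + \tfrac{1}{2} H h^2,
\qquad g = \frac{\partial L}{\partial F},
\quad H = \frac{\partial^2 L}{\partial F^2}

\frac{\partial}{\partial h}\Big[ g\,h + \tfrac{1}{2} H h^2 \Big] = g + H h = 0
\;\;\Longrightarrow\;\; h^{*} = -\frac{g}{H} \quad (H > 0)
```

For squared loss H = 1, so the step reduces to h* = −g, the plain negative gradient.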
16
§ 19 Bias-variance side by side
A small noisy regression problem. The green band is the bagged forest; the orange line is the gradient-boosted model. Toggle to compare how each one handles bias and variance over iterations.
17
§ 22 Reference — at a glance
Companion 06 of 06 · ensemble learning · bagging & bootstrap aggregating · boosting & gradient descent in function space · trade-offs & practical use.