Regression & Model Selection Metrics, Visualized

Notebook excerpts

§ 09 The three sums of squares
Where classification splits errors into four discrete cells, regression has continuous errors. Predictions f(x) are real numbers, truth values y are real numbers, and the error is the signed gap between them, squared so that positive and negative gaps cannot cancel and the math works out cleanly. Almost every regression metric on this page is built from three sums of squares: the total (SS tot), the explained, and the residual (SS res).
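As a rough illustration of the three sums of squares, the sketch below computes them for a one-feature least-squares line. The data values, the helper names `sums_of_squares` and `fit_line`, and the label `ss_reg` for the explained sum are assumptions made for this example, not taken from the note. For a least-squares fit with an intercept, the total should split into explained plus residual.

```python
def sums_of_squares(y, y_hat):
    """Return (ss_tot, ss_reg, ss_res) for true values y and predictions y_hat."""
    y_bar = sum(y) / len(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)               # truth vs. mean
    ss_reg = sum((fi - y_bar) ** 2 for fi in y_hat)           # prediction vs. mean
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # truth vs. prediction
    return ss_tot, ss_reg, ss_res


def fit_line(x, y):
    """Ordinary least squares for one feature plus an intercept."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


# Made-up illustrative data (not from the note).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)
y_hat = [intercept + slope * xi for xi in x]
ss_tot, ss_reg, ss_res = sums_of_squares(y, y_hat)

print(f"SS_tot={ss_tot:.4f}  SS_reg={ss_reg:.4f}  SS_res={ss_res:.4f}")
print(f"SS_reg + SS_res = {ss_reg + ss_res:.4f}")
```

The identity SS tot = SS reg + SS res relies on the fit being least squares with an intercept; for arbitrary predictions it generally does not hold.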
§ 10 R² — the coefficient of determination
R² collapses the three sums of squares into a single number on a meaningful scale. Its definition is built around a baseline comparison: what would happen if we just predicted ȳ for every input? That baseline's residual sum of squares is exactly SS tot, so the ratio SS res / SS tot tells you what fraction of the baseline's squared error your model still has left over. Subtract it from 1 and you get the fraction you removed.
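The baseline comparison can be sketched in a few lines. The function name `r_squared` and the toy numbers are assumptions for illustration; the formula is the 1 − SS res / SS tot described above.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot, relative to the predict-the-mean baseline."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Made-up illustrative targets.
y = [3.0, 5.0, 7.0, 9.0]

baseline = [sum(y) / len(y)] * len(y)   # predict the mean everywhere
perfect = list(y)                       # predict the truth exactly
reversed_fit = [9.0, 7.0, 5.0, 3.0]     # worse than the baseline

r2_baseline = r_squared(y, baseline)
r2_perfect = r_squared(y, perfect)
r2_reversed = r_squared(y, reversed_fit)

print(r2_baseline, r2_perfect, r2_reversed)
```

The baseline scores 0 and a perfect fit scores 1. Predictions that are worse than the mean leave more error than the baseline had, so the ratio exceeds 1 and R² goes negative.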
§ 11 Adjusted R²
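Plain R² never decreases when a feature is added to a least-squares fit, even a useless one. The fix is to penalize R² by the number of parameters. Adjusted R² rescales the unexplained-variance ratio by degrees of freedom: instead of dividing both SS res and SS tot by m, it divides SS res by m − n − 1 and SS tot by m − 1, where m is the number of samples and n is the number of predictors. Adding a useless feature shrinks m − n − 1 proportionally more than it shrinks SS res, so the unexplained ratio grows and the metric drops.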
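A minimal sketch of the degrees-of-freedom correction follows. The function name and the R² values are assumptions chosen only to show the direction of the effect: a third feature that nudges plain R² up by 0.001 still lowers the adjusted value.

```python
def adjusted_r_squared(r2, m, n):
    """Adjusted R^2 for m samples and n predictors (intercept counted separately).

    Equivalent to 1 - (SS_res / (m - n - 1)) / (SS_tot / (m - 1)).
    """
    if m - n - 1 <= 0:
        raise ValueError("need more samples than parameters (m > n + 1)")
    return 1.0 - (1.0 - r2) * (m - 1) / (m - n - 1)


m = 50                                           # sample count, as in the demo setup
adj_two = adjusted_r_squared(0.800, m, n=2)      # hypothetical R^2 with 2 features
adj_three = adjusted_r_squared(0.801, m, n=3)    # hypothetical R^2 after a noise feature

print(f"n=2: {adj_two:.4f}   n=3: {adj_three:.4f}")
```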
§ 12 Mallows's Cp
Mallows's Cp is the historical ancestor of the model-selection family. It compares a model's residual sum of squares against an estimate of the true noise variance σ̂² , usually taken from a "full" reference model with all candidate features. The diagnostic has a concrete target: a well-calibrated submodel with n predictors has Cp close to n + 1 , its number of fitted coefficients including the intercept. Values much larger than n + 1 indicate that the submodel is missing real structure, so bias is inflating its residuals: it is underfitting.
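One common way to write the statistic is Cp = SS res / σ̂² − m + 2(n + 1), with σ̂² estimated from the full model as its SS res divided by m minus its parameter count. The sketch below assumes that form; the residual sums are invented numbers, and `mallows_cp` is a hypothetical helper name.

```python
def mallows_cp(ss_res, sigma2_hat, m, n):
    """Mallows's Cp for a submodel with n predictors plus an intercept."""
    p = n + 1                       # fitted coefficients, including the intercept
    return ss_res / sigma2_hat - m + 2 * p


m = 50
n_full = 10
ss_res_full = 39.0                                # invented value for the full model
sigma2_hat = ss_res_full / (m - (n_full + 1))     # noise estimate from the full model

cp_full = mallows_cp(ss_res_full, sigma2_hat, m, n_full)
cp_good = mallows_cp(47.5, sigma2_hat, m, n=2)    # invented: small loss vs. full model
cp_under = mallows_cp(120.0, sigma2_hat, m, n=1)  # invented: misses real structure

print(f"full: {cp_full:.1f}  n=2: {cp_good:.1f}  n=1: {cp_under:.1f}")
```

With σ̂² taken from the full model, the full model's own Cp equals its parameter count by construction, so the statistic is only informative for the submodels.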
§ 13 AIC — Akaike Information Criterion
AIC takes a different route. Instead of starting from R² or sums of squares, it starts from the model's maximized likelihood L , which measures how probable the observed data is under the fitted model, and adds a penalty of twice the number of parameters k : AIC = 2k − 2 ln L . The 2 in the penalty isn't an ad-hoc fudge; it comes from estimating the expected Kullback-Leibler divergence between the true and fitted distributions. Lower AIC is better, and the absolute value means nothing on its own: only differences between models fit to the same data matter.
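For a linear model with Gaussian noise, the maximized log-likelihood can be written in terms of SS res, which ties AIC back to the sums of squares. The sketch below assumes that Gaussian setting and assumes k counts the n + 1 coefficients plus the noise variance; other conventions for k shift every model's AIC by the same constant and leave comparisons unchanged. Function names and residual values are illustrative.

```python
import math


def gaussian_log_likelihood(ss_res, m):
    """Maximized Gaussian log-likelihood, using the MLE variance SS_res / m."""
    sigma2 = ss_res / m
    return -0.5 * m * (math.log(2.0 * math.pi) + math.log(sigma2) + 1.0)


def aic(log_lik, k):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2.0 * k - 2.0 * log_lik


m = 50
# Invented residual sums for two nested models on the same data.
ll_two = gaussian_log_likelihood(ss_res=47.5, m=m)     # 2 features
ll_three = gaussian_log_likelihood(ss_res=47.2, m=m)   # 2 features + 1 noise feature

aic_two = aic(ll_two, k=2 + 2)      # 2 slopes + intercept + variance
aic_three = aic(ll_three, k=3 + 2)

print(f"AIC (2 features) = {aic_two:.2f}")
print(f"AIC (3 features) = {aic_three:.2f}")
```

The extra feature raises the likelihood slightly, but not by enough to pay for the additional 2 in the penalty, so the smaller model has the lower AIC in this invented example.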
§ 14 BIC — Bayesian Information Criterion
BIC has the same shape as AIC but a heavier penalty that grows with sample size. Where AIC charges 2 per parameter, BIC charges log(m) , the natural log of the number of samples, which exceeds 2 once m ≥ 8. As m grows, BIC becomes increasingly conservative and prefers simpler models. AIC and BIC often pick the same model on small data and diverge on large data; when they disagree, BIC picks the smaller of the two.
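The difference between the two criteria is entirely in the penalty term, which the sketch below makes explicit. It assumes the usual forms AIC = 2k − 2 ln L and BIC = k ln(m) − 2 ln L; the log-likelihood value and the function names are placeholders for illustration.

```python
import math


def aic(log_lik, k):
    """AIC = 2k - 2 ln L."""
    return 2.0 * k - 2.0 * log_lik


def bic(log_lik, k, m):
    """BIC = k ln(m) - 2 ln L, with m the number of samples."""
    return k * math.log(m) - 2.0 * log_lik


# Per-parameter penalty as the sample size grows (AIC's is always 2).
for m in (5, 8, 50, 1000):
    print(f"m={m:5d}  BIC penalty per parameter = {math.log(m):.3f}")

# With the same likelihood and k, the gap is k * (ln(m) - 2).
log_lik = -70.0          # placeholder value
k, m = 4, 50
gap = bic(log_lik, k, m) - aic(log_lik, k)
print(f"BIC - AIC = {gap:.3f}")
```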
§ 15 Model selection in action
The four metrics above all attack the same problem from slightly different angles. The interactive below puts them on the same chart so you can watch them disagree and agree. The setup: 50 synthetic data points generated from a true model that depends on exactly two features. The slider controls how many features the regression uses — slide right and the model gets features 1, 2, then 3, 4, ..., up to 10. Features 3 through 10 are pure noise.
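The same experiment can be approximated offline. This is a sketch under stated assumptions, not the code behind the interactive: the coefficients (3 and −2), unit noise level, random seed, Gaussian likelihood, and the convention k = n + 2 are all choices made here. Only the 50 points, two true features, and eight noise features come from the setup described above. It uses NumPy's `np.linalg.lstsq` for the fits.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed (assumption)
m, n_max = 50, 10

X = rng.normal(size=(m, n_max))
# Only features 1 and 2 matter; coefficients and noise scale are assumptions.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=m)


def fit_ss_res(X_sub, y):
    """Least-squares fit with an intercept; return the residual sum of squares."""
    A = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


ss_tot = float(((y - y.mean()) ** 2).sum())
sigma2_hat = fit_ss_res(X, y) / (m - (n_max + 1))   # noise estimate from full model

rows = []
for n in range(1, n_max + 1):
    ss_res = fit_ss_res(X[:, :n], y)    # use the first n features
    p = n + 1                           # coefficients including the intercept
    k = p + 1                           # plus the noise variance
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (ss_res / (m - p)) / (ss_tot / (m - 1))
    cp = ss_res / sigma2_hat - m + 2 * p
    log_lik = -0.5 * m * (np.log(2 * np.pi) + np.log(ss_res / m) + 1.0)
    aic = 2 * k - 2 * log_lik
    bic = k * np.log(m) - 2 * log_lik
    rows.append((n, r2, adj_r2, cp, aic, bic))

print(" n     R2  adjR2      Cp     AIC     BIC")
for n, r2, adj_r2, cp, aic, bic in rows:
    print(f"{n:2d} {r2:6.3f} {adj_r2:6.3f} {cp:7.2f} {aic:7.2f} {bic:7.2f}")
```

Plain R² can only rise as features are added, so it is the other columns that are worth watching once the model moves past the second feature. Which model each criterion prefers depends on the random draw.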
§ 16 Regression reference
A pragmatic discipline that consistently outperforms blind allegiance to any single metric: look at several at once, treat large disagreements as signals that something is wrong, and validate the final model on data it never saw during selection.