Regression & model-selection metrics, visualized.

Sums of squares, R², adjusted R², Mallows's Cp, AIC, BIC: the regression side of the CS 229 cheat sheet, explained and explored interactively.

8 metrics, 2 interactive demos. Approx. 12 min read.

Contents

Regression & Model Selection

Sums of squares
R²
Adjusted R²
Mallows's Cp
AIC
BIC
Model selection in action
Regression reference

See also

For classification metrics — confusion matrix, precision, recall, F1, ROC & AUC — see Classification & Loss Functions.

§ 09 The three sums of squares

Where classification splits errors into four discrete cells, regression has continuous errors. Predictions f(x) are real numbers, truth values y are real numbers, and the error is the signed gap between them, squared so that positive and negative misses cannot cancel and the math works out cleanly. Almost every regression metric on this page is built from one of three sums of squares.

Each deviation yᵢ − ȳ decomposes into two pieces: what the model explained, f(xᵢ) − ȳ, and what it missed, yᵢ − f(xᵢ). Squared and summed, these pieces are the building blocks of every regression metric.

Drag the line below to control your model's slope and intercept. Toggle the colored segments to see each "sum of squares" appear literally on the chart — vertical bars whose squared lengths sum to the named quantity.

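[Interactive demo: a draggable fit line f(x) drawn over the data points yᵢ, with the mean ȳ shown as a flat line. Toggles display the residual (yᵢ − f(xᵢ)), explained (f(xᵢ) − ȳ), and total (yᵢ − ȳ) segments. In the default state, slope 0.82 and intercept 1.40 give SS_res = Σ(y − f(x))² = 5.67, SS_reg = Σ(f(x) − ȳ)² = 55.24, SS_tot = Σ(y − ȳ)² = 60.90, and R² = 1 − SS_res/SS_tot = 0.907, so the line explains 91% of the total variance.]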

SS_tot is a fact about the data, not the model: it is the total variation in y around its mean. SS_reg is how much variation the model introduces by tilting away from a flat line at ȳ. SS_res is the part the model couldn't reach, the quantity OLS minimizes. For an OLS fit with an intercept, these three obey an exact identity: SS_tot = SS_reg + SS_res. Drag the line away from OLS and you'll see the identity break, because the cross term 2Σ(f(xᵢ) − ȳ)(yᵢ − f(xᵢ)) is no longer zero.
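The identity can be checked numerically. The following is a minimal sketch, assuming a small made-up dataset and two hypothetical helpers (`sums_of_squares`, `ols_line`) that are not part of the page's demo code. It computes the three sums for the OLS line and for an arbitrary line, then reports how far each is from satisfying SS_tot = SS_reg + SS_res.

```python
from statistics import fmean

def sums_of_squares(xs, ys, slope, intercept):
    """Return (SS_res, SS_reg, SS_tot) for the line f(x) = slope * x + intercept."""
    y_bar = fmean(ys)
    preds = [slope * x + intercept for x in xs]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))   # what the line missed
    ss_reg = sum((p - y_bar) ** 2 for p in preds)           # what the line explained
    ss_tot = sum((y - y_bar) ** 2 for y in ys)              # depends on the data only
    return ss_res, ss_reg, ss_tot

def ols_line(xs, ys):
    """Closed-form simple linear regression with an intercept."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Made-up illustration data (an assumption, not the demo's dataset).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.2, 3.8, 5.1, 5.4, 6.9]

slope, intercept = ols_line(xs, ys)
res, reg, tot = sums_of_squares(xs, ys, slope, intercept)
ols_gap = tot - (reg + res)          # zero up to floating-point rounding

res2, reg2, tot2 = sums_of_squares(xs, ys, 0.5, 2.0)   # an arbitrary non-OLS line
other_gap = tot2 - (reg2 + res2)     # clearly nonzero

print(f"OLS line:     SS_tot - (SS_reg + SS_res) = {ols_gap:.2e}")
print(f"Non-OLS line: SS_tot - (SS_reg + SS_res) = {other_gap:.3f}")
```

Note that SS_tot is the same in both calls, since it never looks at the line.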

§ 10 R²: the coefficient of determination

R² collapses the three sums of squares into a single number on a meaningful scale. Its definition is built around a baseline comparison: what would happen if we just predicted ȳ for every input? That dumb model has a residual sum of squares equal to SS_tot exactly. So the ratio SS_res / SS_tot tells you what fraction of the baseline's error your model still has left over. Subtract from 1 and you get the fraction you removed.

R² = 1 − SS_res / SS_tot

The fraction of variance the model explains.

Merits

Uniform scale across datasets — a 0.8 always means "explained 80% of the variance," and the interpretation is immediately legible to non-specialists.

Works for any model, not just OLS, and the same formula applies on training data, validation data, or test data. The universal headline metric in regression.

Demerits

Fatally flawed for model selection: adding a feature to an OLS fit can never decrease R² on training data, even when the feature is useless and fits pure noise. It always tells you "more is better," which is exactly wrong for generalization.

Can be negative on held-out data — most practitioners are surprised the first time they see it. A model worse than predicting ȳ produces negative R².
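To make the negative case concrete, here is a small sketch with made-up held-out targets. The helper `r_squared` is hypothetical, and it assumes ȳ is the mean of the targets passed in (the held-out mean), which is one common convention.

```python
def r_squared(ys, preds):
    """R² = 1 - SS_res / SS_tot, with ȳ computed from the ys passed in."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

held_out_y = [1.0, 2.0, 3.0]   # made-up held-out targets

baseline = r_squared(held_out_y, [2.0, 2.0, 2.0])   # predicting ȳ everywhere
perfect = r_squared(held_out_y, [1.0, 2.0, 3.0])    # every prediction exact
biased = r_squared(held_out_y, [3.0, 3.0, 3.0])     # worse than the ȳ baseline

print(baseline, perfect, biased)   # 0.0 1.0 -1.5
```

The third model has SS_res = 5 against SS_tot = 2, so it leaves 250% of the baseline's error and R² lands at −1.5.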

§ 11 Adjusted R²

The fix is to penalize R² by the number of parameters. Adjusted R² rescales the unexplained-variance ratio by degrees of freedom: instead of comparing SS_res / m with SS_tot / m, it compares SS_res / (m − n − 1) with SS_tot / (m − 1), where m is the number of samples and n is the number of predictors. Adding a useless feature shrinks the divisor m − n − 1 proportionally faster than it shrinks SS_res, so the unexplained ratio grows and the metric drops.

Adjusted R² = 1 − (1 − R²)(m − 1) / (m − n − 1)

R² with a complexity tax. Drops when added features don't earn their degree of freedom.

Merits

Directly addresses R²'s overfitting blind spot — drops when a new feature doesn't pull its weight, rises when it does. The simplest possible correction.

On the same interpretable scale as R², so it's easy to communicate. Cheap to compute given R² and the model dimensions.

Demerits

Only meaningful for comparing nested linear models on the same dataset. The penalty is ad-hoc, not derived from information theory or Bayesian considerations.

Not bounded below at 0: it goes negative whenever R² < n / (m − 1), which happens when many features together explain little. Provides weaker selection guidance than cross-validation, which measures held-out performance directly.
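A short sketch of the formula in use, with made-up R² values rather than a real fit. The helper name `adjusted_r2` is an assumption. Rearranging the formula shows that one extra feature raises adjusted R² only if it lifts R² by more than (1 − R²) / (m − n − 1), and the example probes both sides of that threshold.

```python
def adjusted_r2(r2, m, n):
    """Adjusted R² for m samples and n predictors (an intercept is assumed)."""
    if m - n - 1 <= 0:
        raise ValueError("need m > n + 1 for the degrees of freedom to be positive")
    return 1 - (1 - r2) * (m - 1) / (m - n - 1)

m = 50
base = adjusted_r2(0.900, m, n=2)          # two useful features

# Threshold gain in R² that a third feature must exceed to be worth it.
threshold = (1 - 0.900) / (m - 2 - 1)      # about 0.0021

weak = adjusted_r2(0.901, m, n=3)          # gain 0.001, below the threshold
strong = adjusted_r2(0.903, m, n=3)        # gain 0.003, above the threshold

print(f"base={base:.4f} weak={weak:.4f} strong={strong:.4f} threshold={threshold:.4f}")
```

Plain R² rises in both cases; adjusted R² falls for the weak feature and rises for the strong one.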

§ 12 Mallows's Cp

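Mallows's Cp is the historical ancestor of the model-selection family. It compares a model's residual sum of squares against an estimate of the true noise variance σ̂², usually taken from a "full" reference model with all candidate features. The diagnostic is precise in the classical scaling, Cp = SS_res / σ̂² − m + 2(n + 1): a well-calibrated submodel has Cp close to n + 1, the number of fitted coefficients. Values much larger indicate that the residuals still contain structure beyond noise, meaning the model is underfitting. The per-sample form used on this page is an increasing linear function of the classical one for fixed m and σ̂², so both rank submodels identically.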

Cp = (SS_res + 2(n + 1)σ̂²) / m

An estimate of test mean squared error, in squared y-units. Unbiased when σ̂² is an unbiased estimate of the noise variance.

Merits

Has a clean interpretable diagnostic: in the classical scaling SS_res / σ̂² − m + 2(n + 1), "Cp ≈ n + 1" gives an immediate calibration check, unlike AIC or BIC whose absolute values aren't meaningful in isolation.

An unbiased estimate of out-of-sample mean squared error under linear regression assumptions. Specifically designed for subset selection in linear models.

Demerits

Requires an estimate of σ², which means fitting a "full" model first. That's expensive and forces you to commit to a candidate feature set in advance.

Assumes linear regression with Gaussian errors. Doesn't generalize to non-linear models the way AIC and BIC do, which limits its usefulness outside its original context.
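Two scalings of Cp circulate: the per-sample form given in this section and the classical form SS_res / σ̂² − m + 2(n + 1), to which the "Cp ≈ n + 1" check applies. The sketch below, with hypothetical function names and made-up inputs, computes both and confirms they differ only by a linear rescaling, so they rank submodels the same way for a fixed m and σ̂².

```python
def cp_per_sample(ss_res, n, m, sigma2_hat):
    """Per-sample form: (SS_res + 2(n + 1) σ̂²) / m, an estimate of test MSE."""
    return (ss_res + 2 * (n + 1) * sigma2_hat) / m

def cp_classical(ss_res, n, m, sigma2_hat):
    """Classical form: SS_res / σ̂² - m + 2(n + 1); about n + 1 when calibrated."""
    return ss_res / sigma2_hat - m + 2 * (n + 1)

m, sigma2_hat = 50, 1.0    # made-up sample size and noise estimate

# A submodel with n = 2 predictors whose SS_res equals its expected value
# under no bias, (m - n - 1) * σ̂² = 47.
calibrated = cp_classical(47.0, 2, m, sigma2_hat)       # equals n + 1 = 3

# A submodel with n = 1 that misses real structure, so SS_res is inflated.
underfit = cp_classical(120.0, 1, m, sigma2_hat)        # far above n + 1 = 2

# The two forms are linked by: classical = m * per_sample / σ̂² - m.
rescaled = m * cp_per_sample(47.0, 2, m, sigma2_hat) / sigma2_hat - m

print(calibrated, underfit, rescaled)
```

In practice σ̂² would come from the full reference model's residuals rather than being fixed by hand.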

§ 13 AIC: Akaike Information Criterion

AIC takes a different route. Instead of starting from R² or sums of squares, it starts from the model's likelihood L — how probable the observed data is under the fitted model — and penalizes by twice the number of parameters. The 2 in the penalty isn't an ad-hoc fudge; it comes from the expected Kullback-Leibler divergence between the true and estimated distributions. Lower AIC is better, and absolute values matter only in comparisons between models on the same data.

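AIC = 2(n + 2) − 2 log(L)

Likelihood-based fit, complexity penalty grounded in information theory. For Gaussian linear regression, the n + 2 parameters are the n slopes, the intercept, and the noise variance.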

Merits

Works for any model with a likelihood — linear regression, logistic regression, GLMs, mixture models, time series. Not restricted to OLS the way Cp and adjusted R² are.

Rigorous theoretical foundations rooted in information theory. Scores every model on a common scale, so non-nested models fit to the same data can be compared directly.

Demerits

Tends to choose slightly larger models than is optimal. It is not "consistent": even with infinite data, it retains a nonzero probability of preferring an overfit model over the true one.

Derivation assumes the true model is not in the candidate set; if it is, BIC has stronger theoretical justification. Requires correct likelihood specification, which can be a strong assumption.
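For linear regression with Gaussian noise, the maximized log-likelihood depends on the data only through SS_res, so AIC can be computed from quantities already on this page. The sketch below assumes that Gaussian setting and uses the maximum-likelihood variance estimate SS_res / m; the function names and SS_res values are made up for illustration.

```python
import math

def gaussian_log_likelihood(ss_res, m):
    """Maximized log-likelihood of a linear model with Gaussian noise.

    Plugs in the maximum-likelihood variance estimate SS_res / m.
    """
    sigma2_mle = ss_res / m
    return -0.5 * m * (math.log(2 * math.pi * sigma2_mle) + 1)

def aic(ss_res, m, n):
    """AIC = 2(n + 2) - 2 log(L): n slopes, an intercept, and the noise variance."""
    return 2 * (n + 2) - 2 * gaussian_log_likelihood(ss_res, m)

m = 50
small = aic(50.0, m, n=2)            # reference model

# A third feature that trims SS_res by 1%: not enough to pay the penalty of 2.
barely_helps = aic(49.5, m, n=3)

# A third feature that trims SS_res by 5%: enough to pay for itself.
really_helps = aic(47.5, m, n=3)

# Break-even: one extra parameter lowers AIC only if SS_res shrinks by more
# than a factor of exp(-2 / m).
break_even_ratio = math.exp(-2 / m)

print(f"{small:.3f} {barely_helps:.3f} {really_helps:.3f} {break_even_ratio:.4f}")
```

Only differences between these values carry meaning; the raw numbers shift with the units of y.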

§ 14 BIC: Bayesian Information Criterion

BIC has the same shape as AIC but a heavier penalty that grows with sample size. Where AIC charges 2 per parameter, BIC charges log(m), which exceeds 2 once m ≥ 8. With many samples, BIC becomes increasingly conservative and prefers simpler models. AIC and BIC often pick the same model on small data and diverge on large data, with BIC consistently picking the smaller of the two.

BIC = log(m) · (n + 2) − 2 log(L)

Stricter penalty than AIC. Asymptotically selects the true model under regularity conditions.

Merits

Consistent: under regularity conditions, BIC asymptotically selects the true model whenever the true model is in the candidate set. This is a stronger guarantee than AIC offers.

Favors parsimony, which often produces more interpretable models. Has a clean Bayesian interpretation as an approximation to the marginal likelihood of the model.

Demerits

Can be too aggressive on small samples, choosing models that underfit. The log(m) penalty grows slowly but is already strict at m = 50, where log(50) ≈ 3.9 per parameter is nearly double AIC's 2.

Assumes the true model is in the candidate set, which is rarely true in practice. When no candidate is correct, AIC's predictive emphasis has stronger justification.
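The disagreement between the two criteria is easy to stage. This sketch assumes Gaussian linear regression (so the log-likelihood follows from SS_res) and uses made-up SS_res values for two nested models; the helper names are hypothetical. The third feature cuts SS_res by 5%, which is enough to clear AIC's penalty but not BIC's at m = 50.

```python
import math

def gaussian_log_likelihood(ss_res, m):
    """Maximized Gaussian log-likelihood, using the variance estimate SS_res / m."""
    return -0.5 * m * (math.log(2 * math.pi * ss_res / m) + 1)

def aic(ss_res, m, n):
    return 2 * (n + 2) - 2 * gaussian_log_likelihood(ss_res, m)

def bic(ss_res, m, n):
    return math.log(m) * (n + 2) - 2 * gaussian_log_likelihood(ss_res, m)

m = 50
# Made-up candidates: (number of predictors, SS_res).
candidates = {"two features": (2, 50.0), "three features": (3, 47.5)}

aic_pick = min(candidates, key=lambda k: aic(candidates[k][1], m, candidates[k][0]))
bic_pick = min(candidates, key=lambda k: bic(candidates[k][1], m, candidates[k][0]))

print("per-parameter penalty:", 2, "vs", round(math.log(m), 3))
print("AIC picks:", aic_pick)
print("BIC picks:", bic_pick)
```

Because both criteria share the same −2 log(L) term, the whole difference comes from the per-parameter charge: 2 for AIC, log(m) for BIC.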

§ 15 Model selection in action

The four penalized metrics above (adjusted R², Mallows's Cp, AIC, BIC) all attack the same problem from slightly different angles. The interactive below puts them on the same chart as R² so you can watch them disagree and agree. The setup: 50 synthetic data points generated from a true model that depends on exactly two features. The slider controls how many features the regression uses: slide right and the model gets features 1, 2, then 3, 4, and so on up to 10. Features 3 through 10 are pure noise.

[Interactive chart: a slider sets the number of features used, and the chart plots R², adjusted R², and rescaled AIC, BIC, and Mallows's Cp against it. With two features used, the fit matches the true model; features beyond this point are pure noise.]

R²: never decreases as features are added, so it is useless for model selection on its own.

Adjusted R²: penalizes each new feature; drops when the added feature isn't worth its degree of freedom.

AIC: lower is better. Penalty 2(n+2), moderate; tends to pick slightly larger models.

BIC: lower is better. Penalty log(m)(n+2), stricter; prefers simpler models as m grows.

Mallows's Cp: in its classical scaling, a well-calibrated model has Cp ≈ n+1. Far above means underfitting; well below can indicate overfit on training data.

The pattern is what you should look for in your own work. R² climbs the whole way, hiding the overfit. Adjusted R² peaks at n = 2 and starts dropping. AIC and BIC bottom out at n = 2 and start rising. Mallows's Cp is at its best at n = 2 (in the classical scaling it sits close to n + 1 there) and grows worse with the wrong number of features. The gap between R² and the penalized metrics is the overfitting signature: when R² says "great" and AIC says "no", trust AIC.
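The experiment can be approximated offline. The sketch below is a standard-library reconstruction under stated assumptions, not the page's demo code: the seed, the coefficients of the two true features, and the unit noise level are all made up, and OLS is solved through the normal equations with a small hand-written Gaussian elimination. σ̂² for Cp is taken from the full ten-feature model.

```python
import math
import random

def solve(a, b):
    """Solve a square linear system by Gaussian elimination with partial pivoting.
    Assumes the system is non-singular, which holds for this random design."""
    size = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * size
    for r in reversed(range(size)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (aug[r][size] - tail) / aug[r][r]
    return x

def ols_ss_res(rows, ys):
    """Fit OLS with an intercept via the normal equations and return SS_res."""
    design = [[1.0] + row for row in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((y - sum(b * v for b, v in zip(beta, d))) ** 2
               for d, y in zip(design, ys))

random.seed(0)                      # arbitrary seed, for repeatability only
m, max_features = 50, 10
features = [[random.gauss(0, 1) for _ in range(max_features)] for _ in range(m)]
# Assumed true model: only features 1 and 2 matter; the rest are pure noise.
ys = [1.0 + 2.0 * f[0] - 1.5 * f[1] + random.gauss(0, 1) for f in features]

y_bar = sum(ys) / m
ss_tot = sum((y - y_bar) ** 2 for y in ys)
sigma2_hat = ols_ss_res(features, ys) / (m - max_features - 1)   # from the full model

results = []
for n in range(1, max_features + 1):
    ss_res = ols_ss_res([f[:n] for f in features], ys)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (m - 1) / (m - n - 1)
    log_l = -0.5 * m * (math.log(2 * math.pi * ss_res / m) + 1)   # Gaussian likelihood
    results.append({
        "n": n, "r2": r2, "adj_r2": adj_r2,
        "aic": 2 * (n + 2) - 2 * log_l,
        "bic": math.log(m) * (n + 2) - 2 * log_l,
        "cp": (ss_res + 2 * (n + 1) * sigma2_hat) / m,
    })

for row in results:
    print(f"n={row['n']:>2}  R2={row['r2']:.3f}  adjR2={row['adj_r2']:.3f}  "
          f"AIC={row['aic']:.1f}  BIC={row['bic']:.1f}  Cp={row['cp']:.3f}")

aic_choice = min(results, key=lambda r: r["aic"])["n"]
bic_choice = min(results, key=lambda r: r["bic"])["n"]
print("AIC picks n =", aic_choice, "| BIC picks n =", bic_choice)
```

Two properties hold for any seed: R² never decreases down the table, and BIC never picks a larger model than AIC at this sample size, since its per-parameter charge is higher. Which n each criterion picks on a given draw depends on the noise.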

§ 16 Regression reference

Metric	Formula	What it measures	When to reach for it
SS_tot	Σ(yᵢ − ȳ)²	Total variation in the outcome.	As the denominator in R²; as a "budget" check on data variability.
SS_reg	Σ(f(xᵢ) − ȳ)²	Variation introduced by the model.	Diagnostic; for OLS equals SS_tot − SS_res and powers R².
SS_res	Σ(yᵢ − f(xᵢ))²	Variation left unexplained.	The quantity OLS minimizes. Track directly during training.
R²	1 − SS_res/SS_tot	Fraction of variance explained.	Reporting a model's headline fit. Beware on held-out data, where it can go negative.
Adjusted R²	1 − (1−R²)(m−1)/(m−n−1)	R² with complexity penalty.	Comparing nested linear models of different sizes.
Mallows's Cp	(SS_res + 2(n+1)σ̂²) / m	Estimated test MSE in squared y-units.	Classical subset selection in linear regression with a reference σ̂².
AIC	2(n+2) − 2 log(L)	Likelihood with moderate complexity penalty.	Non-nested model comparison with predictive emphasis. Works beyond OLS.
BIC	log(m)(n+2) − 2 log(L)	Likelihood with stricter sample-size-aware penalty.	When consistency matters and the true model is plausibly in the candidate set.

A pragmatic discipline that consistently outperforms blind allegiance to any single metric: look at several at once, treat large disagreements as signals that something is wrong, and validate the final model on data it never saw during selection.