Notebook excerpts
§ 01 The confusion matrix
Classification predictions split four ways. The model says yes or no; the truth is yes or no. Every prediction lands in one of four cells, named here as model-truth pairs. A true positive is a yes-yes: the model correctly raised the alarm. A false negative is a no-yes: the model missed something real. A false positive is a yes-no: the model raised a false alarm. A true negative is a no-no: quietly correct, and usually the most common cell in systems where positives are rare.
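The four-way split can be sketched in a few lines. This is a minimal illustration, assuming labels are encoded as 1 for yes and 0 for no; the function name `confusion_counts` and the sample labels are made up for the example.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 1 (yes) and 0 (no)."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1  # model said yes, truth is yes
        elif pred == 1 and truth == 0:
            fp += 1  # model said yes, truth is no (false alarm)
        elif pred == 0 and truth == 1:
            fn += 1  # model said no, truth is yes (miss)
        else:
            tn += 1  # model said no, truth is no
    return tp, fp, fn, tn


# Illustrative data only.
y_true = [1, 0, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 2 1 1 4
```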
§ 02 Accuracy
The most obvious question to ask a classifier is: how often is it right? Accuracy answers exactly that — it's the sum of the two diagonal cells of the confusion matrix divided by every prediction made. The formula is symmetric, treating false positives and false negatives identically. That simplicity is its biggest virtue and its biggest trap.
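As a rough sketch of that definition, assuming the four cell counts are already available (the function name and the counts are illustrative):

```python
def accuracy(tp, fp, fn, tn):
    """Diagonal cells (tp + tn) divided by every prediction made."""
    total = tp + fp + fn + tn
    if total == 0:
        return float("nan")  # no predictions, so accuracy is undefined
    return (tp + tn) / total


acc = accuracy(tp=2, fp=1, fn=1, tn=4)  # 6 correct out of 8
print(acc)  # 0.75

# Symmetry: swapping the false-positive and false-negative counts changes nothing.
print(accuracy(2, 3, 1, 4) == accuracy(2, 1, 3, 4))  # True
```

The symmetry check is the trap in miniature: a model that trades misses for false alarms one-for-one looks identical under this metric.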
§ 03 Precision
Once accuracy fails on imbalanced data, you need a metric that focuses only on the predictions you actually care about: the positive ones. Precision asks the question every practitioner eventually faces: when this model says yes, how often is it right? It is TP / (TP + FP). The denominator is everything the model labeled positive, so true negatives never enter the calculation, however many of them sit quietly in the data.
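A minimal sketch of precision from cell counts. The function name and the numbers are illustrative, and returning NaN when the model never says yes is one possible convention, not the only one.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive: TP / (TP + FP)."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        return float("nan")  # the model never said yes, so precision is undefined
    return tp / predicted_positive


# True negatives are not even an argument: 4 of them or 4,000,000 give the same value.
p = precision(tp=2, fp=1)
print(round(p, 3))  # 0.667
```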
§ 04 Recall (Sensitivity)
Recall flips the question. Where precision asks about the model's predictions, recall asks about the truth: of all the actual positives out there, how many did the model catch? Its denominator is fixed by reality — it depends on how many positives exist, not on how many alarms the model raised. The medical literature calls this same quantity sensitivity; the machine learning literature uses recall.
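The contrast between the two denominators shows up most clearly at an extreme. The sketch below assumes a hypothetical set of 100 cases with 3 real positives and a model that says yes to every case; all names and counts are invented for illustration.

```python
def recall(tp, fn):
    """Fraction of actual positives the model caught: TP / (TP + FN)."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else float("nan")


def precision(tp, fp):
    """Fraction of predicted positives that are real: TP / (TP + FP)."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else float("nan")


# Hypothetical extreme: the model flags all 100 cases, 3 of which are real positives.
tp, fn, fp = 3, 0, 97
r = recall(tp, fn)      # every real positive is caught
p = precision(tp, fp)   # but almost every alarm is false
print(r, p)  # 1.0 0.03
```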
§ 05 Specificity
Specificity is recall for the negative class. It asks: of all the real negatives, how many did the model correctly leave alone? In medical screening this metric matters enormously: high specificity means the test isn't sending healthy patients for biopsies. It is TN / (TN + FP), the fraction of actual negatives that the model correctly identifies, also called the true negative rate.
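A short sketch, assuming a hypothetical screening run over 1,000 healthy patients (the counts and the function name are illustrative). It also shows the false positive rate, which is simply one minus specificity.

```python
def specificity(tn, fp):
    """Fraction of actual negatives correctly left alone: TN / (TN + FP)."""
    actual_negative = tn + fp
    return tn / actual_negative if actual_negative else float("nan")


# Hypothetical screening of 1,000 healthy patients: 50 are wrongly flagged.
spec = specificity(tn=950, fp=50)
false_positive_rate = 1 - spec
print(round(spec, 2), round(false_positive_rate, 2))  # 0.95 0.05
```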
§ 06 F1 score
Precision and recall pull in opposite directions, and neither alone tells the whole story. F1 is the referee: one number that is high only when both are high. It is the harmonic mean of the two, 2PR / (P + R), which has a useful property: it is always closer to the smaller of the two inputs. Average a precision of 0.9 with a recall of 0.1 the ordinary way and you get 0.5; the harmonic mean of the same two numbers is 0.18, which is much closer to the truth about what the model can actually do.
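The arithmetic in that comparison can be checked directly. This is a small sketch; the function name `f1` and the convention of returning 0.0 when both inputs are zero are choices made for the example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return 2 * precision * recall / (precision + recall)


arithmetic = (0.9 + 0.1) / 2
harmonic = f1(0.9, 0.1)
print(round(arithmetic, 2), round(harmonic, 2))  # 0.5 0.18
```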
§ 07 ROC and AUC
Every metric above depends on a chosen threshold, usually 0.5, that turns the model's continuous score into a yes-or-no decision. But the threshold is itself a hyperparameter; changing it produces a different confusion matrix and different metric values. The ROC curve sweeps the threshold across every possible value and plots the result. Each point is a (false positive rate, true positive rate) pair at one threshold. The area under the curve, AUC, summarizes the whole sweep in a single number on [0, 1], where 0.5 corresponds to random ranking and 1 to perfect separation of the two classes.
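The sweep can be sketched in plain Python. This is an illustrative implementation, not a library routine: the function names, the rule "predict yes when score >= threshold", and the four sample scores are all assumptions, and both classes must be present in `y_true`.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, one per threshold, from strictest to loosest."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Illustrative scores: one negative (0.4) outranks one positive (0.35).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
print(points)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(points))  # 0.75
```

Recomputing the counts at every threshold is quadratic in the number of samples; that is fine for a sketch but not how one would do it on large data.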
§ 08 Classification reference
One practical rule. Never report a single number alone — pair precision with recall, or sensitivity with specificity, or accuracy with a class-balance figure. The single-number temptation is exactly what produces 99%-accurate cancer detectors that miss every cancer.
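The failure mode is easy to reproduce with made-up numbers. The sketch below assumes 1,000 hypothetical patients, 10 of whom have cancer, and a "model" that always answers no.

```python
# Hypothetical cohort: 1,000 patients, 10 with cancer; the model always says "no".
tp, fn = 0, 10    # every real case is missed
fp, tn = 0, 990   # every healthy patient is correctly left alone

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)
print(accuracy, recall)  # 0.99 0.0
```

Reported alone, the 0.99 looks excellent; paired with a recall of 0.0 it is plainly useless.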
§ 09 The three sums of squares
Where classification splits errors into four discrete cells, regression has continuous errors. Predictions f(x) are real numbers, truth values y are real numbers, and the error is the signed gap y − f(x), squared so that positive and negative gaps cannot cancel and the math works out cleanly. Almost every regression metric on this page is built from one of three sums of squares.
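This excerpt does not list the three sums, so the sketch below assumes the conventional trio: total (SS_tot), residual (SS_res), and explained or regression (SS_reg). The data, the helper `fit_line`, and the variable names are illustrative. For a least-squares fit with an intercept, the total splits exactly into residual plus explained.

```python
def fit_line(x, y):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Illustrative data only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]
intercept, slope = fit_line(x, y)
y_hat = [intercept + slope * xi for xi in x]
y_bar = sum(y) / len(y)

ss_tot = sum((yi - y_bar) ** 2 for yi in y)                # spread of the truth around its mean
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # what the model leaves unexplained
ss_reg = sum((fi - y_bar) ** 2 for fi in y_hat)            # what the model explains

print(round(ss_tot, 6), round(ss_res, 6), round(ss_reg, 6))  # 10.0 3.6 6.4
```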
§ 10 R² — the coefficient of determination
R² collapses the three sums of squares into a single number on a meaningful scale. Its definition is built around a baseline comparison: what would happen if we just predicted ȳ for every input? That dumb model has a residual sum of squares equal to SS_tot exactly. So the ratio SS_res / SS_tot tells you what fraction of the baseline's error your model still has left over. Subtract from 1 and you get the fraction you removed: R² = 1 − SS_res / SS_tot.
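The baseline comparison can be made concrete with three sets of predictions for the same targets. The data and names are illustrative; the `fitted` values are the least-squares line 0.6 + 0.8x evaluated at x = 1..5.

```python
def r_squared(y, y_hat):
    """1 - SS_res / SS_tot, with the mean-only model as the baseline."""
    y_bar = sum(y) / len(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1 - ss_res / ss_tot


y = [1.0, 3.0, 2.0, 5.0, 4.0]
baseline = [sum(y) / len(y)] * len(y)  # predict the mean everywhere
fitted = [1.4, 2.2, 3.0, 3.8, 4.6]     # least-squares line on x = 1..5
worse = [5.0, 4.0, 3.0, 2.0, 1.0]      # a deliberately bad model

print(r_squared(y, baseline))           # 0.0
print(round(r_squared(y, fitted), 2))   # 0.64
print(round(r_squared(y, worse), 2))    # -2.6
```

The last line is a reminder that R² is not bounded below by zero: a model that does worse than predicting the mean gets a negative score.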
§ 11 Adjusted R²
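Plain R² never decreases when a feature is added to a least-squares fit, even a useless one. The fix is to penalize R² by the number of parameters. Adjusted R² rescales the unexplained-variance ratio by degrees of freedom: instead of dividing both SS_res and SS_tot by m, it divides SS_res by m − n − 1 and SS_tot by m − 1, where m is the number of samples and n is the number of predictors. Adding a useless feature shrinks the divisor m − n − 1 proportionally more than it shrinks SS_res, so the unexplained ratio grows and the metric drops.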
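A sketch of the degrees-of-freedom correction, using invented sums of squares: m = 50 samples, and a third feature that lowers SS_res only from 20.0 to 19.9. The function names are illustrative.

```python
def r_squared(ss_res, ss_tot):
    return 1 - ss_res / ss_tot


def adjusted_r_squared(ss_res, ss_tot, m, n):
    """m samples, n predictors; each sum of squares is divided by its degrees of freedom."""
    if m - n - 1 <= 0:
        raise ValueError("need more samples than fitted coefficients")
    return 1 - (ss_res / (m - n - 1)) / (ss_tot / (m - 1))


m, ss_tot = 50, 100.0  # hypothetical values
two = {"n": 2, "ss_res": 20.0}
three = {"n": 3, "ss_res": 19.9}  # the extra feature barely helps

for model in (two, three):
    print(model["n"],
          round(r_squared(model["ss_res"], ss_tot), 4),
          round(adjusted_r_squared(model["ss_res"], ss_tot, m, model["n"]), 4))
# 2 0.8 0.7915
# 3 0.801 0.788
```

Plain R² rewards the extra feature; the adjusted version does not.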
§ 12 Mallows's Cp
Mallows's Cp is the historical ancestor of the model-selection family. It compares a model's residual sum of squares against an estimate of the true noise variance σ̂², usually taken from a "full" reference model with all candidate features. The diagnostic is precise: a well-specified submodel has Cp close to n + 1, its number of fitted coefficients including the intercept. Values much larger indicate that the submodel is missing real structure that the full model captures: it is underfitting.
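The excerpt does not spell out the formula, so this sketch assumes the common textbook form Cp = SS_res / σ̂² − m + 2(n + 1), with m samples and n predictors. All sums of squares below are invented for illustration.

```python
def mallows_cp(ss_res, sigma2_hat, m, n):
    """Assumed textbook form: SS_res / sigma2_hat - m + 2 * (n + 1)."""
    return ss_res / sigma2_hat - m + 2 * (n + 1)


m = 50
n_full, ss_res_full = 10, 39.0                 # hypothetical full model
sigma2_hat = ss_res_full / (m - n_full - 1)    # noise estimate from the full model: 1.0

print(mallows_cp(47.0, sigma2_hat, m, n=2))    # 3.0, equal to n + 1: little bias
print(mallows_cp(120.0, sigma2_hat, m, n=1))   # 74.0, far above n + 1 = 2: underfitting
print(mallows_cp(ss_res_full, sigma2_hat, m, n_full))  # 11.0, the full model always scores n + 1
```

The last line shows why the full model cannot be judged by its own Cp: when σ̂² comes from that same model, the value is n + 1 by construction.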
§ 13 AIC — Akaike Information Criterion
AIC takes a different route. Instead of starting from R² or sums of squares, it starts from the model's maximized likelihood L, how probable the observed data is under the fitted model, and penalizes by twice the number of parameters k: AIC = 2k − 2 ln L. The 2 in the penalty isn't an ad-hoc fudge; it falls out of an approximation to the expected Kullback-Leibler divergence between the true distribution and the fitted one. Lower AIC is better, and the absolute value carries no meaning on its own: only differences between models fit to the same data matter.
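For a least-squares regression the likelihood can be written in terms of SS_res. The sketch below assumes Gaussian errors with the variance set to its maximum-likelihood estimate SS_res / m, and counts k as slopes plus intercept plus noise variance; both are assumptions of this example rather than details from the note, and the numbers are invented.

```python
import math


def gaussian_log_likelihood(ss_res, m):
    """Maximized log-likelihood under Gaussian errors with variance SS_res / m."""
    sigma2 = ss_res / m
    return -0.5 * m * (math.log(2 * math.pi * sigma2) + 1)


def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln L. Lower is better."""
    return 2 * k - 2 * log_likelihood


m = 50
# Hypothetical fits: model B adds one feature and lowers SS_res only slightly.
aic_a = aic(gaussian_log_likelihood(47.0, m), k=4)  # 2 slopes + intercept + variance
aic_b = aic(gaussian_log_likelihood(46.5, m), k=5)  # 3 slopes + intercept + variance

print(aic_b - aic_a > 0)  # True: the small gain in fit does not pay for the extra parameter
```

Only the difference `aic_b - aic_a` is interpreted here, in line with the point that absolute AIC values are not meaningful on their own.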
§ 14 BIC — Bayesian Information Criterion
BIC has the same shape as AIC but a heavier penalty that grows with sample size. Where AIC charges 2 per parameter, BIC charges log(m) (natural log), which exceeds 2 once m reaches 8. With many samples, BIC becomes increasingly conservative and prefers simpler models. AIC and BIC often pick the same model on small data and diverge on large data; when they differ, BIC picks the smaller model.
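A disagreement between the two criteria is easy to construct. The sketch below uses an invented pair of nested models on m = 1000 samples, where the larger model spends one extra parameter to raise the log-likelihood by 2.0; the function names and all values are illustrative.

```python
import math


def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, m):
    """Same shape as AIC, but each parameter costs log(m) instead of 2."""
    return k * math.log(m) - 2 * log_likelihood


m = 1000
small = {"log_likelihood": -1500.0, "k": 3}  # hypothetical
large = {"log_likelihood": -1498.0, "k": 4}  # hypothetical

delta_aic = aic(large["log_likelihood"], large["k"]) - aic(small["log_likelihood"], small["k"])
delta_bic = (bic(large["log_likelihood"], large["k"], m)
             - bic(small["log_likelihood"], small["k"], m))

print(delta_aic)             # -2.0: AIC prefers the larger model
print(round(delta_bic, 2))   # 2.91: BIC prefers the smaller one
print(math.log(7) < 2 < math.log(8))  # True: BIC's penalty overtakes AIC's at m = 8
```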
§ 15 Model selection in action
The four metrics above all attack the same problem from slightly different angles. The interactive below puts them on the same chart so you can watch them disagree and agree. The setup: 50 synthetic data points generated from a true model that depends on exactly two features. The slider controls how many features the regression uses — slide right and the model gets features 1, 2, then 3, 4, ..., up to 10. Features 3 through 10 are pure noise.
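The same experiment can be approximated offline. This is a rough sketch under stated assumptions, not the note's actual generator: the coefficients (3 and −2), unit noise, standard-normal features, random seed, Gaussian likelihood, and the textbook Cp form are all choices made for this example. Ordinary least squares is solved through the normal equations in plain Python to keep the block dependency-free.

```python
import math
import random


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ols_ss_res(rows, y, k):
    """Residual sum of squares of OLS with an intercept on the first k features."""
    design = [[1.0] + row[:k] for row in rows]
    p = k + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2
               for d, yi in zip(design, y))


random.seed(0)
m, n_max = 50, 10
rows = [[random.gauss(0, 1) for _ in range(n_max)] for _ in range(m)]
# Assumed true model: only features 1 and 2 matter; the coefficients are made up.
y = [3.0 * r[0] - 2.0 * r[1] + random.gauss(0, 1) for r in rows]

y_bar = sum(y) / m
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
sigma2_full = ols_ss_res(rows, y, n_max) / (m - n_max - 1)  # noise estimate for Cp

results = []
for k in range(1, n_max + 1):
    ss_res = ols_ss_res(rows, y, k)
    log_lik = -0.5 * m * (math.log(2 * math.pi * ss_res / m) + 1)
    params = k + 2  # k slopes, intercept, noise variance
    results.append({
        "k": k,
        "ss_res": ss_res,
        "adj_r2": 1 - (ss_res / (m - k - 1)) / (ss_tot / (m - 1)),
        "cp": ss_res / sigma2_full - m + 2 * (k + 1),
        "aic": 2 * params - 2 * log_lik,
        "bic": params * math.log(m) - 2 * log_lik,
    })

for r in results:
    print(r["k"], round(r["adj_r2"], 3), round(r["cp"], 1),
          round(r["aic"], 1), round(r["bic"], 1))
```

Each printed row is one slider position. SS_res can only fall as k grows, so any preference for a small model has to come from the penalties.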
§ 16 Regression reference
A pragmatic discipline that consistently outperforms blind allegiance to any single metric: look at several at once, treat large disagreements as signals that something is wrong, and validate the final model on data it never saw during selection.