CS 229 · Machine Learning · Companion notes

Classification & regression metrics, visualized.

A working tour of how every metric on the CS 229 tips-and-tricks cheat sheet is built, what motivated it, what it does well, and where it breaks down.

13 metrics, 5 interactive demos · Approx. 25 min read
Part I
Classification
Discrete outcomes, confusion matrices, threshold sweeps. The four-way split underneath every binary classifier.

§ 01 The confusion matrix

Classification predictions split four ways. The model says yes or no; the truth is yes or no. Every prediction lands in one of four cells. A true positive is a yes-yes — the model correctly raised the alarm. A false negative is a no-yes — the model missed something real. A false positive is a yes-no — the model raised a false alarm. A true negative is a no-no — quietly correct, the most common cell in most real systems.

Every classification metric on this page is a different way of dividing these four numbers. Choosing a metric means choosing what kind of mistake you care about.

Drag the four sliders below to imagine a cancer-screening study on 1,000 patients. About 10% truly have the disease. The cells start at a reasonable model — high recall, moderate precision — and you can dial each cell up or down to see how the downstream metrics react.

Demo 01 · Confusion matrix explorer
Starting cells: TP = 85 (caught), FN = 15 (missed), FP = 40 (false alarms), TN = 860 (correct rejects).
Readouts: Accuracy 94.5% = (TP+TN)/all · Precision 0.68 = TP/(TP+FP) · Recall 0.85 = TP/(TP+FN) · Specificity 0.96 = TN/(TN+FP) · F1 0.76 = 2PR/(P+R).
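As a minimal sketch, the five readouts of the confusion matrix explorer can be recomputed from the four cells. The function name `confusion_metrics` is hypothetical; the counts are the demo's starting values (TP = 85, FN = 15, FP = 40, TN = 860).

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return the five headline metrics for one binary confusion matrix.

    Assumes tp + fp > 0 and tp + fn > 0; otherwise precision or recall
    is undefined and this sketch raises ZeroDivisionError.
    """
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Starting cells of the cancer-screening demo (1,000 patients, 10% prevalence).
metrics = confusion_metrics(tp=85, fn=15, fp=40, tn=860)
for name, value in metrics.items():
    print(f"{name:12s}{value:.3f}")
```

Changing a single cell and rerunning shows which metrics share that cell: TN, for example, moves accuracy and specificity but leaves precision, recall, and F1 untouched.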

§ 02 Accuracy

The most obvious question to ask a classifier is: how often is it right? Accuracy answers exactly that — it's the sum of the two diagonal cells of the confusion matrix divided by every prediction made. The formula is symmetric, treating false positives and false negatives identically. That simplicity is its biggest virtue and its biggest trap.

Accuracy = (TP + TN) / (TP + FP + FN + TN)
Fraction of predictions that are correct.
Merits

The most intuitive metric available. Reads like a school grade — a single number on a clean [0, 1] scale that requires no statistical background to understand.

Defined identically for any number of classes, not just binary problems. Works as a fair summary when classes are roughly balanced and different errors carry similar costs.

Demerits

Lies fluently on imbalanced data. If 1% of patients have a disease, a model that predicts "no" for everyone scores 99% — useless, but glowingly accurate by this metric.

Cannot distinguish a model that catches every positive from one that catches none, as long as their total error rates match. Hides which kind of mistake the model makes.
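The imbalance trap is easy to reproduce. The sketch below assumes a hypothetical set of 1,000 patients with 1% prevalence and a "model" that answers no for everyone; the labels are made up for illustration.

```python
# Hypothetical screening set: 1,000 patients, 10 of whom have the disease.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a "model" that always answers no

# Accuracy: fraction of predictions that match the truth.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: fraction of real positives that were caught.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(f"accuracy={accuracy:.2f}  recall={recall:.2f}")
```

The accuracy looks excellent while recall shows that not a single patient was found, which is why accuracy should be read next to a class-balance figure.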

§ 03 Precision

Once accuracy fails on imbalanced data, you need a metric that focuses only on the predictions you actually care about — the positive ones. Precision asks the question every practitioner eventually faces: when this model says yes, how often is it right? The denominator is everything the model labeled positive, so the metric is unaffected by however many negatives sit quietly in the data.

Precision = TP / (TP + FP)
Of the positive predictions, what fraction were correct.
Merits

Directly answers the question that matters whenever false alarms are expensive — spam filters, content moderation, search results, ad targeting.

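Ignores true negatives entirely, so a large pile of easy, correctly rejected negatives cannot inflate it the way it inflates accuracy. It is not immune to imbalance, though: at a fixed false-positive rate, more negatives mean more false positives and therefore lower precision.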

Demerits

Completely ignores false negatives. A model that flags only its single most confident case can achieve perfect precision while being catastrophically incomplete.

Sees only one column of the confusion matrix, so it must always be paired with recall to give a meaningful picture of a model's behavior.

§ 04 Recall (Sensitivity)

Recall flips the question. Where precision asks about the model's predictions, recall asks about the truth: of all the actual positives out there, how many did the model catch? Its denominator is fixed by reality — it depends on how many positives exist, not on how many alarms the model raised. The medical literature calls this same quantity sensitivity; the machine learning literature uses recall.

Recall = TP / (TP + FN)
Also: Sensitivity, True Positive Rate. Of the real positives, what fraction were caught.
Merits

The right metric whenever missing a positive is costly — undetected cancer, undetected fraud, undetected threats. Aligns with what high-stakes systems actually optimize.

Independent of class prevalence: its denominator counts only the actual positives, which doesn't move when negatives proliferate.

Demerits

Ignores false positives entirely. A model that simply predicts "yes" for everyone has perfect recall — and is useless.

Like precision, it sees only one slice of the confusion matrix: the actual positives, TP and FN, where precision sees the predicted positives, TP and FP. Must be evaluated alongside something that constrains false alarms: precision, specificity, or a fixed false-positive budget.
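The two degenerate classifiers described under precision and recall can be put side by side. This is a sketch on a hypothetical dataset of 1,000 cases with 100 real positives; `precision_recall` is an illustrative helper, not a library function.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts (assumes non-zero denominators)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Model A flags only its single most confident case, and that case is real.
p_a, r_a = precision_recall(tp=1, fp=0, fn=99)

# Model B flags every one of the 1,000 cases.
p_b, r_b = precision_recall(tp=100, fp=900, fn=0)

print(f"A: precision={p_a:.2f} recall={r_a:.2f}")
print(f"B: precision={p_b:.2f} recall={r_b:.2f}")
```

Each model is perfect on one metric and close to useless on the other, which is the practical reason the two are reported as a pair.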

§ 05 Specificity

Specificity is recall for the negative class. It asks: of all the real negatives, how many did the model correctly leave alone? In medical screening this metric matters enormously, because high specificity means the test isn't sending healthy patients for biopsies. It's the rate at which actual negatives are correctly identified.

Specificity = TN / (TN + FP)
Also: True Negative Rate. Of the real negatives, what fraction were correctly identified.
Merits

Provides the counterpart to sensitivity and is essential for fully characterizing a binary classifier. Sensitivity and specificity together describe the two error rates in their natural form.

Central to diagnostic medicine and to ROC analysis, where 1 minus specificity is the false positive rate plotted on the horizontal axis.

Demerits

Easy to misread on heavily imbalanced data. When 99% of cases are negative, a specificity of 99% still produces about as many false alarms as there are real positives, so a high value is the floor, not the ceiling.

Doesn't appear in F1 or in precision-recall curves, which leads practitioners working on imbalanced problems to forget about it entirely.
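A quick calculation shows why a high specificity can still hide a flood of false alarms. The counts below are hypothetical: 10,000 cases at 1% prevalence, a recall of 0.90, and 1% of the negatives wrongly flagged.

```python
positives, negatives = 100, 9_900  # hypothetical: 1% prevalence

tp = 90                  # recall of 0.90 on the positives
fn = positives - tp
fp = 99                  # 1% of the 9,900 negatives wrongly flagged
tn = negatives - fp

specificity = tn / (tn + fp)
precision = tp / (tp + fp)

print(f"specificity={specificity:.3f}  precision={precision:.3f}")
```

With these assumed counts, specificity is 0.99 yet more than half of the alarms are false, which is the pattern that precision or a precision-recall curve exposes and specificity alone does not.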

§ 06 F1 score

Precision and recall pull in opposite directions, and neither one alone tells the whole story. F1 is the referee: one number that is high only when both are high. It is the harmonic mean of the two, which has a useful property: whenever the inputs differ, it sits closer to the smaller one. Average a precision of 0.9 with a recall of 0.1 the ordinary way and you get 0.5; the harmonic mean of the same two numbers is 0.18, which is much closer to the truth about what the model can actually do.

F1 = 2 · P · R / (P + R) = 2·TP / (2·TP + FP + FN)
Harmonic mean of precision and recall. Notice TN does not appear.
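The two forms of the F1 formula can be checked against each other numerically. This is a small sketch; the helper names are hypothetical, and the counts TP = 60, FN = 40, FP = 30 are the starting values of Demo 02.

```python
def f1_from_pr(p, r):
    """F1 as the harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


def f1_from_counts(tp, fp, fn):
    """The same quantity written directly in confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)


# Lopsided inputs: the harmonic mean stays near the weaker one.
lopsided = f1_from_pr(0.9, 0.1)
arithmetic = (0.9 + 0.1) / 2

# Both forms agree on the same confusion matrix.
tp, fn, fp = 60, 40, 30
via_counts = f1_from_counts(tp, fp, fn)
via_pr = f1_from_pr(tp / (tp + fp), tp / (tp + fn))

print(f"harmonic={lopsided:.2f}  arithmetic={arithmetic:.2f}")
print(f"F1 via counts={via_counts:.4f}  via P and R={via_pr:.4f}")
```

The count form makes it explicit that TN never enters the calculation.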

The interactive below makes the trade-off visible. The contour plot on the right shows F1 across every possible (precision, recall) pair — darker regions are higher F1. Notice the dark band hugs the diagonal where precision equals recall. Move perpendicular to that diagonal (push one up while pushing the other down) and F1 drops fast.

Demo 02 · F1 score and the harmonic mean
Starting cells: TP = 60, FN = 40, FP = 30.
Readouts: Precision 0.67 = TP/(TP+FP) · Recall 0.60 = TP/(TP+FN) · F1 (harmonic mean) 0.63 = 2PR/(P+R) · Arithmetic mean (for comparison) 0.63 = (P+R)/2.
Contour plot: F1 across the P–R plane (horizontal axis: recall, 0 to 1).
Merits

Combines precision and recall into one number that is hard to game by sacrificing one for the other. The harmonic mean is dragged toward the weaker of the two, so a high F1 requires both to be high.

Critically, F1 excludes the true-negative count entirely. Adding any number of correctly rejected negatives does not change F1, which is why it dominates benchmarks on imbalanced data.

Demerits

Weights precision and recall equally, which is rarely what the real world demands. Medical screening cares more about recall; spam filtering more about precision. For those, use F-beta with β ≠ 1.

Less interpretable than its components. F1 = 0.6 might mean P = R = 0.6, or P = 0.99 with R = 0.43 — very different models that the single number cannot distinguish.
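For the unequal-cost cases mentioned above, F-beta generalizes F1 with the standard formula Fβ = (1 + β²)·P·R / (β²·P + R). The sketch below uses a hypothetical helper `f_beta` and reuses the two models from the interpretability example (P = R = 0.6 versus P = 0.99, R = 0.43).

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two very different models with nearly the same F1.
balanced = f_beta(0.60, 0.60)
lopsided = f_beta(0.99, 0.43)

# Reweighting separates them: F2 punishes the low recall,
# F0.5 rewards the high precision.
recall_weighted = f_beta(0.99, 0.43, beta=2.0)
precision_weighted = f_beta(0.99, 0.43, beta=0.5)

print(f"F1 balanced={balanced:.3f}  F1 lopsided={lopsided:.3f}")
print(f"F2 lopsided={recall_weighted:.3f}  F0.5 lopsided={precision_weighted:.3f}")
```

Choosing β is a statement about costs: β = 2 is a common choice when a missed positive hurts more than a false alarm, β = 0.5 when the reverse holds.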

§ 07 ROC and AUC

Every metric above depends on a chosen threshold — usually 0.5 — that turns the model's continuous score into a yes-or-no decision. But the threshold is itself a hyperparameter; changing it produces a different confusion matrix and different metric values. The ROC curve sweeps the threshold across every possible value and plots the result. Each point is a (false positive rate, true positive rate) pair at one threshold. The area under the curve, AUC, summarizes the whole sweep in a single number on [0, 1].

ROC: (FPR, TPR) for every threshold · AUC = ∫₀¹ TPR d(FPR)
FPR = FP/(FP+TN), TPR = TP/(TP+FN). AUC = probability that a random positive scores higher than a random negative.

The demo below shows two overlapping score distributions — the model's outputs on negative cases in red, on positive cases in blue. The vertical line is the threshold. Above the threshold, the model says yes; below it, no. As you drag the threshold, watch the (FPR, TPR) point trace out the ROC curve. The further apart the two distributions are, the better the classifier — try the separation slider to see what that looks like.

Demo 03 · ROC curve and the threshold sweep
Controls: Threshold (starts at 0.50; drag to scan the curve) · Class separation (starts at 0.40; distance between distribution centers).
Panels: Score distributions (negatives, positives, threshold line) · ROC curve.
Readouts: TPR (Recall) = TP/(TP+FN) · FPR = FP/(FP+TN) · Precision = TP/(TP+FP) · AUC = area under curve.
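The threshold sweep can be sketched in a few lines. The scores and labels below are invented for illustration, and the function names are hypothetical. The example computes AUC twice, once as the area under the swept curve and once as the fraction of (positive, negative) pairs ranked correctly, to show that the two definitions agree.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs, one per distinct threshold, strictest first."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_pairwise(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical model scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
area = auc_trapezoid(curve)
rank_prob = auc_pairwise(scores, labels)
print(f"AUC by area={area:.4f}  AUC by ranking={rank_prob:.4f}")
```

The loop recounts TP and FP at every threshold for clarity; a production implementation would sort once and accumulate counts instead.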
Merits

Threshold-independent. ROC describes the underlying score distributions, not the operating point you happen to have picked. The same model produces the same curve regardless of how you set the decision boundary.

AUC has a clean probabilistic interpretation: the probability that a randomly chosen positive scores higher than a randomly chosen negative. This makes it well suited to ranking problems, model comparison, and reporting in the literature.

Demerits

Over-optimistic on imbalanced data. FPR uses the large TN count in its denominator, so even many false positives barely move the curve. On heavily skewed problems, the precision-recall curve is more honest.

Conflates models with different shapes. Two ROC curves with identical AUC can offer very different operating-point trade-offs. A high AUC does not guarantee that any specific threshold is useful for your application.

§ 08 Classification reference

Metric | Formula | What it measures | When to reach for it
Accuracy | (TP+TN) / total | Fraction of all predictions that are correct. | Quick summary on balanced problems where all errors cost the same. Avoid on imbalanced data.
Precision | TP / (TP + FP) | Of positive predictions, the fraction that are correct. | When false alarms are expensive: spam filters, content moderation, search ranking.
Recall (Sensitivity) | TP / (TP + FN) | Of real positives, the fraction caught. | When missed positives are costly: disease, fraud, security threats.
Specificity | TN / (TN + FP) | Of real negatives, the fraction correctly identified. | Diagnostic medicine; the partner of sensitivity in ROC analysis.
F1 score | 2·TP / (2·TP + FP + FN) | Harmonic mean of precision and recall. | Imbalanced classification benchmarks. Use F-beta when one of P or R matters more.
ROC / AUC | ∫ TPR d(FPR) | Threshold-independent summary of classifier ranking quality. | Comparing models across all thresholds. Prefer PR-AUC on heavily imbalanced problems.

One practical rule. Never report a single number alone — pair precision with recall, or sensitivity with specificity, or accuracy with a class-balance figure. The single-number temptation is exactly what produces 99%-accurate cancer detectors that miss every cancer.

Part II
Regression
Continuous outcomes, sums of squares, and the model-selection metrics that prevent overfitting.

§ 09 The three sums of squares

Where classification splits errors into four discrete cells, regression has continuous errors. Predictions f(x) are real numbers, truth values y are real numbers, and the error is a signed gap, positive or negative, squared so that misses in both directions count and the math works out cleanly. Almost every regression metric on this page is built from one of three sums of squares.

Each deviation yᵢ − ȳ decomposes into two pieces: what the model predicted, f(xᵢ) − ȳ, and what it missed, yᵢ − f(xᵢ). Squared and summed, these pieces are the building blocks of every regression metric.

Drag the line below to control your model's slope and intercept. Toggle the colored segments to see each "sum of squares" appear literally on the chart — vertical bars whose squared lengths sum to the named quantity.

Demo 04 · Variance decomposition
Legend: your fit f(x), mean ȳ, data yᵢ.
Controls: Slope (starts at 0.82) · Intercept (starts at 1.40).
Readouts: SSres 5.67 = Σ(y−f(x))² · SSreg 55.24 = Σ(f(x)−ȳ)² · SStot 60.90 = Σ(y−ȳ)² · R² 0.907 = 1 − SSres/SStot.
Summary: "Your line explains 91% of the total variance." Scale runs from 0% (R² = 0) to 100% (R² = 1).

SStot is a fact about the data, not the model: it's the total variation in y around its mean. SSreg is how much variation the model introduces by tilting away from a flat line at ȳ. SSres is the part the model couldn't reach, the sum of squared residuals that OLS minimizes. For an OLS fit with an intercept, these three obey an exact identity: SStot = SSreg + SSres. Drag the line away from OLS and you'll see the identity break.
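The identity, and the way it breaks off the OLS line, can be verified directly. The sketch below uses six made-up data points and hypothetical helper names; the OLS line comes from the closed-form slope and intercept for one predictor.

```python
def sums_of_squares(xs, ys, slope, intercept):
    """Return (SSres, SSreg, SStot) for the line f(x) = slope*x + intercept."""
    y_bar = sum(ys) / len(ys)
    preds = [slope * x + intercept for x in xs]
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, preds))
    ss_reg = sum((f - y_bar) ** 2 for f in preds)
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return ss_res, ss_reg, ss_tot


def ols_line(xs, ys):
    """Closed-form least-squares slope and intercept for one predictor."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 4.2, 5.8, 6.1, 7.9]

slope, intercept = ols_line(xs, ys)
res, reg, tot = sums_of_squares(xs, ys, slope, intercept)
gap_ols = tot - (reg + res)       # zero up to rounding error

# Tilt the line away from OLS: the identity no longer holds.
res2, reg2, tot2 = sums_of_squares(xs, ys, slope + 0.3, intercept)
gap_off = tot2 - (reg2 + res2)

print(f"gap at OLS={gap_ols:.2e}  gap off OLS={gap_off:.2f}")
```

The gap off the OLS line is twice the cross term Σ(yᵢ − f(xᵢ))(f(xᵢ) − ȳ), which vanishes only when the residuals are orthogonal to the fitted values.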

§ 10 R² — the coefficient of determination

R² collapses the three sums of squares into a single number on a meaningful scale. Its definition is built around a baseline comparison: what would happen if we just predicted ȳ for every input? That dumb model has a residual sum of squares equal to SStot exactly. So the ratio SSres / SStot tells you what fraction of the baseline's error your model still has left over. Subtract from 1 and you get the fraction you removed.

R² = 1 − SSres / SStot
The fraction of variance the model explains.
Merits

Uniform scale across datasets — a 0.8 always means "explained 80% of the variance," and the interpretation is immediately legible to non-specialists.

Works for any model, not just OLS, and the same formula applies on training data, validation data, or test data. The universal headline metric in regression.

Demerits

Fatally flawed for model selection: for a least-squares fit, adding features never decreases R² on training data, even useless features that fit pure noise. It always tells you "more is better," which is exactly wrong for generalization.

Can be negative on held-out data — most practitioners are surprised the first time they see it. A model worse than predicting ȳ produces negative R².
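A negative R² is easy to produce by hand. The sketch below scores three sets of hypothetical predictions against the same made-up held-out targets; `r_squared` is an illustrative helper.

```python
def r_squared(y_true, y_pred):
    """R² = 1 - SSres/SStot, with SStot taken around the mean of y_true."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Made-up held-out targets and three sets of predictions.
y_test = [3.0, 4.0, 5.0, 6.0, 7.0]
good = [3.2, 3.9, 5.1, 5.8, 7.1]      # close to the truth
mean_only = [5.0] * 5                 # the baseline: always predict the mean
bad = [7.0, 6.0, 5.0, 4.0, 3.0]       # trend has the wrong sign

r2_good = r_squared(y_test, good)
r2_mean = r_squared(y_test, mean_only)
r2_bad = r_squared(y_test, bad)
print(f"good={r2_good:.3f}  mean-only={r2_mean:.3f}  bad={r2_bad:.3f}")
```

The baseline scores exactly zero, and anything with a larger squared error than the baseline lands below it.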

§ 11 Adjusted R²

The fix is to penalize R² by the number of parameters. Adjusted R² rescales the unexplained-variance ratio by degrees of freedom. Instead of comparing SSres / m against SStot / m, it compares SSres / (m − n − 1) against SStot / (m − 1), where m is the number of samples and n is the number of predictors. Adding a useless feature shrinks the denominator m − n − 1 faster than it shrinks SSres, so the metric drops.

Adjusted R² = 1 − (1 − R²)(m − 1) / (m − n − 1)
R² with a complexity tax. Drops when added features don't earn their degree of freedom.
Merits

Directly addresses R²'s overfitting blind spot — drops when a new feature doesn't pull its weight, rises when it does. The simplest possible correction.

On the same interpretable scale as R², so it's easy to communicate. Cheap to compute given R² and the model dimensions.

Demerits

Only meaningful for comparing nested linear models on the same dataset. The penalty is ad-hoc, not derived from information theory or Bayesian considerations.

Not bounded below at 0 — with enough useless features it can go arbitrarily negative. Provides weaker selection guidance than cross-validation, which measures held-out performance directly.
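The complexity tax is simple to see numerically. In this sketch the R² values are hypothetical: a third feature nudges R² from 0.800 to 0.801 in one scenario and lifts it to 0.850 in another, with m = 50 samples in both.

```python
def adjusted_r2(r2, m, n):
    """Adjusted R² for m samples and n predictors (requires m > n + 1)."""
    return 1 - (1 - r2) * (m - 1) / (m - n - 1)


m = 50
before = adjusted_r2(0.800, m, n=2)   # two-feature model
useless = adjusted_r2(0.801, m, n=3)  # third feature barely helps
useful = adjusted_r2(0.850, m, n=3)   # third feature earns its keep

print(f"before={before:.4f}  useless add={useless:.4f}  useful add={useful:.4f}")
```

Raw R² rose in both scenarios, but the adjusted value rises only when the gain outweighs the lost degree of freedom.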

§ 12 Mallows's Cp

Mallows's Cp is the historical ancestor of the model-selection family. It compares a model's residual sum of squares against an estimate of the true noise variance σ̂², usually taken from a "full" reference model with all candidate features. In its classical standardized form, Cp = SSres/σ̂² − m + 2(n + 1), the diagnostic is precise: a well-calibrated submodel has Cp close to n + 1, and values much larger indicate that the submodel is missing real structure, that is, underfitting. The cheat-sheet form below rescales the same quantity into an estimate of test error, so both forms rank candidate models identically.

Cp = (SSres + 2(n + 1)σ̂²) / m
An unbiased estimate of test mean squared error under linear-model assumptions, in squared y-units.
Merits

Has a clean interpretable diagnostic — "Cp ≈ n + 1" gives an immediate calibration check, unlike AIC or BIC whose absolute values aren't meaningful in isolation.

An unbiased estimate of out-of-sample mean squared error under linear regression assumptions. Specifically designed for subset selection in linear models.

Demerits

Requires an estimate of σ², which means fitting a "full" model first. That's expensive and forces you to commit to a candidate feature set in advance.

Assumes linear regression with Gaussian errors. Doesn't generalize to non-linear models the way AIC and BIC do, which limits its usefulness outside its original context.
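Two forms of Cp circulate: the test-error form given in this section and the classical standardized form SSres/σ̂² − m + 2(n + 1), to which the "Cp ≈ n + 1" diagnostic applies. The sketch below computes both for a few hypothetical candidates. The SSres values and σ̂² = 1.0 are invented for illustration, and the function names are assumptions.

```python
def cp_test_mse(ss_res, n, m, sigma2_hat):
    """Cheat-sheet form: an estimate of test MSE."""
    return (ss_res + 2 * (n + 1) * sigma2_hat) / m


def cp_classical(ss_res, n, m, sigma2_hat):
    """Classical standardized form: compare against n + 1."""
    return ss_res / sigma2_hat - m + 2 * (n + 1)


m, sigma2_hat = 50, 1.0  # hypothetical sample size and noise estimate
# Hypothetical (number of features, SSres) pairs for nested submodels.
candidates = [(1, 120.0), (2, 47.0), (3, 46.5), (4, 46.2)]

rows = [(n, cp_test_mse(s, n, m, sigma2_hat), cp_classical(s, n, m, sigma2_hat))
        for n, s in candidates]
for n, mse_form, classical in rows:
    print(f"n={n}  test-MSE form={mse_form:.3f}  classical={classical:.1f}  n+1={n + 1}")

best_by_mse = min(rows, key=lambda r: r[1])[0]
best_by_classical = min(rows, key=lambda r: r[2])[0]
```

For fixed m and σ̂² the classical form equals m/σ̂² times the test-error form, minus m, so the two always select the same submodel.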

§ 13 AIC — Akaike Information Criterion

AIC takes a different route. Instead of starting from R² or sums of squares, it starts from the model's likelihood L (how probable the observed data is under the fitted model) and penalizes by twice the number of parameters. For linear regression that count is n + 2: n coefficients, one intercept, and the noise variance. The 2 in the penalty isn't an ad-hoc fudge; it comes from the expected Kullback-Leibler divergence between the true and estimated distributions. Lower AIC is better, and absolute values matter only in comparisons between models on the same data.

AIC = 2(n + 2) − 2 log(L)
Likelihood-based fit, complexity penalty grounded in information theory.
Merits

Works for any model with a likelihood — linear regression, logistic regression, GLMs, mixture models, time series. Not restricted to OLS the way Cp and adjusted R² are.

Rigorous theoretical foundations rooted in information theory. Does not require candidate models to be nested, so non-nested models fit to the same data can be compared directly.

Demerits

Tends to choose slightly larger models than is optimal. It is not "consistent": even with infinite data, it retains a nonzero probability of preferring an overfit model over the true one.

Its theoretical justification is strongest when the true model is not in the candidate set; if it is, BIC has stronger theoretical justification. Requires correct likelihood specification, which can be a strong assumption.
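For linear regression with Gaussian noise, the maximized log-likelihood depends on the data only through SSres, so AIC can be computed from quantities already on this page. The sketch below assumes that Gaussian model with the variance estimated as SSres/m; the SSres values and function names are hypothetical.

```python
import math


def gaussian_log_likelihood(ss_res, m):
    """Maximized log-likelihood of a Gaussian linear model.

    Assumes the noise variance is set to its maximum-likelihood
    estimate, SSres / m.
    """
    sigma2_mle = ss_res / m
    return -0.5 * m * (math.log(2 * math.pi * sigma2_mle) + 1)


def aic(ss_res, m, n):
    """AIC = 2(n + 2) - 2 log(L) for n predictors and m samples."""
    return 2 * (n + 2) - 2 * gaussian_log_likelihood(ss_res, m)


m = 50
aic_two = aic(47.0, m, n=2)    # hypothetical two-feature fit
aic_three = aic(46.5, m, n=3)  # third feature trims SSres only slightly

print(f"AIC(n=2)={aic_two:.2f}  AIC(n=3)={aic_three:.2f}")
```

Here the extra parameter costs 2 while the likelihood term improves by only m·log(47.0/46.5), so the larger model scores worse.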

§ 14 BIC — Bayesian Information Criterion

BIC has the same shape as AIC but a heavier penalty that grows with sample size. Where AIC charges 2 per parameter, BIC charges log(m), the natural log of the sample size, which exceeds 2 as soon as m ≥ 8. With many samples, BIC becomes increasingly conservative and prefers simpler models. AIC and BIC often pick the same model on small data and diverge on large data, with BIC consistently picking the smaller of the two.

BIC = log(m) · (n + 2) − 2 log(L)
Stricter penalty than AIC. Asymptotically selects the true model under regularity conditions.
Merits

Consistent: under regularity conditions, BIC asymptotically selects the true model whenever the true model is in the candidate set. This is a stronger guarantee than AIC offers.

Favors parsimony, which often produces more interpretable models. Has a clean Bayesian interpretation as an approximation to the marginal likelihood of the model.

Demerits

Can be too aggressive on small samples, choosing models that underfit. The log(m) penalty grows slowly but is already strict at m = 50, where it is about 3.9 per parameter, nearly double AIC's 2.

Assumes the true model is in the candidate set, which is rarely true in practice. When no candidate is correct, AIC's predictive emphasis has stronger justification.
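The disagreement between AIC and BIC comes down to the price per parameter. The sketch below tabulates log(m) for a few sample sizes and then scores two hypothetical models whose log-likelihoods (−70.0 and −68.5) are invented so that the improvement falls between the two penalties. The function names are assumptions.

```python
import math


def aic(log_likelihood, n):
    """AIC = 2(n + 2) - 2 log(L)."""
    return 2 * (n + 2) - 2 * log_likelihood


def bic(log_likelihood, m, n):
    """BIC = log(m)(n + 2) - 2 log(L)."""
    return math.log(m) * (n + 2) - 2 * log_likelihood


# Price per parameter: AIC always charges 2, BIC charges log(m).
penalty_per_param = {m: math.log(m) for m in (5, 8, 50, 1000)}
for m, price in penalty_per_param.items():
    print(f"m={m:5d}  BIC price={price:.2f}  AIC price=2.00")

# Hypothetical fits on m = 50 samples: the third feature raises
# log(L) by 1.5, worth 3 units of -2 log(L).
m = 50
small = {"n": 2, "log_l": -70.0}
large = {"n": 3, "log_l": -68.5}

aic_prefers_large = aic(large["log_l"], large["n"]) < aic(small["log_l"], small["n"])
bic_prefers_large = (bic(large["log_l"], m, large["n"])
                     < bic(small["log_l"], m, small["n"]))
print(f"AIC prefers larger model: {aic_prefers_large}")
print(f"BIC prefers larger model: {bic_prefers_large}")
```

An improvement of 3 clears AIC's price of 2 but not BIC's price of about 3.9 at m = 50, so the two criteria split.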

§ 15 Model selection in action

The four metrics above all attack the same problem from slightly different angles. The interactive below puts them on the same chart so you can watch them disagree and agree. The setup: 50 synthetic data points generated from a true model that depends on exactly two features. The slider controls how many features the regression uses — slide right and the model gets features 1, 2, then 3, 4, ..., up to 10. Features 3 through 10 are pure noise.

Demo 05 · The penalty principle
Slider at 2: "Two features used: the true model. Features beyond this point are pure noise."
Curves: R², Adjusted R², AIC (rescaled), BIC (rescaled), Mallows's Cp (rescaled).
R²: Never decreases as features are added; useless for model selection on its own.
Adjusted R²: Penalizes each new feature; drops when the added feature isn't worth its degree of freedom.
AIC: Lower is better. Penalty 2(n+2), moderate; tends to pick slightly larger models.
BIC: Lower is better. Penalty log(m)(n+2), stricter; prefers simpler models as m grows.
Mallows's Cp: A well-calibrated model has Cp ≈ n+1. Far above means underfitting; well below can indicate overfit on training data.

The pattern is what you should look for in your own work. R² climbs the whole way, hiding the overfit. Adjusted R² peaks at k = 2 and starts dropping, where k is the number of features used (n in the formulas above). AIC and BIC bottom out at k = 2 and start rising. Mallows's Cp tracks n + 1 closely at k = 2 and grows worse with the wrong number of features. The gap between R² and the penalized metrics is the overfitting signature: when R² says "great" and AIC says "no", trust AIC.
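A rough, standard-library-only reconstruction of this experiment is sketched below. Everything specific is an assumption rather than the demo's actual setup: the seed, the true coefficients (2.0 and −1.5), unit-variance Gaussian noise, and a Gaussian likelihood for AIC and BIC. Cp is left out to keep the sketch short.

```python
import math
import random


def solve(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * size
    for r in reversed(range(size)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (aug[r][size] - tail) / aug[r][r]
    return x


def ols_ss_res(rows, ys):
    """Fit OLS with an intercept via the normal equations; return SSres."""
    design = [[1.0] + row for row in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((y - sum(b * v for b, v in zip(beta, d))) ** 2
               for d, y in zip(design, ys))


random.seed(0)  # arbitrary seed, chosen only for reproducibility
m, max_features = 50, 10
X = [[random.gauss(0, 1) for _ in range(max_features)] for _ in range(m)]
# Only the first two features matter; the remaining eight are pure noise.
y = [2.0 * row[0] - 1.5 * row[1] + random.gauss(0, 1) for row in X]
y_bar = sum(y) / m
ss_tot = sum((v - y_bar) ** 2 for v in y)

table = []
for n in range(1, max_features + 1):
    ss_res = ols_ss_res([row[:n] for row in X], y)
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (m - 1) / (m - n - 1)
    log_l = -0.5 * m * (math.log(2 * math.pi * ss_res / m) + 1)
    aic = 2 * (n + 2) - 2 * log_l
    bic = math.log(m) * (n + 2) - 2 * log_l
    table.append((n, r2, adj, aic, bic))

for n, r2, adj, aic, bic in table:
    print(f"n={n:2d}  R2={r2:.3f}  adjR2={adj:.3f}  AIC={aic:.1f}  BIC={bic:.1f}")
```

R² should never decrease down the printed table, while the penalized columns are expected to turn around near n = 2. With a different seed a noise feature can occasionally slip past AIC, which is consistent with its tendency to pick slightly larger models.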

§ 16 Regression reference

Metric | Formula | What it measures | When to reach for it
SStot | Σ(yᵢ − ȳ)² | Total variation in the outcome. | As the denominator in R²; as a "budget" check on data variability.
SSreg | Σ(f(xᵢ) − ȳ)² | Variation introduced by the model. | Diagnostic; for OLS equals SStot − SSres and powers R².
SSres | Σ(yᵢ − f(xᵢ))² | Variation left unexplained. | The quantity OLS minimizes. Track directly during training.
R² | 1 − SSres/SStot | Fraction of variance explained. | Reporting a model's headline fit. Beware on held-out data: it can go negative.
Adjusted R² | 1 − (1−R²)(m−1)/(m−n−1) | R² with complexity penalty. | Comparing nested linear models of different sizes.
Mallows's Cp | (SSres + 2(n+1)σ̂²) / m | Estimated test MSE in squared y-units. | Classical subset selection in linear regression with a reference σ̂².
AIC | 2(n+2) − 2 log(L) | Likelihood with moderate complexity penalty. | Non-nested model comparison with predictive emphasis. Works beyond OLS.
BIC | log(m)(n+2) − 2 log(L) | Likelihood with stricter sample-size-aware penalty. | When consistency matters and the true model is plausibly in the candidate set.

A pragmatic discipline that consistently outperforms blind allegiance to any single metric: look at several at once, treat large disagreements as signals that something is wrong, and validate the final model on data it never saw during selection.