A working tour of how every metric on the CS 229 tips-and-tricks cheat sheet is built, what motivated it, what it does well, and where it breaks down.
Classification predictions split four ways. The model says yes or no; the truth is yes or no. Every prediction lands in one of four cells. A true positive is a yes-yes — the model correctly raised the alarm. A false negative is a no-yes — the model missed something real. A false positive is a yes-no — the model raised a false alarm. A true negative is a no-no — quietly correct, the most common cell in most real systems.
Every classification metric on this page is a different way of dividing these four numbers. Choosing a metric means choosing what kind of mistake you care about.
Drag the four sliders below to imagine a cancer-screening study on 1,000 patients. About 10% truly have the disease. The cells start at a reasonable model — high recall, moderate precision — and you can dial each cell up or down to see how the downstream metrics react.
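The four cells can be tallied directly from paired labels and predictions. The sketch below is a minimal illustration: the function name `confusion_counts` and the cohort counts are invented for this example and are not the demo's actual starting values.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1      # yes-yes: correct alarm
        elif pred == 1 and truth == 0:
            fp += 1      # yes-no: false alarm
        elif pred == 0 and truth == 1:
            fn += 1      # no-yes: missed positive
        else:
            tn += 1      # no-no: quietly correct
    return tp, fp, fn, tn


# Illustrative cohort (made-up numbers): 1,000 patients, 100 with the disease.
y_true = [1] * 100 + [0] * 900
y_pred = [1] * 90 + [0] * 10 + [1] * 60 + [0] * 840

counts = confusion_counts(y_true, y_pred)
print(counts)  # (90, 60, 10, 840)
```

Every metric that follows is some ratio of these four integers.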
The most obvious question to ask a classifier is: how often is it right? Accuracy answers exactly that — it's the sum of the two diagonal cells of the confusion matrix divided by every prediction made. The formula is symmetric, treating false positives and false negatives identically. That simplicity is its biggest virtue and its biggest trap.
The most intuitive metric available. Reads like a school grade — a single number on a clean [0, 1] scale that requires no statistical background to understand.
Defined identically for any number of classes, not just binary problems. Works as a fair summary when classes are roughly balanced and different errors carry similar costs.
Lies fluently on imbalanced data. If 1% of patients have a disease, a model that predicts "no" for everyone scores 99% — useless, but glowingly accurate by this metric.
Cannot distinguish a model that catches every positive from one that catches none, as long as their total error rates match. Hides which kind of mistake the model makes.
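Both failure modes are easy to reproduce with a few counts. This is a sketch with made-up numbers for 1,000 patients at 1% prevalence; `accuracy` is a hypothetical helper, not a library function.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that land on the diagonal."""
    return (tp + tn) / (tp + fp + fn + tn)


# Model A predicts "no" for all 1,000 patients (10 of whom are sick).
always_no = accuracy(tp=0, fp=0, fn=10, tn=990)

# Model B catches all 10 sick patients but raises 10 false alarms.
catches_all = accuracy(tp=10, fp=10, fn=0, tn=980)

print(always_no, catches_all)  # 0.99 0.99
```

The two models make ten errors each, so accuracy cannot tell them apart, even though one finds every case and the other finds none.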
Once accuracy fails on imbalanced data, you need a metric that focuses only on the predictions you actually care about — the positive ones. Precision asks the question every practitioner eventually faces: when this model says yes, how often is it right? The denominator is everything the model labeled positive, so the metric is unaffected by however many negatives sit quietly in the data.
Directly answers the question that matters whenever false alarms are expensive — spam filters, content moderation, search results, ad targeting.
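Insensitive to the true-negative count: its denominator holds only positive predictions, so a large pile of correctly rejected negatives cannot inflate it. False alarms drawn from a large negative class still count against it, which is exactly what you want.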
Completely ignores false negatives. A model that flags only its single most confident case can achieve perfect precision while being catastrophically incomplete.
Sees only one column of the confusion matrix, so it must always be paired with recall to give a meaningful picture of a model's behavior.
Recall flips the question. Where precision asks about the model's predictions, recall asks about the truth: of all the actual positives out there, how many did the model catch? Its denominator is fixed by reality — it depends on how many positives exist, not on how many alarms the model raised. The medical literature calls this same quantity sensitivity; the machine learning literature uses recall.
The right metric whenever missing a positive is costly — undetected cancer, undetected fraud, undetected threats. Aligns with what high-stakes systems actually optimize.
Robust to class imbalance: its denominator depends only on the actual positives, which don't move when negatives proliferate.
Ignores false positives entirely. A model that simply predicts "yes" for everyone has perfect recall — and is useless.
Like precision, it sees only one slice of the confusion matrix: the actual positives, where precision sees the predicted positives. Must be evaluated alongside something that constrains false alarms, such as precision, specificity, or a fixed false-positive budget.
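The two degenerate classifiers described above, one that flags a single sure case and one that flags everyone, can be checked numerically. This is a sketch under assumed counts (100 positives among 1,000 cases). Returning 0.0 for an empty denominator is a convention chosen here, not a universal rule.

```python
def precision(tp, fp):
    """Of the positive predictions, the fraction that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Of the actual positives, the fraction that were caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Timid model: flags only its single most confident case, and is right.
timid = (precision(tp=1, fp=0), recall(tp=1, fn=99))

# Alarmist model: predicts "yes" for all 1,000 cases.
alarmist = (precision(tp=100, fp=900), recall(tp=100, fn=0))

print(timid)     # (1.0, 0.01)
print(alarmist)  # (0.1, 1.0)
```

Each model maxes out one metric and collapses on the other, which is why the pair is reported together.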
Specificity is recall for the negative class. It asks: of all the real negatives, how many did the model correctly leave alone? In medical screening this metric matters enormously: high specificity means the test isn't sending healthy patients for biopsies. It is also called the true negative rate, TN / (TN + FP).
Provides the counterpart to sensitivity and is essential for fully characterizing a binary classifier. Sensitivity and specificity together describe the two error rates in their natural form.
Central to diagnostic medicine and to ROC analysis, where 1 minus specificity is the false positive rate plotted on the horizontal axis.
Easy to dismiss on heavily imbalanced data. When 99% of cases are negative, getting 99% of them right is the floor, not the ceiling.
Doesn't appear in F1 or in precision-recall curves, which leads practitioners working on imbalanced problems to forget about it entirely.
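A small numeric sketch shows both the link to the false positive rate and why 99% specificity is only a floor on skewed data. The counts are invented: 10,000 cases with 100 true positives and 9,900 negatives.

```python
def specificity(tn, fp):
    """Of the actual negatives, the fraction correctly left alone."""
    return tn / (tn + fp)


def false_positive_rate(tn, fp):
    """Of the actual negatives, the fraction wrongly flagged."""
    return fp / (tn + fp)


tn, fp = 9801, 99  # 9,900 actual negatives in total
spec = specificity(tn, fp)
fpr = false_positive_rate(tn, fp)

print(round(spec, 4), round(fpr, 4))  # 0.99 0.01
```

Even at 99% specificity, the 99 false alarms are about as numerous as the 100 real positives, so roughly half of all alarms would be wrong even if every positive were caught.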
Precision and recall pull in opposite directions, and no single one tells the whole story. F1 is the referee — one number that goes up only when both go up together. It is the harmonic mean of the two, which has a useful property: it is always closer to the smaller of the two inputs. Average a precision of 0.9 with a recall of 0.1 the ordinary way and you get 0.5; the harmonic mean of the same two numbers is 0.18, which is much closer to the truth about what the model can actually do.
The interactive below makes the trade-off visible. The contour plot on the right shows F1 across every possible (precision, recall) pair — darker regions are higher F1. Notice the dark band hugs the diagonal where precision equals recall. Move perpendicular to that diagonal (push one up while pushing the other down) and F1 drops fast.
Combines precision and recall into one number that cannot be gamed by sacrificing one for the other. Going up on F1 requires improving both.
Critically, F1 excludes the true-negative count entirely, so piling up more correctly rejected negatives does not change it. That makes it far less sensitive to class imbalance than accuracy, which is why it dominates benchmarks on imbalanced data.
Weights precision and recall equally, which is rarely what the real world demands. Medical screening cares more about recall; spam filtering more about precision. For those, use F-beta with β ≠ 1.
Less interpretable than its components. F1 = 0.6 might mean P = R = 0.6, or P = 0.99 with R = 0.43 — very different models that the single number cannot distinguish.
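The harmonic-mean arithmetic and the F-beta generalization fit in a few lines. This is a sketch using the standard F-beta definition, in which β > 1 weights recall more heavily. The function name `f_beta` is chosen for this example.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


p, r = 0.9, 0.1
arithmetic = (p + r) / 2
f1 = f_beta(p, r)
f2 = f_beta(p, r, beta=2.0)      # recall-weighted
f_half = f_beta(p, r, beta=0.5)  # precision-weighted

print(round(arithmetic, 3), round(f1, 3))  # 0.5 0.18
print(round(f2, 3), round(f_half, 3))      # 0.122 0.346
```

Because recall is the weak input here, the recall-weighted F2 is harsher than F1 and the precision-weighted F0.5 is more forgiving.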
Every metric above depends on a chosen threshold — usually 0.5 — that turns the model's continuous score into a yes-or-no decision. But the threshold is itself a hyperparameter; changing it produces a different confusion matrix and different metric values. The ROC curve sweeps the threshold across every possible value and plots the result. Each point is a (false positive rate, true positive rate) pair at one threshold. The area under the curve, AUC, summarizes the whole sweep in a single number on [0, 1].
The demo below shows two overlapping score distributions — the model's outputs on negative cases in red, on positive cases in blue. The vertical line is the threshold. Above the threshold, the model says yes; below it, no. As you drag the threshold, watch the (FPR, TPR) point trace out the ROC curve. The further apart the two distributions are, the better the classifier — try the separation slider to see what that looks like.
Threshold-independent. ROC describes the underlying score distributions, not the operating point you happen to have picked. The same model produces the same curve regardless of how you set the decision boundary.
AUC has a clean probabilistic interpretation — the probability that a randomly chosen positive scores higher than a randomly chosen negative. This makes it ideal for ranking problems, model comparison, and reporting in literature.
Over-optimistic on imbalanced data. FPR uses the large TN count in its denominator, so even many false positives barely move the curve. On heavily skewed problems, the precision-recall curve is more honest.
Conflates models with different shapes. Two ROC curves with identical AUC can offer very different operating-point trade-offs. A high AUC does not guarantee that any specific threshold is useful for your application.
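The threshold sweep and the ranking interpretation of AUC can be cross-checked on a toy example. This is a minimal, unoptimized sketch. The scores and labels are invented and the function names are hypothetical. A real project would more likely use a library routine.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs from sweeping the threshold over every distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_pairwise(scores, labels):
    """P(random positive outscores random negative); ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

area = auc_trapezoid(roc_points(scores, labels))
rank_prob = auc_pairwise(scores, labels)
print(round(area, 3), round(rank_prob, 3))  # 0.889 0.889
```

Both routes give 8/9 here: of the nine positive-negative pairs, only the positive scored 0.6 against the negative scored 0.7 is ranked the wrong way round.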
| Metric | Formula | What it measures | When to reach for it |
|---|---|---|---|
| Accuracy | (TP + TN) / total | Fraction of all predictions that are correct. | Quick summary on balanced problems where all errors cost the same. Avoid on imbalanced data. |
| Precision | TP / (TP + FP) | Of positive predictions, the fraction that are correct. | When false alarms are expensive: spam filters, content moderation, search ranking. |
| Recall (Sensitivity) | TP / (TP + FN) | Of real positives, the fraction caught. | When missed positives are costly: disease, fraud, security threats. |
| Specificity | TN / (TN + FP) | Of real negatives, the fraction correctly identified. | Diagnostic medicine; the partner of sensitivity in ROC analysis. |
| F1 score | 2·TP / (2·TP + FP + FN) | Harmonic mean of precision and recall. | Imbalanced classification benchmarks. Use F-beta when one of P or R matters more. |
| ROC / AUC | ∫ TPR d(FPR) | Threshold-independent summary of classifier ranking quality. | Comparing models across all thresholds. Prefer PR-AUC on heavily imbalanced problems. |
One practical rule. Never report a single number alone — pair precision with recall, or sensitivity with specificity, or accuracy with a class-balance figure. The single-number temptation is exactly what produces 99%-accurate cancer detectors that miss every cancer.
Where classification splits errors into four discrete cells, regression has continuous errors. Predictions f(x) are real numbers, truth values y are real numbers, and the error is a distance — a positive or negative gap, squared for the math to work out cleanly. Almost every regression metric on this page is built from one of three sums of squares.
Each deviation yᵢ − ȳ decomposes into two pieces: what the model accounts for, f(xᵢ) − ȳ, and what it misses, yᵢ − f(xᵢ). Squared and summed, these pieces are the building blocks of every regression metric.
Drag the line below to control your model's slope and intercept. Toggle the colored segments to see each "sum of squares" appear literally on the chart — vertical bars whose squared lengths sum to the named quantity.
SStot is a fact about the data, not the model — it's the total variation in y around its mean. SSreg is how much variation the model introduces by tilting away from a flat line at ȳ. SSres is the part the model couldn't reach — the residuals OLS literally minimizes. For an OLS fit specifically, these three obey an exact identity: SStot = SSreg + SSres. Drag the line away from OLS and you'll see the identity break.
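The identity, and the way it breaks away from the OLS fit, can be verified on a handful of points. The sketch below uses the closed-form simple-regression fit. The data and the off-OLS line are made up for illustration.

```python
def ols_line(x, y):
    """Closed-form least-squares slope and intercept for one feature."""
    m = len(x)
    x_bar, y_bar = sum(x) / m, sum(y) / m
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def sums_of_squares(x, y, slope, intercept):
    """Return (SStot, SSreg, SSres) for the line f(x) = slope * x + intercept."""
    y_bar = sum(y) / len(y)
    preds = [slope * xi + intercept for xi in x]
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_reg = sum((p - y_bar) ** 2 for p in preds)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, preds))
    return ss_tot, ss_reg, ss_res


x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# At the OLS fit the identity SStot = SSreg + SSres holds.
tot, reg, res = sums_of_squares(x, y, *ols_line(x, y))
print(round(tot, 3), round(reg + res, 3))

# Drag the line away from OLS (arbitrary slope and intercept) and it fails.
tot2, reg2, res2 = sums_of_squares(x, y, 1.5, 1.0)
print(round(tot2, 3), round(reg2 + res2, 3))
```

SStot is identical in both calls because it never depends on the line. Only SSreg and SSres move.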
R² collapses the three sums of squares into a single number on a meaningful scale. Its definition is built around a baseline comparison: what would happen if we just predicted ȳ for every input? That dumb model's residual sum of squares equals SStot exactly. So the ratio SSres / SStot tells you what fraction of the baseline's error your model still has left over. Subtract from 1 and you get the fraction you removed.
Uniform scale across datasets — a 0.8 always means "explained 80% of the variance," and the interpretation is immediately legible to non-specialists.
Works for any model, not just OLS, and the same formula applies on training data, validation data, or test data. The universal headline metric in regression.
Fatally flawed for model selection: for nested least-squares fits, adding a feature can never decrease R² on training data, even a useless feature that fits pure noise. It always tells you "more is better", which is exactly wrong for generalization.
Can be negative on held-out data — most practitioners are surprised the first time they see it. A model worse than predicting ȳ produces negative R².
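The baseline reading of R², including the negative case, shows up on three points. This is a sketch with invented values. `r_squared` is a hypothetical helper that applies 1 − SSres/SStot to any set of predictions.

```python
def r_squared(y, preds):
    """1 - SSres / SStot, with SStot measured around the mean of y."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, preds))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


y = [1.0, 2.0, 3.0]

perfect = r_squared(y, [1.0, 2.0, 3.0])    # no error left over
baseline = r_squared(y, [2.0, 2.0, 2.0])   # always predict the mean
backwards = r_squared(y, [3.0, 2.0, 1.0])  # worse than the mean

print(perfect, baseline, backwards)  # 1.0 0.0 -3.0
```

The backwards model has four times the baseline's squared error, so it "explains" minus 300% of the variance.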
The fix is to penalize R² by the number of parameters. Adjusted R² rescales the unexplained-variance ratio by degrees of freedom: instead of dividing both SSres and SStot by m, it divides SSres by m − n − 1 and SStot by m − 1, where m is the number of samples and n is the number of predictors. Adding a useless feature shrinks the divisor m − n − 1 proportionally more than it shrinks SSres, so the ratio grows and the metric drops.
Directly addresses R²'s overfitting blind spot — drops when a new feature doesn't pull its weight, rises when it does. The simplest possible correction.
On the same interpretable scale as R², so it's easy to communicate. Cheap to compute given R² and the model dimensions.
Only meaningful for comparing nested linear models on the same dataset. The penalty is ad-hoc, not derived from information theory or Bayesian considerations.
Not bounded below at 0 — with enough useless features it can go arbitrarily negative. Provides weaker selection guidance than cross-validation, which measures held-out performance directly.
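Plugging numbers into the formula 1 − (1 − R²)(m − 1)/(m − n − 1) makes the penalty concrete. The R² values below are invented to mimic a model that gains eight noise features.

```python
def adjusted_r_squared(r2, m, n):
    """Adjusted R^2 for m samples and n predictors (plus an intercept)."""
    if m - n - 1 <= 0:
        raise ValueError("need more samples than fitted coefficients")
    return 1 - (1 - r2) * (m - 1) / (m - n - 1)


m = 50
two_features = adjusted_r_squared(0.80, m, n=2)
ten_features = adjusted_r_squared(0.81, m, n=10)  # raw R^2 crept up
mostly_noise = adjusted_r_squared(0.05, m, n=10)

print(round(two_features, 3), round(ten_features, 3), round(mostly_noise, 3))
# 0.791 0.761 -0.194
```

Raw R² rose from 0.80 to 0.81, yet the adjusted value fell, and a weak model with many predictors drops below zero.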
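Mallows's Cp is the historical ancestor of the model-selection family. It compares a model's residual sum of squares against an estimate of the true noise variance σ̂², usually taken from a "full" reference model with all candidate features. In its classical standardized form, Cp = SSres/σ̂² − m + 2(n + 1), the diagnostic is precise: a submodel with negligible bias has Cp close to n + 1, its number of fitted coefficients including the intercept. Values much larger indicate the submodel is missing real structure, meaning it's underfitting. The cheat-sheet form, (SSres + 2(n + 1)σ̂²) / m, is the same quantity rescaled to an MSE scale, so it ranks models identically but should not be compared against n + 1.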
Has a clean interpretable diagnostic — "Cp ≈ n + 1" gives an immediate calibration check, unlike AIC or BIC whose absolute values aren't meaningful in isolation.
An unbiased estimate of out-of-sample mean squared error under linear regression assumptions. Specifically designed for subset selection in linear models.
Requires an estimate of σ², which means fitting a "full" model first. That's expensive and forces you to commit to a candidate feature set in advance.
Assumes linear regression with Gaussian errors. Doesn't generalize to non-linear models the way AIC and BIC do, which limits its usefulness outside its original context.
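Two common ways of writing Cp can be put side by side: the MSE-scale form used in the summary table and the classical standardized form to which the "Cp ≈ n + 1" check applies. This is a sketch with invented inputs. In practice σ̂² would come from the full reference model.

```python
def cp_mse_scale(ss_res, sigma2_hat, m, n):
    """Cheat-sheet form: (SSres + 2(n+1) * sigma^2) / m, in squared y-units."""
    return (ss_res + 2 * (n + 1) * sigma2_hat) / m


def cp_classical(ss_res, sigma2_hat, m, n):
    """Standardized form: SSres / sigma^2 - m + 2(n+1), compared to n + 1."""
    return ss_res / sigma2_hat - m + 2 * (n + 1)


m, n, sigma2_hat = 50, 2, 1.5

# A submodel whose residuals carry only noise: SSres = (m - n - 1) * sigma^2.
ss_res_good = (m - n - 1) * sigma2_hat
good = cp_classical(ss_res_good, sigma2_hat, m, n)

# A submodel that misses real structure leaves much larger residuals.
underfit = cp_classical(150.0, sigma2_hat, m, n)

print(good, underfit)  # 3.0 56.0
print(cp_mse_scale(ss_res_good, sigma2_hat, m, n))
```

The two forms are related by classical = m · (MSE-scale) / σ̂² − m, so for a fixed σ̂² and m they order candidate models the same way.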
AIC takes a different route. Instead of starting from R² or sums of squares, it starts from the model's likelihood L — how probable the observed data is under the fitted model — and penalizes by twice the number of parameters. The 2 in the penalty isn't an ad-hoc fudge; it comes from the expected Kullback-Leibler divergence between the true and estimated distributions. Lower AIC is better, and absolute values matter only in comparisons between models on the same data.
Works for any model with a likelihood — linear regression, logistic regression, GLMs, mixture models, time series. Not restricted to OLS the way Cp and adjusted R² are.
Rigorous theoretical foundations rooted in information theory. Scores every model on a common scale, so non-nested models fit to the same data can be compared directly.
Tends to choose slightly larger models than is optimal. It is not "consistent": even with infinite data, it retains a nonzero probability of preferring an overfit model over the true one.
Derivation assumes the true model is not in the candidate set; if it is, BIC has stronger theoretical justification. Requires correct likelihood specification, which can be a strong assumption.
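For a linear model with Gaussian errors, the maximized log-likelihood depends on the data only through SSres, which makes AIC easy to compute by hand. The sketch below assumes that setting and counts n + 2 parameters (n coefficients, an intercept, and the noise variance), matching the formula in the summary table. The SSres values are made up.

```python
import math


def gaussian_log_likelihood(ss_res, m):
    """Maximized Gaussian log-likelihood, with variance at its MLE SSres / m."""
    sigma2_mle = ss_res / m
    return -0.5 * m * (math.log(2 * math.pi * sigma2_mle) + 1)


def aic(log_l, n):
    """AIC = 2(n + 2) - 2 log(L); lower is better."""
    return 2 * (n + 2) - 2 * log_l


m = 50
# Adding a third feature trims SSres only slightly (50.0 -> 49.5).
log_l_small = gaussian_log_likelihood(50.0, m)
log_l_big = gaussian_log_likelihood(49.5, m)

aic_small = aic(log_l_small, n=2)
aic_big = aic(log_l_big, n=3)

print(round(aic_small, 2), round(aic_big, 2))
```

The larger model has the higher likelihood, but the gain is worth less than the extra penalty of 2, so AIC still prefers the two-feature model.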
BIC has the same shape as AIC but a heavier penalty that grows with sample size. Where AIC uses 2 per parameter, BIC uses log(m). With many samples, BIC becomes increasingly conservative and prefers simpler models. AIC and BIC often pick the same model on small data and diverge on large data, with BIC consistently picking the smaller of the two.
Consistent: under regularity conditions, BIC asymptotically selects the true model whenever the true model is in the candidate set. This is a stronger guarantee than AIC offers.
Favors parsimony, which often produces more interpretable models. Has a clean Bayesian interpretation as an approximation to the marginal likelihood of the model.
Can be too aggressive on small samples, choosing models that underfit. The log(m) penalty grows slowly but is already strict at m = 50 or so.
Assumes the true model is in the candidate set, which is rarely true in practice. When no candidate is correct, AIC's predictive emphasis has stronger justification.
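The per-parameter penalties, 2 for AIC and log(m) for BIC, can be compared directly, along with a case where the two criteria disagree. This is a sketch using natural logarithms and invented log-likelihoods. The n + 2 parameter count follows the summary table.

```python
import math


def aic(log_l, n):
    return 2 * (n + 2) - 2 * log_l


def bic(log_l, n, m):
    return math.log(m) * (n + 2) - 2 * log_l


# BIC's per-parameter penalty overtakes AIC's once log(m) > 2, i.e. m >= 8.
penalties = {m: round(math.log(m), 2) for m in (7, 8, 50, 10_000)}
print(penalties)  # {7: 1.95, 8: 2.08, 50: 3.91, 10000: 9.21}

# One extra feature buys 1.5 units of log-likelihood on m = 50 samples.
m = 50
aic_small, aic_big = aic(-100.0, n=2), aic(-98.5, n=3)
bic_small, bic_big = bic(-100.0, n=2, m=m), bic(-98.5, n=3, m=m)

print(aic_small, aic_big)  # 208.0 207.0
print(round(bic_small, 2), round(bic_big, 2))
```

AIC accepts the extra feature because the gain in −2 log(L) of 3 beats a penalty of 2, while BIC rejects it because its penalty at m = 50 is about 3.9.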
The four penalized metrics above (adjusted R², Mallows's Cp, AIC, and BIC) all attack the same problem from slightly different angles. The interactive below puts them on the same chart, alongside plain R², so you can watch them disagree and agree. The setup: 50 synthetic data points generated from a true model that depends on exactly two features. The slider controls how many features k the regression uses: slide right and the model gets features 1, 2, then 3, 4, ..., up to 10. Features 3 through 10 are pure noise.
The pattern is what you should look for in your own work. R² climbs the whole way, hiding the overfit. Adjusted R² peaks at k=2 and starts dropping. AIC and BIC bottom out at k=2 and start rising. Mallows's Cp tracks n + 1 (here k + 1) closely at k=2 and grows worse with the wrong number of features. The gap between R² and the penalized metrics is the overfitting signature: when R² says "great" and AIC says "no", trust AIC.
| Metric | Formula | What it measures | When to reach for it |
|---|---|---|---|
| SStot | Σ(yᵢ − ȳ)² | Total variation in the outcome. | As the denominator in R²; as a "budget" check on data variability. |
| SSreg | Σ(f(xᵢ) − ȳ)² | Variation introduced by the model. | Diagnostic; for OLS equals SStot − SSres and powers R². |
| SSres | Σ(yᵢ − f(xᵢ))² | Variation left unexplained. | The quantity OLS minimizes. Track directly during training. |
| R² | 1 − SSres/SStot | Fraction of variance explained. | Reporting a model's headline fit. Beware on held-out data, where it can go negative. |
| Adjusted R² | 1 − (1−R²)(m−1)/(m−n−1) | R² with complexity penalty. | Comparing nested linear models of different sizes. |
| Mallows's Cp | (SSres + 2(n+1)σ̂²) / m | Estimated test MSE, in squared y-units. | Classical subset selection in linear regression with a reference σ̂². |
| AIC | 2(n+2) − 2 log(L) | Likelihood with moderate complexity penalty. | Non-nested model comparison with predictive emphasis. Works beyond OLS. |
| BIC | log(m)(n+2) − 2 log(L) | Likelihood with stricter sample-size-aware penalty. | When consistency matters and the true model is plausibly in the candidate set. |
A pragmatic discipline that consistently outperforms blind allegiance to any single metric: look at several at once, treat large disagreements as signals that something is wrong, and validate the final model on data it never saw during selection.