A working tour of the two function families that drive deep learning — from sigmoid to SwiGLU, from MSE to DPO. What each one is, why it exists, what it's used for, and where it breaks down.
A linear layer computes y = Wx + b. Stack two of them: y = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂). Two linear layers collapse into one. Three collapse into one. A thousand collapse into one. No matter how deep you make a purely linear network, it can only represent linear functions.
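The collapse can be checked numerically. The sketch below uses small made-up weights (the values of `W1`, `b1`, `W2`, `b2`, and `x` are illustrative assumptions, not taken from any real model) and compares the two-layer output with the single merged layer.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B, both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]

# Two stacked linear layers with illustrative weights.
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [-1.0, 0.0]
x = [3.0, -2.0]

two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single layer: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)
```

Both `two_layers` and `one_layer` hold the same vector, which is the algebraic identity above in concrete form.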
The job of an activation function is to break linearity. A single element-wise (non-polynomial) nonlinearity inserted between layers is enough to make a network a universal approximator: capable, in principle, of approximating any continuous function on a bounded domain to arbitrary accuracy, given enough width.
That's the conceptual role. The practical role is harder: an activation must be cheap to compute, easy to differentiate, and produce well-behaved gradients during backpropagation. The history of deep learning is largely the history of finding activations that satisfy all three. Sigmoid saturates and kills gradients. ReLU is fast but kills neurons. GELU and Swish are smooth, gradient-friendly, and now dominate large models.
The first nonlinearity that mattered. Sigmoid squashes any real number into the open interval (0, 1), making its output interpretable as a probability. This is exactly what early neural networks needed for binary classification — the output of the final neuron is the model's belief that the input belongs to the positive class.
Produces a clean probability interpretation, making it the natural choice for binary output layers and gating mechanisms (which need values in [0, 1] to act as "soft switches").
Smooth everywhere with a derivative that can be expressed in terms of the output itself — convenient for efficient backpropagation.
Vanishing gradients. The derivative is at most 0.25, and approaches zero in the saturated tails. Stack a few layers and gradients shrink exponentially — training stalls.
Not zero-centered. All outputs are positive, which biases the gradient direction during training and slows convergence.
Exponential is relatively expensive compared to ReLU's max(0, x).
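A minimal sketch of the points above, assuming scalar inputs: the derivative is computed from the output alone, peaks at 0.25, and the best-case gradient factor contributed by the activations shrinks geometrically with depth. The helper name `max_gradient_factor` is hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative expressed through the output itself: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def max_gradient_factor(depth):
    """Upper bound on the product of `depth` sigmoid derivatives (weights ignored)."""
    return 0.25 ** depth
```

For example, `max_gradient_factor(10)` is already below one millionth, which is the exponential shrinkage described above.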
Tanh is sigmoid's zero-centered cousin. It maps real numbers to (−1, 1) instead of (0, 1), preserving sign information and centering outputs on zero. For decades this was the default activation in hidden layers, because the centered output produces more balanced gradients and faster convergence than sigmoid.
Zero-centered output dramatically helps gradient flow compared to sigmoid. The peak derivative is 1.0, four times sigmoid's peak.
Still bounded, which keeps activations stable and is critical for recurrent networks where activations are reused across many time steps.
Still saturates in both tails — large positive or negative inputs produce near-zero gradients. Vanishing gradient problem only slightly mitigated, not solved.
Two exponentials per evaluation. Replaced almost everywhere by ReLU and friends in feedforward layers.
The activation that made deep learning practical. ReLU is shockingly simple — output the input if positive, zero otherwise — but this simplicity changed everything. For the positive half, the derivative is exactly 1, so gradients flow through deep stacks without attenuation. Training networks with 50, 100, 500 layers became feasible.
Trivially cheap to compute — a single comparison. Drives most of the speed gains in modern deep learning.
Non-saturating for positive inputs. A gradient of exactly 1 on active units means the activation itself does not shrink the backpropagated signal, which removes the main source of vanishing gradients in deep stacks.
Induces sparse activations — roughly half of all neurons output zero for any given input, providing a form of natural regularization and computational efficiency.
Dying ReLU problem. A neuron whose pre-activation becomes very negative gets stuck at zero output and zero gradient — permanently dead. Large learning rates can kill 40%+ of neurons.
Not differentiable at x = 0 (subgradient used in practice). Outputs are unbounded above, which can lead to exploding activations without normalization.
Not zero-centered.
Leaky ReLU was the first attempt to address dying neurons. Instead of zeroing negative inputs entirely, it lets a small fraction through — typically 0.01. The negative side still has a gradient, so dead neurons can come back to life. PReLU (Parametric ReLU) takes this one step further by making the slope a learnable parameter, optimized jointly with the network weights.
Eliminates the dying neuron problem at essentially zero additional compute. The small slope on the negative side keeps gradients flowing.
PReLU lets the network learn the optimal slope per channel, sometimes giving small but consistent accuracy improvements.
The leak coefficient α is arbitrary — 0.01 is conventional but not principled. PReLU partially fixes this but adds learnable parameters and complicates regularization.
Empirical improvements over ReLU are inconsistent. Many practitioners find no meaningful gain on standard image classification benchmarks.
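The difference between a dead ReLU unit and a leaky one comes down to the gradient on the negative side. The sketch below is a scalar illustration; using 0 as the gradient at exactly x = 0 is a convention assumed here.

```python
def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    # Subgradient convention assumed here: 0 at x == 0.
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    # The negative side keeps a small, nonzero slope.
    return 1.0 if x > 0 else alpha
```

A unit whose pre-activation sits at −3 receives no gradient under ReLU but still receives a gradient of 0.01 under Leaky ReLU, so it can recover.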
ELU (Exponential Linear Unit) uses an exponential curve on the negative side, smoothly approaching −α as x → −∞. The negative side is therefore bounded, unlike Leaky ReLU's, and with the usual choice α = 1 the function is also smooth at zero. SELU (Scaled ELU) carefully chooses α and a scaling factor λ such that activations through deep networks self-normalize toward zero mean and unit variance, removing the need for batch normalization.
ELU's negative saturation makes outputs robust to noise. Smooth at zero, so optimization landscapes are slightly nicer than ReLU's.
SELU enables training very deep networks without batch normalization, when paired with proper weight initialization (LeCun normal) and alpha-dropout.
Exponential is more expensive than ReLU's comparison. In practice, ELU is rarely chosen over ReLU on speed-sensitive workloads.
SELU's self-normalization only works under very specific conditions — particular initialization, particular dropout variant, no skip connections. Brittle and largely superseded by LayerNorm/BatchNorm in practice.
GELU is where activation functions enter the era of large language models. Introduced in 2016 and adopted by BERT, GPT-2, GPT-3, and most transformer encoders, it's a smooth nonlinearity that weights each input by the standard normal CDF evaluated at that input. The intuition is that GELU multiplies x by Φ(x), the probability that a standard Gaussian draw falls below x: large positive inputs pass almost unchanged, large negative inputs are almost zeroed, and the transition in between is a probabilistic version of ReLU's hard gate.
Smooth everywhere (including at x = 0), which optimization theory mildly prefers. The non-monotonic dip, with a minimum of about −0.17 near x ≈ −0.75, lets it represent slightly more complex behaviors than ReLU.
Empirically outperforms ReLU on transformer architectures, especially at scale. Became the default for transformer feedforward layers throughout 2018-2021.
Considerably more expensive than ReLU — the exact form needs an erf evaluation, the approximate form needs tanh plus a cubic.
The advantage over ReLU shrinks with very large models and may not justify the compute cost. Newer architectures favor SwiGLU instead.
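The exact and approximate forms mentioned above can be written side by side. This is a scalar sketch; the constant 0.044715 is the one commonly used in the tanh approximation, and the function names are illustrative.

```python
import math

def gelu_exact(x):
    """x * Phi(x), with the standard normal CDF written via erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh-plus-cubic approximation of GELU."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

# Scan the negative side on a coarse grid to locate the dip.
lowest = min(gelu_exact(i / 100) for i in range(-300, 1))
```

On this grid `lowest` lands close to −0.17, matching the range quoted for GELU, and the two forms agree closely across typical input ranges.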
Discovered by Google's neural architecture search in 2017, Swish is even simpler than GELU but behaves similarly. It's just x · σ(x) — the input gated by its own sigmoid. PyTorch calls the same function SiLU (Sigmoid Linear Unit). The two names refer to the same function, with Swish sometimes generalized to include a learnable parameter β: x · σ(βx).
Slightly cheaper than GELU's approximation (just sigmoid times input). Empirically matches or beats GELU on most large-model benchmarks.
The non-monotonic dip allows a tiny amount of negative output for moderately negative inputs, providing more expressive gradients than ReLU.
Still more expensive than ReLU. The gains are real but small (often < 1% accuracy) and may not justify the cost in inference-sensitive applications.
Sigmoid evaluation is required at every forward pass, which on mobile hardware can be a meaningful bottleneck.
Mish takes the Swish idea further by replacing the sigmoid gate with a softplus-and-tanh composition. The result is a smooth, non-monotonic curve similar to Swish but with slightly different behavior in the negative tail. Mish gained attention through the YOLOv4 object detection paper, where replacing Leaky ReLU with Mish gave consistent improvements.
Reported empirical gains over Swish and GELU on computer vision benchmarks, especially in object detection where YOLOv4 became its flagship use.
More expensive than both Swish and GELU. The composition of softplus and tanh is computationally heavier without consistent gains elsewhere.
Not widely adopted outside vision. Transformer-based architectures have largely settled on GELU and SwiGLU.
Softmax is the only activation on this page that operates over a vector rather than element-wise. It takes a vector of arbitrary real numbers ("logits") and produces a probability distribution — non-negative entries that sum to one. This is what turns a neural network's raw output into a categorical distribution over classes, words, or actions.
The temperature parameter T controls the sharpness of the resulting distribution. At T = 1 it's the standard softmax. At T → 0 the distribution collapses onto the argmax. At T → ∞ it approaches uniform. This is exactly the "temperature" you tune when generating text from an LLM.
Produces a valid probability distribution, which makes it the natural output for any classification task and the natural way to convert logits into a categorical for sampling.
Differentiable everywhere, with a clean derivative that combines elegantly with cross-entropy loss (the two together produce simple gradients: softmax(z) − y).
Temperature parameter gives smooth control over output sharpness, useful for distillation, exploration, and sampling diversity.
Computing every exp(zᵢ) directly can be numerically unstable: large logits overflow. The "subtract max" trick is mandatory in any real implementation.
Cost grows linearly with vocabulary size, which becomes a bottleneck for LLMs with vocabularies of 32k-256k tokens. Various approximations (sampled softmax, adaptive softmax) exist.
Produces dense outputs — every class gets nonzero probability, which can be a problem for hard-decision tasks.
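Temperature scaling and the subtract-max trick fit in a few lines. The sketch below assumes a plain list of logits and a strictly positive `temperature`; it is an illustration, not a production kernel.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature, stabilized by subtracting the max logit."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # shifting by m leaves the result unchanged
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

sharp = softmax([1.0, 2.0, 3.0], temperature=0.01)   # nearly one-hot on the argmax
flat = softmax([1.0, 2.0, 3.0], temperature=100.0)   # nearly uniform
huge = softmax([1000.0, 1001.0])                     # no overflow thanks to the shift
```

Without the shift, `math.exp(1000.0)` would overflow; with it, the largest exponent is always zero.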
These aren't activation functions in the strict sense — they're entire layer architectures. A Gated Linear Unit splits a linear projection into two halves, applies an activation to one half, and element-wise multiplies them together. The "gate" decides how much of the other half to let through. SwiGLU (Swish-GLU) and GeGLU (GELU-GLU) are the most successful variants and now dominate state-of-the-art LLM feedforward layers.
Consistently improves quality on language modeling benchmarks compared to standard FFN layers using ReLU or GELU. The gating mechanism is more expressive.
The element-wise multiplication is essentially free; the main cost is the extra weight matrix, which can be offset by narrowing the hidden dimension.
Has become the de facto choice in open-source large language models since 2023.
Three weight matrices instead of two complicate the implementation: naive code uses 50% more parameters before compensating.
Slightly higher memory pressure during training due to the extra activations needed for the gate.
The theoretical justification for why GLU variants work better remains thin — empirical preference, not derivation.
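A SwiGLU feedforward block can be sketched with plain lists. The matrix names `W_gate`, `W_up`, and `W_down` are assumptions chosen for readability, biases are omitted, and `ffn_params` is a hypothetical helper that only counts weights.

```python
import math

def silu(x):
    """Swish / SiLU, x * sigmoid(x), written to avoid overflow."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    z = math.exp(x)
    return x * z / (1.0 + z)

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Project twice, gate one projection with SiLU of the other, project back."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]   # element-wise gating
    return matvec(W_down, hidden)

def ffn_params(d_model, d_hidden, n_matrices):
    """Weight count of an FFN built from n_matrices of size d_model x d_hidden."""
    return n_matrices * d_model * d_hidden
```

Narrowing the hidden width to two thirds of the usual 4 × d_model keeps the parameter count level: `ffn_params(768, 3072, 2)` and `ffn_params(768, 2048, 3)` are equal.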
| Name | Formula | Range | Typical use |
|---|---|---|---|
| Sigmoid | 1 / (1 + e^(−x)) | (0, 1) | Binary output, gating, attention masks |
| Tanh | (e^x − e^(−x)) / (e^x + e^(−x)) | (−1, 1) | RNN/LSTM hidden states |
| ReLU | max(0, x) | [0, ∞) | CNN/MLP hidden layers; default in vision |
| Leaky ReLU | x if x > 0 else 0.01x | (−∞, ∞) | GAN discriminators; cases where dying ReLU is a risk |
| PReLU | x if x > 0 else αx (learned) | (−∞, ∞) | Image classification with learnable slope per channel |
| ELU | x if x > 0 else α(e^x − 1) | (−α, ∞) | Deeper networks; smoother gradient landscape |
| SELU | λ · ELU(x), specific α, λ | (−λα, ∞) | Self-normalizing networks without BN |
| GELU | x · Φ(x) | ≈ (−0.17, ∞) | BERT, GPT-2/3, ViT transformer FFN |
| Swish / SiLU | x · σ(x) | ≈ (−0.28, ∞) | EfficientNet, MobileNetV3; inside SwiGLU |
| Mish | x · tanh(softplus(x)) | ≈ (−0.31, ∞) | YOLOv4, vision SOTA |
| Softplus | ln(1 + e^x) | (0, ∞) | Smooth ReLU; positivity constraints (e.g. predicted variance) |
| Softmax | e^(zᵢ) / Σⱼ e^(zⱼ) | (0, 1) summing to 1 | Classification output, attention weights, action policies |
| SwiGLU | (xW) ⊙ Swish(xV) | unbounded | LLaMA, PaLM, Mistral, Gemma FFN |
| GeGLU | (xW) ⊙ GELU(xV) | unbounded | T5 v1.1, some modern LLMs |
The default loss for regression. MSE penalizes the squared distance between prediction and target, then averages across the dataset. The squaring has two effects: it makes the loss non-negative, and it punishes large errors disproportionately. An error of 4 contributes 16 to the loss, while an error of 1 contributes 1.
Smooth and differentiable everywhere, with a clean closed-form solution for linear models (OLS). Gradient is linear in the residual, making optimization well-behaved.
Statistically corresponds to maximum likelihood estimation under Gaussian noise — the right choice when residuals are approximately normal.
Extremely sensitive to outliers. A single mislabeled data point with a huge residual dominates the gradient and can derail training.
Loss units are the square of the target's units, making the value harder to interpret. RMSE (square root of MSE) is often reported instead.
MAE is the natural alternative to MSE. Instead of squaring residuals, it takes their absolute value. This makes every error contribute linearly to the loss: an error of 4 contributes 4, not 16. The practical consequence is robustness: the gradient from a single outlier has the same magnitude as the gradient from any other point, so it cannot dominate the training signal.
Robust to outliers — single extreme errors don't dominate the gradient.
Loss value is in the same units as the target, making it directly interpretable ("on average we're off by 12.3 dollars").
Not differentiable at zero (the absolute value has a kink). Subgradient methods work but optimization can be slightly less smooth than with MSE.
Constant gradient magnitude means the loss doesn't shrink the gradient as predictions improve — convergence to high precision is slower than MSE near the optimum.
Huber is the compromise. It's quadratic for small errors (like MSE) and linear for large errors (like MAE). A threshold parameter δ controls where the transition happens. The result is a loss that's smooth at zero, has bounded gradient magnitude for outliers, and behaves like MSE near the optimum — getting the best of both worlds.
Combines MSE's smoothness near zero with MAE's outlier robustness. Smooth at every point including zero (unlike MAE).
The transition parameter δ gives explicit control over how aggressively to treat outliers as outliers.
Requires choosing δ — too small and it behaves like MAE everywhere; too large and it's just MSE.
Slightly more expensive than either MSE or MAE alone, and the piecewise definition complicates analytical derivations.
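The piecewise definition is short enough to write out. The sketch below works on a single residual (prediction minus target) and assumes the common 0.5 scaling on the quadratic branch, which makes the two branches meet at |r| = δ.

```python
def huber(residual, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

def huber_grad(residual, delta=1.0):
    """Gradient with respect to the residual: the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, residual))
```

With δ = 1, a residual of 100 yields a gradient of 1 rather than the 200 that squared error would produce, which is the bounded outlier influence described above.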
MSE trains the network to predict the conditional mean, and MAE the conditional median. Quantile loss generalizes this: train with quantile level τ and the network learns to predict the τ-th quantile of the conditional distribution. Set τ = 0.5 and you recover MAE up to a constant factor (median regression); set τ = 0.9 and you get the 90th percentile. This is essential for probabilistic forecasting.
Models distributions, not just point estimates. Train multiple heads at different quantiles (0.1, 0.5, 0.9) to get a complete prediction interval.
Asymmetric penalty allows tuning for asymmetric costs — e.g., under-stocking a warehouse may be worse than over-stocking, motivating τ > 0.5.
Not differentiable at zero. Different quantile heads can produce inconsistent predictions (e.g., the 0.9 quantile prediction below the 0.5 quantile), requiring monotone constraints.
Less intuitive than MSE/MAE and harder to communicate to non-specialists.
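The asymmetry is easiest to see on a single prediction. This sketch uses the standard pinball form of the quantile loss; the function name `pinball` is an illustrative choice.

```python
def pinball(y_true, y_pred, tau):
    """Quantile (pinball) loss for one example at quantile level tau."""
    diff = y_true - y_pred
    # Under-prediction (diff >= 0) costs tau per unit, over-prediction costs 1 - tau.
    return tau * diff if diff >= 0 else (tau - 1.0) * diff

under = pinball(10.0, 8.0, tau=0.9)   # predicted too low
over = pinball(10.0, 12.0, tau=0.9)   # predicted too high
```

At τ = 0.9, missing low by 2 costs nine times as much as missing high by 2, which pushes the prediction upward toward the 90th percentile.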
The loss function that powers logistic regression and every binary neural classifier. BCE measures the negative log-likelihood of the true label under the predicted probability. If you're confident and right, the loss is near zero. If you're confident and wrong, the loss explodes — that's the asymmetry that makes it so effective.
Numerical stability matters. Computing the log of a sigmoid output directly fails when the sigmoid saturates: in floating point the output rounds to exactly 0 or 1, and the log becomes −∞. The standard implementation, BCE-with-logits, takes raw logits and combines sigmoid+log into one numerically safe operation using the log-sum-exp trick.
Maximum likelihood estimator for the Bernoulli distribution — statistically principled. Unbounded penalty for confident-wrong predictions provides strong learning signal exactly when it's needed.
Combines cleanly with sigmoid: the gradient through the combined sigmoid-BCE is simply ŷ − y. No vanishing gradient when paired this way.
Sensitive to class imbalance. A "predict no" model on 99% negative data has very low BCE despite being useless. Pair with class weights or focal loss.
Numerical issues near probabilities of 0 and 1 require careful implementation (use the with-logits variant).
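One common way to write the with-logits variant is shown below for a single example. It is a sketch of the stable formula rather than any particular library's implementation; `logit` is the raw pre-sigmoid score and `target` is 0 or 1.

```python
import math

def bce_with_logits(logit, target):
    """Stable binary cross-entropy: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def bce_grad(logit, target):
    """Gradient with respect to the logit: sigmoid(z) - y."""
    if logit >= 0:
        s = 1.0 / (1.0 + math.exp(-logit))
    else:
        z = math.exp(logit)
        s = z / (1.0 + z)
    return s - target
```

A confident-wrong prediction such as `bce_with_logits(1000.0, 0.0)` returns a finite loss of 1000, where a naive log(1 − sigmoid(z)) would hit log(0).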
The multi-class generalization of BCE. The model produces a softmax distribution over K classes; cross-entropy measures the negative log-probability assigned to the true class. Only the probability assigned to the correct class matters — the rest of the distribution doesn't affect the loss directly (though it affects it indirectly through the softmax normalization).
The natural maximum likelihood loss for categorical distributions. Pairs perfectly with softmax — the gradient is the difference between predicted and target distributions.
Works for any number of classes, from binary up to 50k+ vocabularies in language models.
Treats all wrong classes identically. Predicting "cat" instead of "dog" gets the same penalty as predicting "cat" instead of "airplane" — even though one error is much more reasonable than the other.
Encourages overconfidence. The loss only goes to zero as the true class probability goes to 1, which can produce miscalibrated models. Label smoothing addresses this.
Sensitive to mislabeled data. The loss on a wrongly labeled example is unbounded: the more confidently the model predicts the true class, the harder the bad label pulls it away from the truth.
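The loss and its gradient with respect to the logits can be sketched directly from the definition. The version below works on one example with a hard label given as `target_index`, and computes the log-softmax with the log-sum-exp shift for stability.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the true class."""
    return -log_softmax(logits)[target_index]

def cross_entropy_grad(logits, target_index):
    """Gradient with respect to the logits: softmax(z) minus the one-hot target."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
```

Each gradient entry lies between −1 and 1 and the entries sum to zero: probability mass is moved toward the target class and away from the others.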
Focal loss was introduced for dense object detection, where the foreground/background imbalance is extreme, roughly 1000:1. Standard cross-entropy was failing because the gradient signal was dominated by the vast number of easy background examples. Focal loss multiplies cross-entropy by a factor (1 − pₜ)^γ that sharply down-weights easy examples (where the model is already correct) while leaving hard examples (where the model is wrong or unsure) almost untouched, so the hard cases dominate the total loss.
Dramatically improves training on severely imbalanced data without needing to subsample or oversample. The gradient flow naturally focuses on hard examples.
The γ parameter gives explicit control over how aggressively to focus; setting γ = 0 recovers standard cross-entropy.
Extra hyperparameter γ to tune (typically 1-5, with 2 being a common default). Adds slight compute overhead per training step.
Can over-focus on hard examples that are actually mislabeled, increasing the influence of label noise.
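The modulating factor is simple to inspect on one example. This sketch takes `p_t`, the probability the model assigns to the true class, and omits the optional class-balancing weight that is often used alongside it.

```python
import math

def focal_loss(p_t, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t) ** gamma."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def cross_entropy(p_t):
    return -math.log(p_t)

# Relative weight kept by an easy example and by a hard one at gamma = 2.
easy_weight = focal_loss(0.9) / cross_entropy(0.9)
hard_weight = focal_loss(0.1) / cross_entropy(0.1)
```

At γ = 2 the easy example keeps about 1% of its cross-entropy loss while the hard one keeps about 81%, which is how the many easy backgrounds stop dominating.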
The loss that defines support vector machines. Hinge loss treats classification as a margin-maximization problem rather than a probability problem. As long as the model's score for the true class beats every other class's score by at least 1, the loss is zero. If a wrong class comes within 1 of the right class, the loss starts to grow linearly.
Margin-based — once the model has classified an example correctly with enough confidence, that example stops contributing to the loss. Focuses learning on the hard, near-boundary cases.
Theoretical guarantees about generalization (statistical learning theory was built around SVMs).
Not differentiable at the kink. Doesn't produce probability outputs — calibration requires post-hoc methods like Platt scaling.
Largely superseded by cross-entropy for deep learning. Cross-entropy continues to provide gradient signal even for confident-correct predictions, which empirically converges better with neural networks.
Cross-entropy measures the loss between a prediction and a hard label. KL divergence generalizes this to measure the "distance" between two full probability distributions. It's not symmetric (KL(p‖q) is not the same as KL(q‖p)), but it's zero exactly when the two distributions match, and grows when they differ.
The natural measure of distribution difference. Has a clean information-theoretic interpretation (extra bits needed to encode samples from p using a code optimized for q).
Foundational for many modern training paradigms — VAEs, distillation, RLHF — all build on KL.
Asymmetric. KL(p‖q) penalizes q being small where p is large, but not the reverse. "Reverse KL" KL(q‖p) is sometimes used for different behavior — choosing one over the other has real consequences for what the model learns.
Undefined when q has zero probability where p has positive probability. Numerical care required.
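Both the asymmetry and the zero-probability problem show up in a direct implementation. The sketch below uses natural logs (so the result is in nats) and returns infinity for the undefined case, which is one possible convention.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue                 # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf          # q gives zero mass where p has mass
        total += pi * math.log(pi / qi)
    return total

p = [0.9, 0.1]
q = [0.5, 0.5]
forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
```

Here `forward` and `reverse` differ (roughly 0.37 versus 0.51 nats), so the direction of the divergence has to be chosen deliberately.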
Label smoothing isn't a new loss function — it's a modification of the targets used inside cross-entropy. Instead of using one-hot labels (1 for the true class, 0 for all others), label smoothing uses 1 − ε for the true class and ε / (K − 1) for each other class. The model is no longer asked to be 100% confident in any prediction.
Improves model calibration — predicted probabilities better match true frequencies. Useful when downstream decisions depend on confidence, not just argmax.
Acts as a regularizer, often improving validation accuracy. Standard in transformer training since the original "Attention Is All You Need" paper.
Caps the confidence the loss rewards: the optimum for the target class is 1 − ε, so the model is discouraged from ever outputting 1.0. Slightly harms top-1 accuracy in rare cases.
Less effective when combined with knowledge distillation, which already provides soft targets.
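The smoothed targets described above can be built in one line. This sketch follows the ε / (K − 1) convention from the text; the function name `smooth_labels` is illustrative.

```python
def smooth_labels(true_index, num_classes, eps=0.1):
    """Return 1 - eps on the true class and eps / (K - 1) on every other class."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if i == true_index else off_value for i in range(num_classes)]

targets = smooth_labels(true_index=0, num_classes=3, eps=0.1)
```

The resulting vector still sums to one, so it can replace the one-hot target inside cross-entropy without any other change.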
The classic embedding-learning loss. A triplet consists of an anchor, a positive example (similar to the anchor) and a negative (dissimilar). The loss pushes the anchor-positive distance below the anchor-negative distance by at least a margin α. After training, similar items cluster together in embedding space and dissimilar ones are far apart.
Produces general-purpose embeddings — once trained, the same network can score similarity for any pair, including for classes never seen during training (zero-shot recognition).
The margin parameter gives intuitive control over how strongly to separate classes.
Triplet selection is critical and difficult. Easy triplets (where the model already satisfies the margin) contribute zero gradient. Hard negative mining — finding triplets that violate the margin — is a research problem in itself.
The number of possible triplets grows cubically with dataset size, so exhaustive enumeration is infeasible. Has been largely superseded by batch-level losses like InfoNCE.
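The zero-gradient behaviour of easy triplets follows directly from the hinge in the loss. The sketch below assumes Euclidean distance and a margin of 0.2; both are illustrative choices.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor, positive = [0.0, 0.0], [0.0, 1.0]
easy = triplet_loss(anchor, positive, negative=[3.0, 4.0])   # negative far away
hard = triplet_loss(anchor, positive, negative=[0.0, 0.5])   # negative closer than positive
```

The easy triplet already satisfies the margin and contributes nothing, while the hard one produces a positive loss. This is why mining informative triplets matters.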
The predecessor to triplet loss. Operates on pairs labeled as "similar" or "dissimilar". For similar pairs, minimize distance directly. For dissimilar pairs, push apart up to a margin — beyond that, contribute no gradient.
Conceptually simple — just pairs. Easier to construct training batches than triplets, especially in semi-supervised settings.
Doesn't enforce a margin between classes — only an absolute distance threshold for dissimilar pairs. Triplet loss provides better gradients for cluster separation.
Largely superseded by triplet and InfoNCE losses.
InfoNCE (a loss built on Noise Contrastive Estimation) is what made contrastive learning explode in 2020. Instead of working with single triplets, it treats one batch as a giant classification problem: given an anchor, identify the correct positive among the N items in the batch, where the other N − 1 act as negatives. The "loss" is just multi-class cross-entropy where the classes are the items in the batch and the logits are similarities.
Batch-level: every other example in the batch automatically becomes a negative, eliminating triplet-selection problems. Larger batches → harder negatives → better learning.
Works completely self-supervised — augmentations of the same image are "positives", everything else is negative. SimCLR showed this rivals supervised pretraining.
Foundation of modern multimodal models (CLIP) and text embedding models (E5, BGE, OpenAI embeddings).
Requires very large batch sizes (typically 1024-32768) for effective negative sampling. Memory-hungry; needs gradient accumulation tricks or special infrastructure.
Sensitive to temperature τ. Sensitive to negative composition — false negatives (semantically similar items treated as dissimilar) hurt learning.
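The "batch as classification problem" view can be written as cross-entropy over a similarity matrix. The sketch below assumes embeddings are already L2-normalized, uses dot-product similarity, and computes only the anchor-to-positive direction; `positives[i]` is the match for `anchors[i]`.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy where row i must pick positives[i] out of the whole batch."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        m = max(logits)
        lse = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(lse - logits[i])   # -log softmax at the correct index
    return sum(losses) / len(losses)

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

When every anchor is closest to its own positive the loss is near zero; when the pairs are swapped it is large. Lowering `temperature` sharpens that contrast.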
The objective that powers every autoregressive language model. At each position in a sequence, the model predicts a probability distribution over the vocabulary for what comes next. The loss is cross-entropy between that distribution and the actual next token. Average across the entire sequence, average across the entire dataset, and you have the pretraining loss for GPT-2, GPT-3, GPT-4, LLaMA, Mistral, and every other autoregressive LLM.
This is just categorical cross-entropy applied auto-regressively. What makes it powerful is the scale — train on trillions of tokens with this single objective, and the model learns grammar, facts, reasoning patterns, code, and arguably much more, all from predicting the next token.
Self-supervised — no labels required. Any text becomes training data, which is why LLMs can scale to trillions of tokens of web text.
Single simple objective produces extraordinarily general capabilities. The bitter lesson, that scale plus a simple objective beats sophisticated handcrafted methods, has one of its clearest demonstrations here.
Maximum likelihood estimation, statistically principled.
Teacher forcing during training — at every position the model conditions on the true previous tokens, not its own predictions. This creates an "exposure bias" between training and inference where errors compound.
Loss treats all tokens equally, including filler tokens and stop words that carry little information. Better importance weighting is an open research area.
Doesn't directly optimize what we care about (helpful, harmless responses) — only the likelihood of the training distribution.
BERT took a different approach. Instead of predicting the next token, it randomly masks 15% of tokens and asks the model to recover them, conditioning on the rest of the sequence — both left and right context. This is what makes BERT bidirectional, and what makes it good at understanding tasks rather than generation tasks.
The 15% masking ratio was chosen empirically. Of those 15%, the original paper replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged — a hack to reduce train-test mismatch since [MASK] never appears at inference.
Bidirectional context — every position sees both left and right context, unlike autoregressive models. Better for understanding tasks (classification, NER, QA).
Self-supervised like autoregressive pretraining; any text is training data.
Only 15% of positions produce loss signal per sequence, so each pass yields roughly 6 to 7× fewer training targets than autoregressive training, which predicts every position. Matching GPT-style total signal therefore takes correspondingly more compute.
Can't directly generate text without specialized decoding. Largely displaced by decoder-only autoregressive models for almost all tasks since 2022.
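The 15% selection and the 80/10/10 replacement rule can be sketched on a list of string tokens. The function name `mask_tokens`, the use of `None` to mark positions without loss, and the toy vocabulary are assumptions made for this illustration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (inputs, labels); labels are None where no loss is computed."""
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            inputs.append(tok)
            labels.append(None)          # position not selected: no loss here
            continue
        labels.append(tok)               # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs.append(MASK)          # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            inputs.append(tok)           # 10%: keep the token unchanged
    return inputs, labels

rng = random.Random(0)
tokens = "the cat sat on the mat".split()
inputs, labels = mask_tokens(tokens, vocab=["dog", "ran", "hat"], rng=rng)
```

Only positions whose label is not `None` enter the cross-entropy, which is where the reduced loss signal per sequence comes from.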
Perplexity isn't a training loss. It's the most common evaluation metric for language models, and it's just the exponential of the cross-entropy loss. If cross-entropy is the average surprise per token (in nats when computed with the natural log, in bits with log base 2), perplexity is the effective number of equally likely tokens the model is choosing among at each step. Lower is better.
Intuitive interpretation: the "effective number of choices" reading makes models easy to compare. A drop from PPL 20 to PPL 10 means the model has halved the number of equally likely candidates it is effectively choosing among for each next token.
Standardized across LM literature, making cross-paper comparison straightforward.
Doesn't measure what users care about. Lower perplexity correlates with helpfulness, factuality, and reasoning, but the correlation is loose at the top end. Two models with identical perplexity can have very different chat quality.
Sensitive to tokenization. Two models with different tokenizers can't be compared directly via perplexity.
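The relationship between cross-entropy and perplexity is a single exponential. The sketch below takes the probabilities a model assigned to the tokens that actually occurred; the input values are made up for illustration.

```python
import math

def mean_nll(token_probs):
    """Average negative log-likelihood per token, in nats."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Exponential of the average per-token cross-entropy."""
    return math.exp(mean_nll(token_probs))

uniform_guess = perplexity([1.0 / 50.0] * 10)   # model spreads mass over 50 options
mixed = perplexity([0.5, 0.25])
```

A model that always assigns probability 1/50 to the correct token has perplexity 50, matching the "effective number of choices" reading.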
Distillation transfers knowledge from a large "teacher" model to a smaller "student" model. The student learns to match the teacher's softmax distribution, not just the hard target labels. The teacher's "dark knowledge" — the relative probabilities it assigns to non-target classes — turns out to be more informative than the labels alone.
Produces smaller, faster models with much of the teacher's performance. DistilBERT retains about 97% of BERT's GLUE performance while being 40% smaller and roughly 60% faster.
Soft targets carry more information than hard labels — the teacher's probability distribution acts as a richer supervision signal.
Requires a strong teacher already trained. Two hyperparameters (α, T) to tune. The student inherits the teacher's biases and errors.
Distillation gains diminish as the student approaches the teacher's capacity — you can't make an arbitrary student match an arbitrary teacher.
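One common formulation of the distillation objective mixes a hard-label term with a temperature-softened KL term. In the sketch below, the convention that α weights the hard-label term, and the T² factor on the soft term, are assumptions; the exact weighting varies between implementations.

```python
import math

def softmax(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, target_index, alpha=0.5, T=2.0):
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    hard = -math.log(softmax(student_logits)[target_index])
    return alpha * hard + (1.0 - alpha) * T * T * soft

teacher = [4.0, 1.0, 0.5]
matched = distillation_loss(teacher, teacher, target_index=0)
mismatched = distillation_loss([0.5, 1.0, 4.0], teacher, target_index=0)
```

When the student reproduces the teacher's logits the soft term vanishes and only the hard-label term remains.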
After pretraining produces a base language model that knows how to predict text, supervised fine-tuning teaches it to follow instructions. The data is curated: humans write (or curate) example prompts and ideal responses. The loss is the same next-token cross-entropy as pretraining, but applied only to the response tokens, not the prompt.
The simplest and most effective way to turn a base model into an instruction-follower. A few thousand high-quality examples can dramatically change model behavior.
Same loss as pretraining — no new training infrastructure needed.
Only as good as the demonstrations. The model learns to mimic the surface style of the responses, not the underlying preferences that produced them.
Doesn't capture preferences over multiple possible responses — humans rarely write the single "best" answer, just a reasonable one.
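Restricting the loss to response tokens is usually done with a mask. The sketch below assumes per-token log-probabilities of the correct next token have already been computed; `is_response` is a hypothetical boolean mask marking which positions belong to the response.

```python
def sft_loss(token_log_probs, is_response):
    """Mean negative log-likelihood over response positions only."""
    picked = [-lp for lp, keep in zip(token_log_probs, is_response) if keep]
    return sum(picked) / len(picked)

# Two prompt tokens followed by two response tokens (illustrative numbers).
log_probs = [-5.0, -4.0, -1.0, -3.0]
mask = [False, False, True, True]
loss = sft_loss(log_probs, mask)
```

The prompt positions are still fed to the model as context, but they do not contribute to the average.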
To go beyond SFT, modern alignment uses human preferences. Annotators are shown two model responses to the same prompt and asked which is better. The reward model is trained to assign higher scalar scores to preferred responses than to dispreferred ones. The loss is derived from the Bradley-Terry preference model — under which the probability that response y₁ is preferred over y₂ is the sigmoid of their reward difference.
Captures relative preferences without requiring annotators to produce absolute scores — much easier to collect reliably.
Once trained, the reward model becomes a stand-in for human judgment at scale, enabling RL fine-tuning on millions of generated responses.
Reward hacking — the policy can find ways to score high reward that don't actually align with what humans want. The reward model is an imperfect proxy.
Annotator disagreement is high, and the Bradley-Terry assumption (transitive, scalar utility) often doesn't hold.
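Under the Bradley-Terry model described above, the training loss for one preference pair is the negative log-sigmoid of the reward difference. The sketch below takes two scalar rewards; the function names are illustrative.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))

def preference_probability(r_chosen, r_rejected):
    """Bradley-Terry probability that the chosen response is preferred."""
    return math.exp(log_sigmoid(r_chosen - r_rejected))

def reward_model_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the observed preference."""
    return -log_sigmoid(r_chosen - r_rejected)
```

Only the difference between the two rewards matters, so the reward scale has no absolute meaning: adding a constant to every reward leaves the loss unchanged.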
With a reward model in hand, the next step is to fine-tune the language model to produce responses that score high under the reward model — without drifting too far from the original SFT model and losing fluency. PPO is the policy gradient algorithm that achieves this. The objective combines the expected reward with a KL penalty against a frozen reference (the SFT model) to prevent reward hacking.
The clipping is what makes PPO "proximal": by clipping the policy ratio to [1 − ε, 1 + ε], it prevents catastrophic policy updates that would take the model far from where it can still produce coherent text. The KL term on top of that ensures the policy stays in a neighborhood of the SFT model.
The proven workhorse of LLM alignment from 2022-2023. Empirically effective at improving helpfulness and harmlessness above what SFT achieves alone.
The KL penalty against the reference policy prevents the model from degenerating into reward-hacking gibberish.
Complex pipeline: requires a trained reward model, a frozen reference policy, an actor model, and often a value/critic model — four model instances at once. Memory and compute hungry.
Many hyperparameters to tune (KL coefficient, clip range, learning rate, number of PPO epochs). Unstable training, sensitive to reward model quality.
Largely being replaced by DPO and related methods that achieve similar results without the RL machinery.
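The two mechanisms described for PPO, ratio clipping and the KL penalty, can be sketched per sample as follows. This is a simplified illustration, not a full PPO implementation: the advantage is assumed to be given, the KL term uses the common single-sample estimate log π − log π_ref, and the names `clipped_surrogate` and `kl_shaped_reward` are hypothetical.

```python
import math

def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped objective for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)  # new policy / old policy
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum makes the objective pessimistic: gains from moving
    # the ratio outside [1 - eps, 1 + eps] are ignored.
    return min(ratio * advantage, clipped_ratio * advantage)

def kl_shaped_reward(reward: float, logp_policy: float,
                     logp_ref: float, kl_coef: float = 0.1) -> float:
    """Reward model score minus a per-sample KL penalty against the reference."""
    return reward - kl_coef * (logp_policy - logp_ref)

# A ratio of 1.5 with a positive advantage is capped at 1 + eps = 1.2.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)
```

The clipping bounds the step relative to the previous policy iterate, while the KL term bounds the distance from the frozen SFT reference; the two constraints act on different baselines.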
The breakthrough paper of 2023. DPO (Direct Preference Optimization) showed that the RLHF pipeline that follows SFT (train a reward model → run PPO with a KL penalty against the reference) optimizes an objective whose solution can be reached with a single supervised loss applied directly to preference pairs. The reward model is implicit, encoded in the policy itself. No reward training, no rollouts, no PPO: just a single-stage loss almost as simple as SFT.
The intuition: every preference pair (chosen, rejected) provides a signal. The loss increases the relative likelihood the policy assigns to chosen responses and decreases it for rejected ones — but always weighted against what the reference model thought, so the policy can't drift too far.
Dramatically simpler than PPO: one model to train (plus a frozen reference used only to compute log-probabilities), no rollouts, no reward model. Approximately matches PPO's quality at a fraction of the compute and memory.
Stable training without RL's sensitivity to hyperparameters. Standard cross-entropy-flavored loss that any researcher can debug.
Has rapidly become the default preference-tuning method in open-source.
The reference model is fixed during training, which can over-constrain the policy. Reference-free variants such as ORPO drop it entirely.
Can produce models that are overly confident on training preferences but degrade on out-of-distribution prompts. The implicit reward is not separately validated.
The β hyperparameter is critical and dataset-dependent.
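The DPO loss for a single preference pair can be sketched as below. The inputs are assumed to be summed log-probabilities of each full response under the policy and under the frozen reference; the function name `dpo_loss` and its argument names are illustrative, not taken from any particular library.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair."""
    # Implicit rewards: how much more likely each response became
    # relative to the reference model, in log space.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)

# At initialization the policy equals the reference, so the margin is 0.
initial_loss = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After the policy raises the chosen response and lowers the rejected one:
improved_loss = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

β scales the margin: a larger β makes the loss react more sharply to small shifts away from the reference, which is one way to see why the right value depends on the dataset.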
DPO inspired an entire family of follow-ups, each addressing some weakness of the original.
DPO requires preference pairs (a chosen and rejected response for the same prompt). KTO works with single labeled examples — each response is just marked "good" or "bad", no pair required. It uses prospect theory (Kahneman-Tversky) to model human utility asymmetrically: losses hurt more than equivalent gains. This is more data-efficient when you have abundant binary feedback rather than pairwise preferences.
ORPO combines SFT and preference optimization into a single stage. The loss is SFT plus a small odds-ratio penalty that pushes the model away from dispreferred responses. By doing both at once, ORPO skips the SFT-then-DPO two-stage pipeline entirely.
DPO's loss has been shown to overfit on preference pairs — it can push toward extreme probability assignments. IPO replaces the log-sigmoid with a squared difference, producing a "smoother" loss that's more resistant to overfitting on noisy preference data.
The family offers a spectrum of trade-offs: KTO for binary feedback, ORPO for unified training, IPO for noisy preferences. All inherit DPO's simplicity over PPO.
Easier to ablate and compare than RL methods, accelerating empirical progress in alignment.
No single winner has emerged. Empirical performance varies by dataset, base model, and evaluation method.
Most inherit DPO's fundamental limitation: a frozen reference policy. Long training runs can hit a wall where the reference is too far from where the policy needs to go. ORPO is the exception, since its odds-ratio penalty needs no reference model.
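To make the DPO versus IPO contrast concrete, the sketch below evaluates both losses as a function of the same preference margin h (the chosen-minus-rejected difference of policy/reference log-ratios). The IPO form used here, a squared distance to a target margin of 1/(2β), is my reading of the published formulation and should be treated as an assumption; the function names are hypothetical.

```python
import math

def dpo_loss_from_margin(h: float, beta: float = 0.1) -> float:
    """DPO: -log sigmoid(beta * h). Keeps decreasing as h grows."""
    x = beta * h
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

def ipo_loss_from_margin(h: float, beta: float = 0.1) -> float:
    """IPO (assumed form): squared distance to the target margin 1 / (2 * beta)."""
    return (h - 1.0 / (2.0 * beta)) ** 2

beta = 0.1
target_margin = 1.0 / (2.0 * beta)  # 5.0 for beta = 0.1

# DPO always rewards a larger margin; IPO penalizes overshooting the target.
dpo_values = [dpo_loss_from_margin(h, beta) for h in (5.0, 50.0, 500.0)]
ipo_values = [ipo_loss_from_margin(h, beta) for h in (5.0, 50.0, 500.0)]
```

The log-sigmoid has no finite minimizer, so on separable pairs DPO can keep pushing the margin up indefinitely; the squared loss has a finite optimum, which is the sense in which it resists extreme probability assignments.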
When the target is a single 0/1 label — spam or not spam, fraud or not, click or no click, malignant or benign — the standard recipe is one output logit, a sigmoid activation, and binary cross-entropy loss. The model produces a real number; sigmoid squashes it into (0, 1); BCE penalizes the gap between that probability and the true label.
Class imbalance: pass pos_weight to upweight rare positives. For severe imbalance (> 100:1), switch to focal loss instead.
Calibration: after training, fit a temperature scalar on a validation set to correct over- or under-confidence.
Don't use softmax over 2 classes. It doubles the output parameters for no benefit and produces identical predictions.
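Don't apply sigmoid before BCE manually; use the fused _with_logits version (BCEWithLogitsLoss in PyTorch). With large-magnitude logits, a manual sigmoid saturates to exactly 0 or 1 in floating point, and the log that follows blows up.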
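The difference between the naive and the fused formulation can be seen in plain Python. The sketch below is an illustration of the numerics only; `bce_with_logits` mirrors the usual softplus-based rewrite and the `pos_weight` convention described above, but it is a hand-written function, not the PyTorch implementation.

```python
import math

def softplus(z: float) -> float:
    """Stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def bce_with_logits(logit: float, target: float, pos_weight: float = 1.0) -> float:
    """BCE computed directly from the logit; pos_weight upweights positives."""
    # -log(sigmoid(z)) = softplus(-z) and -log(1 - sigmoid(z)) = softplus(z)
    return pos_weight * target * softplus(-logit) + (1.0 - target) * softplus(logit)

def naive_bce(logit: float, target: float) -> float:
    """Sigmoid first, then log: breaks once the sigmoid saturates."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

stable = bce_with_logits(40.0, 0.0)  # confidently wrong prediction
try:
    naive = naive_bce(40.0, 0.0)     # sigmoid(40) rounds to exactly 1.0
    naive_failed = False
except ValueError:                   # math.log(0.0) is a domain error
    naive_failed = True
```

In plain Python the naive version raises an exception; in a tensor library the same situation typically shows up as an infinite or clamped loss, which is harder to notice.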
When each example has exactly one label out of K mutually exclusive options — digit recognition out of 10, ImageNet out of 1,000, sentiment out of 5 — the recipe is K output logits, a softmax activation, and categorical cross-entropy. The softmax forces the K probabilities to sum to 1, so the model is implicitly told "exactly one of these is true."
Label smoothing (ε = 0.1 is typical) improves calibration and usually leaves top-1 accuracy unchanged or slightly better. Standard in transformer training.
Mixup / CutMix blend pairs of inputs and their targets; soft targets become natural and the model learns smoother decision boundaries.
Distillation replaces hard labels with a teacher model's soft probability distribution — see §29.
Don't use this when labels aren't exclusive. If a single example can have multiple labels (image with both "cat" and "dog"), softmax forces them to compete. You want multi-label (§37) instead.
Don't apply softmax before cross_entropy manually. Use raw logits — PyTorch's cross_entropy applies log-softmax internally with numerical safety.
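A minimal sketch of what "log-softmax with numerical safety" means, together with the label-smoothing target (1 − ε)·onehot + ε/K. The helper names `log_softmax` and `cross_entropy` are local to this example and only imitate the behavior described above.

```python
import math

def log_softmax(logits):
    """Log-softmax via the log-sum-exp trick (subtract the max first)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy(logits, target_index, label_smoothing=0.0):
    """Categorical CE from raw logits, with optional label smoothing."""
    k = len(logits)
    log_probs = log_softmax(logits)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # Smoothed target: (1 - eps) on the true class plus eps / K everywhere.
        t = label_smoothing / k
        if i == target_index:
            t += 1.0 - label_smoothing
        loss -= t * lp
    return loss

uniform_loss = cross_entropy([0.0, 0.0, 0.0], 1)          # log(3)
huge_logit_loss = cross_entropy([1000.0, 0.0, 0.0], 0)    # no overflow
smoothed_loss = cross_entropy([8.0, 0.0, 0.0], 0, label_smoothing=0.1)
hard_loss = cross_entropy([8.0, 0.0, 0.0], 0)
```

With smoothing, a very confident correct prediction is no longer loss-free, so the logits stop growing without bound.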
When each example can carry any subset of K labels (a photo simultaneously tagged "beach", "sunset", and "palm tree"; a paper that is both "ML" and "optimization"; a movie spanning multiple Netflix genres), the recipe is K independent logits, a sigmoid per class, and binary cross-entropy summed across classes. The crucial difference from multi-class is that labels do not compete: the model can be fully confident in three labels at once, because raising one label's probability takes nothing away from the others.
Asymmetric loss (ASL): different exponents for positive and negative class terms, which handles severe per-label imbalance better than vanilla BCE. Used in the ImageNet-21K multi-label setup.
Focal loss per class: apply the focal modulation (§19) to each sigmoid output independently.
Hierarchical labels: if the K labels form a tree (genre → sub-genre), use hierarchical softmax instead of flat BCE.
The #1 multi-label mistake: using softmax + categorical CE. It silently makes labels compete, hurts performance noticeably, and the bug is invisible until you look at the math.
Per-class thresholding: using 0.5 for every class is rarely optimal. Tune one threshold per class on a validation set, especially under imbalance.
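Metrics: plain accuracy is misleading here, because most labels are negative for most examples and predicting nothing already scores well. Use macro/micro F1, mean average precision, or per-class AUC.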
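The "labels compete under softmax" point and per-class thresholding can both be shown with a four-label toy example. The logits, targets, and thresholds below are invented for illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softplus(z: float) -> float:
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def multilabel_bce(logits, targets):
    """Sum of independent per-class BCE terms, computed from logits."""
    return sum(t * softplus(-z) + (1 - t) * softplus(z)
               for z, t in zip(logits, targets))

def predict_labels(logits, thresholds):
    """Apply one decision threshold per class."""
    return [sigmoid(z) >= t for z, t in zip(logits, thresholds)]

logits = [4.0, 4.0, 4.0, -4.0]   # three labels present, one absent
targets = [1, 1, 1, 0]

sigmoid_probs = [sigmoid(z) for z in logits]   # each present label near 0.98
softmax_probs = softmax(logits)                # each present label near 0.33
predicted = predict_labels(logits, [0.5, 0.5, 0.5, 0.5])
```

Under softmax the three present labels can never each exceed one third, no matter how large the logits get; the independent sigmoids have no such ceiling.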
When the target is a continuous real number — house price, temperature, age estimation, predicted return — the recipe is one scalar output (or D scalars for multi-target regression), no activation at all on the output (just a linear layer), and one of the regression losses from §13–§16. The choice between MSE, MAE, and Huber comes down to how outlier-prone the target distribution is.
Bounded target ∈ [0, 1]: apply sigmoid to the output and use BCE with the soft target. The output stays in range and the loss is still minimized at ŷ = y. Often works as well as linear + MSE, sometimes better.
Strictly positive target: predict log(target) with linear output and MSE, then exponentiate at inference. Or use softplus output activation.
Heteroscedastic noise: predict (μ, log σ) pair, optimize the Gaussian negative log-likelihood. The model learns to be uncertain where the data is noisy.
Quantile regression: train K parallel heads at different quantiles to get full prediction intervals — see §16.
Not normalizing targets. A model trained on prices in dollars vs. thousands-of-dollars produces wildly different gradient scales for the same data. Always standardize the target.
Using MSE with heavy-tailed targets. A few extreme outliers can dominate training; switch to MAE or Huber, or apply a log/Box-Cox transform first.
Forgetting that R² can go negative. On held-out data, a model worse than predicting the training mean has R² < 0. Track MAE/RMSE as well for an unambiguous error measure.
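The heteroscedastic option above, predicting a (μ, log σ) pair and minimizing the Gaussian negative log-likelihood, can be sketched for a single example as follows. The function name `gaussian_nll` is illustrative; the formula is the standard univariate Gaussian NLL.

```python
import math

def gaussian_nll(target: float, mean: float, log_sigma: float) -> float:
    """Negative log-likelihood of target under N(mean, sigma^2)."""
    variance = math.exp(2.0 * log_sigma)
    return (0.5 * math.log(2.0 * math.pi)
            + log_sigma
            + (target - mean) ** 2 / (2.0 * variance))

# Large residual: admitting more uncertainty lowers the loss.
confident_and_wrong = gaussian_nll(3.0, 0.0, log_sigma=0.0)
uncertain_and_wrong = gaussian_nll(3.0, 0.0, log_sigma=math.log(3.0))

# Zero residual: a tight sigma is rewarded instead.
tight_and_right = gaussian_nll(0.0, 0.0, log_sigma=-1.0)
loose_and_right = gaussian_nll(0.0, 0.0, log_sigma=0.0)
```

Predicting log σ rather than σ keeps the standard deviation positive without any constraint on the output layer. With σ fixed, the loss reduces to MSE up to a constant and a scale factor.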
The full recipe at a glance, for the four core tasks that cover the vast majority of supervised deep learning, plus three common variants:
| Task | Target shape | Output activation | Loss function |
|---|---|---|---|
| Binary classification | scalar in {0, 1} | σ(z) — sigmoid | BCE = −[y log ŷ + (1−y) log(1−ŷ)] |
| Multi-class (one label) | int in {0, ..., K−1} or one-hot | softmax(z) ∈ Δᴷ⁻¹ | Categorical CE = −log ŷ_true |
| Multi-label | multi-hot vector in {0, 1}ᴷ | σ(zₖ) per class (independent, NOT softmax) | Σₖ BCE(ŷₖ, yₖ) |
| Regression | real number(s) in ℝ (or constrained range) | none (linear); σ if bounded in [0, 1]; softplus if > 0 | MSE / MAE / Huber / Quantile |
| Imbalanced classification | as above, but rare class | σ or softmax as appropriate | Focal loss, or weighted BCE/CE |
| Sequence (next token) | int sequence over V tokens | softmax over V | Σt categorical CE, masked to response |
| Embedding learning | pairs / triplets / batches | L2-normalize embedding | Triplet, contrastive, or InfoNCE |
One last principle. The output activation and the loss are a matched pair: they're designed to combine cleanly in the gradient. Sigmoid + BCE gives a gradient of ŷ − y with respect to the logit. Softmax + categorical CE gives a gradient of softmax(z) − onehot(y) with respect to the logits. Mixing components from different recipes (e.g., softmax + MSE, or sigmoid + categorical CE) almost always either silently slows training or fundamentally encodes the wrong objective. Pick the recipe that matches your task, then resist the urge to deviate.
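The two gradient identities can be checked numerically with central finite differences. The sketch below is a sanity check on arbitrary example values, not a derivation; all helper names are local to the example.

```python
import math

def softplus(z):
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z, y):
    """Sigmoid + BCE as a function of the logit."""
    return y * softplus(-z) + (1.0 - y) * softplus(z)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_from_logits(logits, target_index):
    """Softmax + categorical CE as a function of the logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return lse - logits[target_index]

def numerical_grad(f, x, h=1e-5):
    """Central finite difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Sigmoid + BCE: d(loss)/d(logit) should equal y_hat - y.
z, y = 0.7, 1.0
grad_bce = numerical_grad(lambda v: bce_from_logit(v, y), z)
analytic_bce = sigmoid(z) - y

# Softmax + CE: d(loss)/d(logit_i) should equal softmax(z)_i - onehot(y)_i.
logits, target = [2.0, -1.0, 0.5], 2
grad_ce = []
for i in range(len(logits)):
    def loss_along_axis(v, i=i):
        shifted = list(logits)
        shifted[i] = v
        return ce_from_logits(shifted, target)
    grad_ce.append(numerical_grad(loss_along_axis, logits[i]))
analytic_ce = [p - (1.0 if i == target else 0.0)
               for i, p in enumerate(softmax(logits))]
```

In both cases the derivative of the activation cancels against the derivative of the log in the loss, leaving a gradient that is simply the prediction error.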
| Name | Formula | Use case |
|---|---|---|
| MSE / L2 | (1/n) Σ(y − ŷ)² | Default regression; Gaussian noise assumption |
| MAE / L1 | (1/n) Σ\|y − ŷ\| | Robust regression; outlier-prone targets |
| Huber | quadratic for \|r\| ≤ δ, linear otherwise | Bounding-box regression; RL value targets |
| Quantile | max(τr, (τ−1)r) | Prediction intervals; probabilistic forecasts |
| Binary CE | −[y log ŷ + (1−y) log(1−ŷ)] | Binary classification; multi-label |
| Categorical CE | −log ŷtrue | Multi-class classification; default deep classification |
| Focal | −(1−pₜ)^γ log pₜ | Severely imbalanced classification; dense detection |
| Hinge | max(0, 1 − y·ŷ) | SVMs; margin-based classification |
| KL divergence | Σ p log(p/q) | VAE; distillation; RL policy regularization |
| Label smoothing | CE with target = (1−ε) onehot + ε/K | Image classification; transformer training |
| Triplet | max(0, d(a,p) − d(a,n) + α) | Face recognition; embedding learning |
| Contrastive (pair) | y·d² + (1−y)·max(0, m−d)² | Siamese networks; signature verification |
| InfoNCE | −log(exp(sim⁺/τ) / Σ exp(sim/τ)) | CLIP; SimCLR; sentence embeddings; modern retrieval |
| Next-token CE | −Σₜ log p(xₜ \| x_{<t}) | LLM pretraining; autoregressive generation |
| Masked LM | −Σ_masked log p(xᵢ \| context) | BERT-family encoder pretraining |
| Distillation | α·CE + (1−α)·T²·KL | DistilBERT; smaller LLM variants; mobile models |
| SFT | next-token CE on response tokens only | Instruction tuning; first stage of alignment |
| Reward model (BT) | −log σ(r(y_w) − r(y_l)) | Training reward models for RLHF |
| PPO | clipped policy gradient + KL penalty | RLHF alignment (ChatGPT, LLaMA-Chat era) |
| DPO | −log σ(β·(Δ_w − Δ_l)), where Δ = log π − log π_ref | Modern alignment; preference tuning without RL |
| KTO | prospect-theoretic single-label preference | Binary good/bad feedback; unbalanced label data |
| ORPO | SFT + odds-ratio penalty | Single-stage instruction + preference tuning |
| IPO | squared margin (smoother DPO) | Noisy preference data; long training runs |
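Three of the table's formulas (Huber, quantile, focal) are easy to misread in one-line form, so here is a per-example sketch in plain Python. The residual convention r = y − ŷ and the Huber constants (½r² inside δ, δ(|r| − ½δ) outside) are the usual ones but are stated here as assumptions, since the table gives only the shape.

```python
import math

def huber(residual: float, delta: float = 1.0) -> float:
    """Quadratic near zero, linear in the tails; continuous at |r| = delta."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

def pinball(residual: float, tau: float) -> float:
    """Quantile loss max(tau * r, (tau - 1) * r) with r = y - y_hat."""
    return max(tau * residual, (tau - 1.0) * residual)

def focal(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss -(1 - p_t)^gamma * log(p_t); requires 0 < p_true <= 1."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

# For tau = 0.9, under-prediction (r > 0) costs nine times over-prediction.
under = pinball(2.0, tau=0.9)
over = pinball(-2.0, tau=0.9)

# Focal down-weights easy examples far more than hard ones.
easy_ratio = focal(0.9) / focal(0.9, gamma=0.0)   # (1 - 0.9)^2
hard_ratio = focal(0.1) / focal(0.1, gamma=0.0)   # (1 - 0.1)^2
```

Setting γ = 0 in the focal loss recovers plain cross-entropy, and setting τ = 0.5 in the quantile loss gives half the absolute error.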
A final structural observation. The history of losses is a history of asking better questions. MSE asks "how far off?" Cross-entropy asks "how confident, in which direction?" Triplet asks "is this closer than that?" InfoNCE asks "which one among N is the match?" DPO asks "which response do you prefer?" Each shift unlocked new capabilities — robust regression, calibrated classification, learned embeddings, multimodal alignment, instruction following. Pick the loss that asks the question you actually want answered.