Activation & loss functions, visualized.

A working tour of the two function families that drive deep learning — from sigmoid to SwiGLU, from MSE to DPO. What each one is, why it exists, what it's used for, and where it breaks down.

30+ functions, 7 interactive demos · Approx. 40 min read

Contents

Part I · Activation functions

Why activations exist
Tanh
ReLU
Leaky ReLU, PReLU
ELU, SELU
GELU
Swish / SiLU
Mish
GLU, SwiGLU, GeGLU
Activation reference

Part II · Loss functions

Regression

MSE / L2
MAE / L1
Huber
Quantile loss

Classification

Binary cross-entropy
Categorical cross-entropy
Focal loss
Hinge loss
KL divergence
Label smoothing

Embedding & contrastive

Triplet loss
Contrastive loss
InfoNCE (CLIP, SimCLR)

Sequence & LLM

Next-token cross-entropy
Masked LM loss (BERT)
Perplexity
Knowledge distillation
SFT — supervised fine-tuning
Reward model loss (Bradley-Terry)
PPO (RLHF)
DPO — direct preference optimization
KTO, ORPO, IPO

Loss reference

See also

For the classification study path — sigmoid, softmax, cross-entropy loss, and confusion-matrix metrics taught together — see Classification & Loss Functions.

Part I

Activation functions

The element-wise nonlinearities applied after every linear layer. Without them, a neural network of any depth would be equivalent to a single linear layer.

§ 01Why activations exist

A linear layer computes y = Wx + b. Stack two of them: y = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂). Two linear layers collapse into one. Three collapse into one. A thousand collapse into one. No matter how deep you make a purely linear network, it can only represent linear functions.
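The collapse can be checked numerically. The sketch below uses small, arbitrary 2 × 2 weights (illustrative values, not from any real model) and plain Python lists; the helper names are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

# Illustrative weights, biases, and input (arbitrary values).
W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.5, -0.5]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [1.0, 2.0]
x = [3.0, -2.0]

# Two stacked linear layers: W2 (W1 x + b1) + b2.
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The single equivalent layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layer, one_layer)
```

Both computations give the same vector, which is the point: without a nonlinearity between them, the second layer adds no representational power.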

The job of an activation function is to break linearity. A single element-wise nonlinearity inserted between layers is enough to make a network a universal approximator — capable, in principle, of representing any continuous function given enough width.

That's the conceptual role. The practical role is harder: an activation must be cheap to compute, easy to differentiate, and produce well-behaved gradients during backpropagation. The history of deep learning is largely the history of finding activations that satisfy all three. Sigmoid saturates and kills gradients. ReLU is fast but kills neurons. GELU and Swish are smooth, gradient-friendly, and now dominate large models.

The demo below plots every common activation on the same axes. Pick a function from the dropdown to see its shape, its derivative, and its key properties at a glance.

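[Interactive demo: each activation f(x) and its derivative f'(x) plotted on shared axes, with controls to pick a function, show the derivative, or overlay all. Properties shown for the default selection, ReLU: range [0, ∞), unbounded above; not smooth, kink at x = 0; not zero-centered, outputs always ≥ 0; cheap, a single max op.]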

§ 03Tanh — hyperbolic tangent

Tanh is sigmoid's zero-centered cousin. It maps real numbers to (−1, 1) instead of (0, 1), preserving sign information and centering outputs on zero. For decades this was the default activation in hidden layers, because the centered output produces more balanced gradients and faster convergence than sigmoid.

tanh(x) = (e^x − e^−x) / (e^x + e^−x)
tanh'(x) = 1 − tanh²(x)

Output range (−1, 1). The derivative peaks at 1 when x = 0, four times sigmoid's peak.

Used in: RNN hidden states · LSTM cell outputs · Old CNN classifiers · VAE encoder outputs (sometimes)

Example use

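An LSTM cell uses tanh twice: once to squash the candidate cell update to (−1, 1), and again to squash the cell state before the sigmoid output gate scales it into the hidden state. The bounded range keeps each update and the hidden state within (−1, 1), which limits how fast values can grow.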

Merits

Zero-centered output dramatically helps gradient flow compared to sigmoid. The peak derivative is 1.0, four times sigmoid's peak.

Still bounded, which keeps activations stable and is critical for recurrent networks where activations are reused across many time steps.

Demerits

Still saturates in both tails: large positive or negative inputs produce near-zero gradients. The vanishing gradient problem is only slightly mitigated, not solved.

Two exponentials per evaluation. Replaced almost everywhere by ReLU and friends in feedforward layers.

§ 04ReLU — rectified linear unit

The activation that made deep learning practical. ReLU is shockingly simple (output the input if positive, zero otherwise), but this simplicity changed everything. For the positive half, the derivative is exactly 1, so gradients pass through active units without attenuation. Combined with later advances such as normalization and residual connections, training networks with 50, 100, or 500 layers became feasible.

ReLU(x) = max(0, x)
ReLU'(x) = 1 if x > 0 else 0

Output range [0, ∞). A constant gradient of 1 for positive inputs greatly reduces vanishing gradients in feedforward nets.

Used in: AlexNet, VGG, ResNet hidden layers · Most CNNs · Most MLPs · Transformer FFN (original)

Example use

In ResNet-50, nearly every convolutional layer is followed by batch normalization and ReLU. A typical activation map after a conv layer has roughly half its values clipped to zero, and this sparsity is part of why ReLU networks work well.

Merits

Trivially cheap to compute: a single comparison per element, with an equally cheap gradient.

Non-saturating for positive inputs. A gradient of exactly 1 on active units means the activation itself does not shrink gradients as they flow backward through deep networks.

Induces sparse activations — roughly half of all neurons output zero for any given input, providing a form of natural regularization and computational efficiency.

Demerits

Dying ReLU problem. A neuron whose pre-activation is negative for every input outputs zero and receives zero gradient, so it can stay dead for the rest of training. With a learning rate set too high, 40% or more of a network's neurons can die this way.

Not differentiable at x = 0 (subgradient used in practice). Outputs are unbounded above, which can lead to exploding activations without normalization.

Not zero-centered.

§ 05Leaky ReLU and PReLU

Leaky ReLU was the first attempt to address dying neurons. Instead of zeroing negative inputs entirely, it lets a small fraction through — typically 0.01. The negative side still has a gradient, so dead neurons can come back to life. PReLU (Parametric ReLU) takes this one step further by making the slope a learnable parameter, optimized jointly with the network weights.

LeakyReLU(x) = x if x > 0 else αx, with α typically 0.01

PReLU: α is a learned per-channel parameter rather than fixed.

Used in: GANs (especially the discriminator) · Image generation networks · PReLU in the PReLU-net ImageNet classifiers that introduced it (He et al., 2015)
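To make the difference in gradient flow concrete, here is a minimal sketch comparing the derivative of ReLU and Leaky ReLU for a neuron whose pre-activations are all negative. The pre-activation values are hypothetical, and the derivative at exactly x = 0 is taken as the negative-side value, which is a common convention rather than a requirement.

```python
def relu_grad(x):
    """Derivative of ReLU (taken as 0 at x = 0)."""
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.01):
    """Derivative of Leaky ReLU with negative-side slope alpha."""
    return 1.0 if x > 0 else alpha

# Hypothetical pre-activations of one neuron across a batch: all negative.
pre_activations = [-3.2, -0.7, -1.5, -4.1]

# Total local gradient that reaches the neuron's weights through the activation.
relu_signal = sum(relu_grad(z) for z in pre_activations)
leaky_signal = sum(leaky_relu_grad(z) for z in pre_activations)

print(relu_signal, leaky_signal)
```

With ReLU the neuron receives no gradient at all from this batch, so its weights cannot move; with Leaky ReLU a small signal survives, which is what lets the neuron recover.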

Example use

In a GAN discriminator, dying ReLU is catastrophic — a dead discriminator can no longer distinguish real from fake images. Leaky ReLU with α = 0.2 is the standard choice. The gradient through negative inputs keeps the discriminator alive throughout training.

Merits

Eliminates the dying neuron problem at essentially zero additional compute. The small slope on the negative side keeps gradients flowing.

PReLU lets the network learn the optimal slope per channel, sometimes giving small but consistent accuracy improvements.

Demerits

The leak coefficient α is arbitrary — 0.01 is conventional but not principled. PReLU partially fixes this but adds learnable parameters and complicates regularization.

Empirical improvements over ReLU are inconsistent. Many practitioners find no meaningful gain on standard image classification benchmarks.

§ 06ELU and SELU

ELU (Exponential Linear Unit) uses an exponential curve on the negative side, smoothly approaching −α as x → −∞. This is bounded (unlike Leaky ReLU's unbounded negative side) and, for α = 1, smooth at zero. SELU (Scaled ELU) carefully chooses α and a scaling factor λ such that activations through deep networks self-normalize to zero mean and unit variance, removing the need for batch normalization.

ELU(x) = x if x > 0 else α(e^x − 1)
SELU(x) = λ · ELU(x), with α ≈ 1.6733, λ ≈ 1.0507

SELU's constants are not arbitrary: they're derived from a fixed-point analysis of mean and variance through deep nets.

Used in: Self-Normalizing Networks (SNNs) · MLPs without BatchNorm · Sparse data scenarios
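The fixed-point property can be checked empirically. This is a rough Monte Carlo sketch, assuming standard-normal pre-activations and the rounded constants quoted above; the sample size and seed are arbitrary choices.

```python
import math
import random

ALPHA, LAMBDA = 1.6733, 1.0507  # rounded SELU constants quoted above

def selu(x):
    """Scaled ELU: lambda * (x if x > 0 else alpha * (exp(x) - 1))."""
    return LAMBDA * (x if x > 0 else ALPHA * (math.exp(x) - 1.0))

random.seed(0)  # fixed seed so the run is repeatable

# Assumption: pre-activations are standard normal (zero mean, unit variance).
samples = [selu(random.gauss(0.0, 1.0)) for _ in range(200_000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

print(round(mean, 3), round(var, 3))
```

The output mean stays close to 0 and the variance close to 1, which is the self-normalizing behavior. The check covers only a single activation applied to ideal inputs; it says nothing about the initialization and dropout conditions a full network needs.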

Merits

ELU's negative saturation makes outputs robust to noise. Smooth at zero, so optimization landscapes are slightly nicer than ReLU's.

SELU enables training very deep networks without batch normalization, when paired with proper weight initialization (LeCun normal) and alpha-dropout.

Demerits

Exponential is more expensive than ReLU's comparison. In practice, ELU is rarely chosen over ReLU on speed-sensitive workloads.

SELU's self-normalization only works under very specific conditions — particular initialization, particular dropout variant, no skip connections. Brittle and largely superseded by LayerNorm/BatchNorm in practice.

§ 07GELU — Gaussian Error Linear Unit

GELU is where activation functions enter the era of large language models. Introduced in 2016 and adopted by BERT, GPT-2, GPT-3, and most transformer encoders, it's a smooth nonlinearity that weights each input by the standard normal CDF evaluated at that input. The intuition is that GELU multiplies x by the probability that a standard Gaussian draw falls below x: a probabilistic version of ReLU's hard gate.

GELU(x) = x · Φ(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 x³)))

Φ is the standard normal CDF. The right-hand form is a fast approximation used in practice.

Used in: BERT · GPT-2, GPT-3 · RoBERTa · Vision Transformer (ViT) · T5 (in some variants)
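A short sketch of both forms, using only the standard library, shows how close the tanh approximation is and where the curve dips. The grid range and resolution are arbitrary choices made for illustration.

```python
import math

def gelu_exact(x):
    """x * Phi(x), with the normal CDF written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation from the formula above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Evaluate both forms on a grid over [-5, 5] in steps of 0.01.
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)

# Locate the minimum of the exact form on the same grid.
x_min = min(grid, key=gelu_exact)

print(max_gap, x_min, gelu_exact(x_min))
```

On this grid the two forms agree closely, and the minimum sits near x ≈ −0.75 with a value of about −0.17, matching the lower bound listed in the reference table.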

Example use

Every transformer block in BERT-base contains a two-layer feedforward network: FFN(x) = GELU(xW₁ + b₁)W₂ + b₂. With hidden size 3072 between the two linear layers (four times the model width of 768), these FFNs hold roughly half of BERT-base's parameters, and they are where GELU does its work.

Merits

Smooth everywhere (including at x = 0), which optimization theory mildly prefers. The non-monotonic dip near x ≈ −0.75 lets it represent slightly more complex behaviors than ReLU.

Empirically outperforms ReLU on transformer architectures, especially at scale. Became the default for transformer feedforward layers throughout 2018-2021.

Demerits

Considerably more expensive than ReLU — the exact form needs an erf evaluation, the approximate form needs tanh plus a cubic.

The advantage over ReLU shrinks with very large models and may not justify the compute cost. Newer architectures favor SwiGLU instead.

§ 08Swish (SiLU)

Popularized by a 2017 Google Brain paper that found it through an automated search over candidate activation functions, Swish is even simpler than GELU but behaves similarly. It's just x · σ(x): the input gated by its own sigmoid. PyTorch calls the same function SiLU (Sigmoid Linear Unit), a name that predates the Swish paper. Swish is sometimes generalized to include a parameter β, which may be learnable: x · σ(βx).

Swish(x) = x · σ(x) = x / (1 + e^−x)

Smooth, non-monotonic, unbounded above, bounded below by approximately −0.28.

Used in: EfficientNet · MobileNetV3 · LLaMA family (in SwiGLU) · PaLM · YOLO variants

Merits

Slightly cheaper than GELU's approximation (just sigmoid times input). Empirically matches or beats GELU on most large-model benchmarks.

The non-monotonic dip allows a tiny amount of negative output for moderately negative inputs, providing more expressive gradients than ReLU.

Demerits

Still more expensive than ReLU. The gains are real but small (often < 1% accuracy) and may not justify the cost in inference-sensitive applications.

Sigmoid evaluation is required at every forward pass, which on mobile hardware can be a meaningful bottleneck.

§ 09Mish

Mish takes the Swish idea further by replacing the sigmoid gate with a softplus-and-tanh composition. The result is a smooth, non-monotonic curve similar to Swish but with slightly different behavior in the negative tail. Mish gained attention through the YOLOv4 object detection paper, where replacing Leaky ReLU with Mish gave consistent improvements.

Mish(x) = x · tanh(softplus(x)) = x · tanh(ln(1 + e^x))

Smooth, non-monotonic, with a slightly deeper negative dip than Swish.

Used in: YOLOv4 · Some image classification SOTA
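The "slightly deeper dip" can be measured directly. The sketch below implements Swish and Mish from their formulas and searches a grid over [−5, 0] for each minimum; the grid is an arbitrary choice.

```python
import math

def swish(x):
    """x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def softplus(x):
    """ln(1 + e^x), computed with log1p for accuracy near zero."""
    return math.log1p(math.exp(x))

def mish(x):
    """x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))

# Search the negative side on a fine grid for each function's lowest value.
grid = [i / 1000.0 for i in range(-5000, 1)]
swish_min = min(swish(x) for x in grid)
mish_min = min(mish(x) for x in grid)

print(round(swish_min, 3), round(mish_min, 3))
```

The minima come out near −0.28 for Swish and −0.31 for Mish, consistent with the ranges in the reference table.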

Merits

Reported empirical gains over Swish and GELU on computer vision benchmarks, especially in object detection where YOLOv4 became its flagship use.

Demerits

More expensive than both Swish and GELU. The composition of softplus and tanh is computationally heavier without consistent gains elsewhere.

Not widely adopted outside vision. Transformer-based architectures have largely settled on GELU and SwiGLU.

§ 11GLU, SwiGLU, GeGLU — gated linear units

These aren't activation functions in the strict sense — they're entire layer architectures. A Gated Linear Unit splits a linear projection into two halves, applies an activation to one half, and element-wise multiplies them together. The "gate" decides how much of the other half to let through. SwiGLU (Swish-GLU) and GeGLU (GELU-GLU) are the most successful variants and now dominate state-of-the-art LLM feedforward layers.

GLU(x, W, V) = (xW) ⊙ σ(xV)
SwiGLU(x) = (xW) ⊙ Swish(xV)
GeGLU(x) = (xW) ⊙ GELU(xV)

⊙ is the element-wise product. Note: at the same hidden size this uses 50% more parameters than a vanilla FFN, but a narrower hidden size typically compensates.

Used in: LLaMA, LLaMA 2, LLaMA 3 · PaLM · Mistral · Gemma · Qwen · Most modern open-weights LLMs

Example use — LLaMA FFN

Every transformer block in LLaMA uses a SwiGLU feedforward network: FFN(x) = (Swish(xW₁) ⊙ xW₂)W₃. There are three weight matrices instead of two, so the hidden dimension is reduced to about 2/3 of its usual size (from 4d to roughly 8d/3) to keep the parameter count comparable to a GELU FFN. At matched parameter count, this swap consistently improves perplexity at essentially no extra inference cost.
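A quick parameter count illustrates why narrowing the hidden size keeps budgets comparable. This sketch assumes bias-free weight matrices, a baseline hidden size of 4 × d_model, and an illustrative model width; real models may round the narrowed size differently.

```python
def ffn_params(d_model, d_hidden):
    """Weights in a standard two-matrix FFN (biases ignored)."""
    return 2 * d_model * d_hidden

def swiglu_params(d_model, d_hidden):
    """Weights in a three-matrix SwiGLU FFN (biases ignored)."""
    return 3 * d_model * d_hidden

d_model = 4096  # illustrative model width

standard = ffn_params(d_model, 4 * d_model)
naive = swiglu_params(d_model, 4 * d_model)            # same hidden size
narrowed = swiglu_params(d_model, (8 * d_model) // 3)  # hidden = 2/3 of 4d

print(naive / standard, narrowed / standard)
```

At the same hidden size the gated layer costs 1.5× the parameters; at two thirds of the hidden size the ratio returns to roughly 1.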

Merits

Consistently improves quality on language modeling benchmarks compared to standard FFN layers using ReLU or GELU. The gating mechanism is more expressive.

The element-wise multiplication is essentially free; the main cost is the extra weight matrix, which can be offset by narrowing the hidden dimension.

Has become the de facto choice in open-source large language models since 2023.

Demerits

Three weight matrices instead of two complicates the implementation — naive code uses 50% more parameters before compensating.

Slightly higher memory pressure during training due to the extra activations needed for the gate.

The theoretical justification for why GLU variants work better remains thin — empirical preference, not derivation.

§ 12Activation reference

Name	Formula	Range	Typical use
Sigmoid	1 / (1 + e^−x)	(0, 1)	Binary output, gating, attention masks
Tanh	(e^x−e^−x)/(e^x+e^−x)	(−1, 1)	RNN/LSTM hidden states
ReLU	max(0, x)	[0, ∞)	CNN/MLP hidden layers; default in vision
Leaky ReLU	x if x > 0 else 0.01x	(−∞, ∞)	GAN discriminators; cases where dying ReLU is a risk
PReLU	x if x > 0 else αx (learned)	(−∞, ∞)	Image classification with learnable slope per channel
ELU	x if x > 0 else α(e^x−1)	(−α, ∞)	Deeper networks; smoother gradient landscape
SELU	λ · ELU(x), specific α, λ	(−λα, ∞)	Self-normalizing networks without BN
GELU	x · Φ(x)	≈ (−0.17, ∞)	BERT, GPT-2/3, ViT transformer FFN
Swish / SiLU	x · σ(x)	≈ (−0.28, ∞)	EfficientNet, MobileNetV3; inside SwiGLU
Mish	x · tanh(softplus(x))	≈ (−0.31, ∞)	YOLOv4, vision SOTA
Softplus	ln(1 + e^x)	(0, ∞)	Smooth ReLU; positivity constraints (e.g. predicted variance)
Softmax	e^zᵢ / Σⱼ e^zⱼ	(0, 1) summing to 1	Classification output, attention weights, action policies
SwiGLU	(xW) ⊙ Swish(xV)	unbounded	LLaMA, PaLM, Mistral, Gemma FFN
GeGLU	(xW) ⊙ GELU(xV)	unbounded	T5 v1.1, some modern LLMs

Part II

Loss functions

The scalar objective every neural network minimizes. Different loss functions encode different beliefs about what counts as a good prediction.

Group A

Regression losses

When the target is a continuous number, the question is "how far off."

§ 13Mean squared error (MSE / L2)

The default loss for regression. MSE penalizes the squared distance between prediction and target, then averages across the dataset. The squaring has two effects: it makes the loss non-negative, and it punishes large errors disproportionately. An error of 4 contributes 16 to the loss, while an error of 1 contributes 1.

MSE = (1/n) · Σ (yᵢ − ŷᵢ)²

Also called L2 loss. The squared form makes the gradient proportional to the error magnitude.

Used in: Linear regression · House price prediction · Image reconstruction (autoencoders) · VAE reconstruction term · Q-learning value targets

Example use

A house price model predicts $420k for a house that actually sold for $400k. Working in thousands of dollars, the squared error is (420 − 400)² = 400. If another house was off by $40k, its squared error is 1600: four times the loss and, because the gradient is proportional to the residual, twice the gradient signal, pulling training harder toward fixing the bigger miss.

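    # PyTorch
    import torch
    import torch.nn.functional as F

    # Illustrative values, in thousands of dollars
    pred = torch.tensor([420.0, 310.0])
    target = torch.tensor([400.0, 350.0])

    loss = F.mse_loss(pred, target)
    # Equivalent: loss = ((pred - target) ** 2).mean()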

Merits

Smooth and differentiable everywhere, with a clean closed-form solution for linear models (OLS). Gradient is linear in the residual, making optimization well-behaved.

Statistically corresponds to maximum likelihood estimation under Gaussian noise — the right choice when residuals are approximately normal.

Demerits

Extremely sensitive to outliers. A single mislabeled data point with a huge residual dominates the gradient and can derail training.

Loss units are the square of the target's units, making the value harder to interpret. RMSE (square root of MSE) is often reported instead.

§ 14Mean absolute error (MAE / L1)

MAE is the natural alternative to MSE. Instead of squaring residuals, it takes their absolute value. This makes every error contribute linearly to the loss: an error of 4 contributes 4, not 16. The practical consequence is robustness: an outlier's gradient has the same magnitude as any other point's, so it cannot dominate the training signal.

MAE = (1/n) · Σ |yᵢ − ŷᵢ|

Also called L1 loss. The gradient has constant magnitude (the sign of the error), so it doesn't grow with error size.

Used in: Robust regression · Time series with outliers · Forecasting (alongside MAPE) · Image-to-image with sharp targets
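One way to see the robustness difference is to ask which constant prediction minimizes each loss on a small dataset containing an outlier. The sketch below does this by grid search; the data values and the grid are made up for illustration.

```python
import statistics

def mse(c, ys):
    """Mean squared error of the constant prediction c."""
    return sum((y - c) ** 2 for y in ys) / len(ys)

def mae(c, ys):
    """Mean absolute error of the constant prediction c."""
    return sum(abs(y - c) for y in ys) / len(ys)

# Illustrative targets with one outlier.
ys = [10.0, 11.0, 12.0, 13.0, 100.0]

# Search a grid of constant predictions for the minimizer of each loss.
grid = [i / 10.0 for i in range(0, 1001)]
best_mse = min(grid, key=lambda c: mse(c, ys))
best_mae = min(grid, key=lambda c: mae(c, ys))

print(best_mse, statistics.mean(ys))    # MSE minimizer vs the mean
print(best_mae, statistics.median(ys))  # MAE minimizer vs the median
```

The MSE minimizer lands on the mean, which the outlier drags far from the bulk of the data, while the MAE minimizer lands on the median and barely notices it.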

Example use

In image super-resolution, MSE tends to produce blurry outputs: its minimizer is the mean of all plausible outputs, so the network averages over possible details. MAE tends to produce sharper images because its minimizer is the median, which is less affected by rare alternatives, so the network is more willing to commit to one specific detail.

Merits

Robust to outliers — single extreme errors don't dominate the gradient.

Loss value is in the same units as the target, making it directly interpretable ("on average we're off by 12.3 dollars").

Demerits

Not differentiable at zero (the absolute value has a kink). Subgradient methods work but optimization can be slightly less smooth than with MSE.

Constant gradient magnitude means the loss doesn't shrink the gradient as predictions improve — convergence to high precision is slower than MSE near the optimum.

§ 15Huber loss (smooth L1)

Huber is the compromise. It's quadratic for small errors (like MSE) and linear for large errors (like MAE). A threshold parameter δ controls where the transition happens. The result is a loss that's smooth at zero, has bounded gradient magnitude for outliers, and behaves like MSE near the optimum — getting the best of both worlds.

L_δ(y, ŷ) = ½(y − ŷ)²            if |y − ŷ| ≤ δ
L_δ(y, ŷ) = δ(|y − ŷ| − ½δ)      otherwise

Smooth L1 in PyTorch, at its default setting, matches the special case δ = 1.

Used in: Faster R-CNN bounding box regression · Robust regression · DQN reinforcement learning targets · Object detection (most modern detectors)
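The piecewise definition translates directly into code. This is a minimal scalar sketch written from the formula above, not a library implementation; the function names are illustrative.

```python
def huber(residual, delta=1.0):
    """Huber loss of a single residual y - y_hat."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual ** 2      # quadratic zone, like MSE
    return delta * (a - 0.5 * delta)    # linear zone, like MAE

def huber_grad(residual, delta=1.0):
    """Derivative with respect to the residual: the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, residual))

for r in (0.3, 1.0, 3.0, -10.0):
    print(r, huber(r), huber_grad(r))
```

The gradient is the residual itself for small errors and saturates at ±δ for large ones, which is why a single wild prediction cannot dominate an update.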

Example use — bounding boxes

Faster R-CNN regresses bounding box coordinates with smooth L1 (Huber with δ = 1). If MSE were used, a single hard example with a wildly wrong predicted box would dominate the gradient. With smooth L1, large errors contribute linearly, keeping training stable on diverse images.

Merits

Combines MSE's smoothness near zero with MAE's outlier robustness. Smooth at every point including zero (unlike MAE).

The transition parameter δ gives explicit control over how aggressively to treat outliers as outliers.

Demerits

Requires choosing δ — too small and it behaves like MAE everywhere; too large and it's just MSE.

Slightly more expensive than either MSE or MAE alone, and the piecewise definition complicates analytical derivations.

§ 16Quantile (pinball) loss

MSE predicts the conditional mean and MAE the conditional median. Quantile loss generalizes this: train with quantile level τ and the network learns to predict the τ-th quantile of the conditional distribution. Set τ = 0.5 and you recover MAE up to a constant factor of ½ (median regression); set τ = 0.9 and you get the 90th percentile. This is essential for probabilistic forecasting.

L_τ(y, ŷ) = max(τ(y − ŷ), (τ − 1)(y − ŷ))

Asymmetric: penalizes under-predictions more if τ > 0.5, over-predictions more if τ < 0.5.

Used in: Demand forecasting · Prediction intervals · Risk-aware regression · Probabilistic time series (DeepAR, MQ-CNN)
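A small experiment suggests why minimizing this loss yields a quantile. The sketch below finds, by grid search, the constant prediction that minimizes the average pinball loss on made-up targets 1 to 11; the helper names and data are illustrative.

```python
def pinball(y, y_hat, tau):
    """Quantile (pinball) loss for one target/prediction pair."""
    diff = y - y_hat
    return max(tau * diff, (tau - 1.0) * diff)

def mean_pinball(c, ys, tau):
    """Average pinball loss of the constant prediction c."""
    return sum(pinball(y, c, tau) for y in ys) / len(ys)

ys = [float(v) for v in range(1, 12)]      # illustrative targets 1..11
grid = [i / 10.0 for i in range(0, 121)]   # candidate constant predictions

q50 = min(grid, key=lambda c: mean_pinball(c, ys, 0.5))
q90 = min(grid, key=lambda c: mean_pinball(c, ys, 0.9))

print(q50, q90)
```

With τ = 0.5 the minimizer is the median of the data; with τ = 0.9 it moves up to the value that about 90% of the targets fall below.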

Merits

Models distributions, not just point estimates. Train multiple heads at different quantiles (0.1, 0.5, 0.9) to get a complete prediction interval.

Asymmetric penalty allows tuning for asymmetric costs — e.g., under-stocking a warehouse may be worse than over-stocking, motivating τ > 0.5.

Demerits

Not differentiable at zero. Different quantile heads can produce inconsistent predictions (e.g., the 0.9 quantile prediction falling below the 0.5 quantile, a problem known as quantile crossing), requiring monotonicity constraints.

Less intuitive than MSE/MAE and harder to communicate to non-specialists.

Group B

Classification losses

When the target is a discrete label, the question becomes "how confident, in which direction."

§ 17Binary cross-entropy

The loss function that powers logistic regression and every binary neural classifier. BCE measures the negative log-likelihood of the true label under the predicted probability. If you're confident and right, the loss is near zero. If you're confident and wrong, the loss explodes — that's the asymmetry that makes it so effective.

BCE = −[y · log(ŷ) + (1 − y) · log(1 − ŷ)]

ŷ ∈ (0, 1) is the predicted probability. Loss → ∞ if you predict 0 for a true 1.

Numerical stability matters. Computing the log of a sigmoid output directly breaks down when the sigmoid saturates: in floating point the output rounds to exactly 0 or 1, and the log becomes −∞. The standard implementation, BCE-with-logits, takes raw logits and fuses sigmoid and log into one numerically safe operation using the log-sum-exp trick.
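The failure and the fix can both be shown with scalar arithmetic. The sketch below uses one widely used stable formulation, max(z, 0) − z·y + log(1 + e^−|z|); it is meant as an illustration of the idea, and the internals of any particular library may differ.

```python
import math

def bce_naive(logit, y):
    """Sigmoid followed by log: breaks when the sigmoid saturates."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def bce_with_logits(logit, y):
    """Stable form: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(logit, 0.0) - logit * y + math.log1p(math.exp(-abs(logit)))

# Moderate logit: both versions agree.
print(bce_naive(2.0, 1.0), bce_with_logits(2.0, 1.0))

# Large logit with the opposite label: the sigmoid rounds to exactly 1.0,
# so the naive version ends up taking log(0).
try:
    naive = bce_naive(40.0, 0.0)
except ValueError:
    naive = float("inf")  # math.log(0.0) is a domain error in Python
print(naive, bce_with_logits(40.0, 0.0))
```

The stable version never exponentiates a positive number and never takes the log of a value near zero, so it returns a finite loss (about 40 here) where the naive version fails.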

Used in: Binary classification · Logistic regression · Multi-label classification (per-class BCE) · GAN discriminator training · Click-through rate prediction

Example use

A spam classifier outputs a sigmoid probability for each email. For a true spam email (y = 1), if the model predicts ŷ = 0.92, the loss is −log(0.92) ≈ 0.083. If it incorrectly predicts ŷ = 0.05, the loss is −log(0.05) ≈ 3.0, about 36× larger. The gradient with respect to the logit, ŷ − y, grows from −0.08 to −0.95 (about 12×), pushing the model to fix this confidently wrong prediction quickly.

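    import torch
    import torch.nn.functional as F

    # logits: raw scores; target: float tensor of 0s and 1s with the same shape
    # Use the with_logits version for numerical stability
    loss = F.binary_cross_entropy_with_logits(logits, target)
    # Equivalent to, but more stable than:
    # loss = F.binary_cross_entropy(torch.sigmoid(logits), target)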

Merits

Maximum likelihood estimator for the Bernoulli distribution — statistically principled. Unbounded penalty for confident-wrong predictions provides strong learning signal exactly when it's needed.

Combines cleanly with sigmoid: the gradient through the combined sigmoid-BCE is simply ŷ − y. No vanishing gradient when paired this way.

Demerits

Sensitive to class imbalance. A "predict no" model on 99% negative data has very low BCE despite being useless. Pair with class weights or focal loss.

Numerical issues near probabilities of 0 and 1 require careful implementation (use the with-logits variant).

§ 18Categorical cross-entropy

The multi-class generalization of BCE. The model produces a softmax distribution over K classes; cross-entropy measures the negative log-probability assigned to the true class. Only the probability assigned to the correct class matters — the rest of the distribution doesn't affect the loss directly (though it affects it indirectly through the softmax normalization).

CE = −Σᵢ yᵢ · log(ŷᵢ) = −log(ŷ_true)

For one-hot targets, only the true class contributes, but softmax couples all logits together.

Used in: ImageNet classification · Multi-class classification everywhere · Language modeling (next-token prediction) · Machine translation · Reinforcement learning policies

Example use

An ImageNet classifier sees a photo of a cat. The true label is class 281 out of 1000. The softmax produces ŷ₂₈₁ = 0.7, so the loss is −log(0.7) ≈ 0.36. The gradient updates push the cat-logit up and pull all other logits down — even though only the cat logit appears in the formula explicitly, every logit is updated because of softmax's denominator.
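The claim that every logit receives an update can be verified on a toy example. The sketch below, assuming three made-up logits, compares the analytic gradient softmax(z) − onehot with a finite-difference estimate.

```python
import math

def softmax(logits):
    """Numerically stable softmax (subtract the max before exponentiating)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_idx):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[true_idx])

def grad_logits(logits, true_idx):
    """Analytic gradient of the loss w.r.t. the logits: softmax - onehot."""
    p = softmax(logits)
    return [pi - (1.0 if i == true_idx else 0.0) for i, pi in enumerate(p)]

logits, true_idx = [2.0, 0.5, -1.0], 0   # illustrative values
analytic = grad_logits(logits, true_idx)

# Central finite differences as an independent check.
eps = 1e-6
numeric = []
for k in range(len(logits)):
    up, down = list(logits), list(logits)
    up[k] += eps
    down[k] -= eps
    numeric.append((cross_entropy(up, true_idx) - cross_entropy(down, true_idx)) / (2 * eps))

print(analytic)
print(numeric)
```

The true-class entry is negative (gradient descent raises that logit) and every other entry is positive (those logits are pushed down), even though only the true class appears in the loss formula.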

Merits

The natural maximum likelihood loss for categorical distributions. Pairs perfectly with softmax — the gradient is the difference between predicted and target distributions.

Works for any number of classes, from binary up to 50k+ vocabularies in language models.

Demerits

Treats all wrong classes identically. Predicting "cat" instead of "dog" gets the same penalty as predicting "cat" instead of "airplane" — even though one error is much more reasonable than the other.

Encourages overconfidence. The loss only goes to zero as the true class probability goes to 1, which can produce miscalibrated models. Label smoothing addresses this.

Sensitive to mislabeled data. A confidently predicted example with a wrong label incurs an unbounded loss and a near-maximal gradient, pulling the model away from the truth.

§ 19Focal loss

Focal loss was introduced for dense object detection, where the foreground/background imbalance is extreme, roughly 1000:1. Standard cross-entropy was failing because the gradient signal was dominated by the vast number of easy background examples. Focal loss multiplies cross-entropy by a factor (1 − pₜ)^γ that sharply down-weights easy examples (where the model is already correct) while leaving hard examples (where the model is wrong or unsure) almost untouched, so the hard examples dominate the gradient.

FL(pₜ) = −(1 − pₜ)^γ · log(pₜ)

pₜ is the predicted probability of the true class. γ ≥ 0 controls focusing strength. γ = 0 recovers cross-entropy.

Used in: RetinaNet · Dense object detection · Severely imbalanced classification · Medical imaging (rare disease detection)
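A scalar sketch of the formula shows how strongly the modulating factor separates easy from hard examples. The probabilities below are made up, and the class-balancing weight used in some implementations is omitted.

```python
import math

def cross_entropy(p_t):
    """Standard cross-entropy for the true-class probability p_t."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    """Focal loss: cross-entropy scaled by (1 - p_t) ** gamma."""
    return ((1.0 - p_t) ** gamma) * cross_entropy(p_t)

easy, hard = 0.9, 0.1   # illustrative true-class probabilities

for p_t in (easy, hard):
    ce, fl = cross_entropy(p_t), focal_loss(p_t)
    print(p_t, round(ce, 4), round(fl, 4), round(fl / ce, 4))
```

With γ = 2, the easy example keeps about 1% of its cross-entropy while the hard example keeps about 81%, so a batch full of easy negatives no longer swamps the few hard cases.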

Merits

Dramatically improves training on severely imbalanced data without needing to subsample or oversample. The gradient flow naturally focuses on hard examples.

The γ parameter gives explicit control over how aggressively to focus, and the loss reduces to standard CE when γ = 0.

Demerits

Extra hyperparameter γ to tune (typically 1-5, with 2 being a common default). Adds slight compute overhead per training step.

Can over-focus on hard examples that are actually mislabeled, increasing the influence of label noise.

§ 20Hinge loss

The loss that defines support vector machines. Hinge loss treats classification as a margin-maximization problem rather than a probability problem. As long as the model's score for the true class beats every other class's score by at least 1 (in the binary case, as long as y · ŷ ≥ 1), the loss is zero. If a wrong class comes within 1 of the right class, the loss starts to grow linearly.

Hinge(y, ŷ) = max(0, 1 − y · ŷ)

For binary y ∈ {−1, +1}, with ŷ the raw score. Zero loss once the margin is satisfied; linear penalty otherwise.

Used in: Support Vector Machines · Margin-based ranking · Energy-based models · Some retrieval models

Merits

Margin-based — once the model has classified an example correctly with enough confidence, that example stops contributing to the loss. Focuses learning on the hard, near-boundary cases.

Theoretical guarantees about generalization (statistical learning theory was built around SVMs).

Demerits

Not differentiable at the kink. Doesn't produce probability outputs — calibration requires post-hoc methods like Platt scaling.

Largely superseded by cross-entropy for deep learning. Cross-entropy continues to provide gradient signal even for confident-correct predictions, which empirically converges better with neural networks.

§ 21KL divergence

Cross-entropy measures the loss between a prediction and a hard label. KL divergence generalizes this to measure the "distance" between two full probability distributions. It's not symmetric — KL(P‖Q) is not the same as KL(Q‖P) — but it's zero exactly when the two distributions match, and grows when they differ.

KL(p ‖ q) = Σᵢ pᵢ · log(pᵢ / qᵢ)

Equals cross-entropy minus the entropy of the target: KL(p‖q) = H(p, q) − H(p).

Used in: VAE prior matching (KL term in ELBO) · Knowledge distillation (between teacher and student) · PPO (KL penalty variant; the clipped policy ratio plays a similar trust-region role) · RLHF policy regularization (KL penalty against the reference model) · DPO objective
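The relationship between KL divergence, cross-entropy, and entropy is easy to confirm numerically. The two distributions below are arbitrary examples, and terms with pᵢ = 0 are skipped by the usual convention.

```python
import math

def entropy(p):
    """H(p) = -sum p_i log p_i (terms with p_i = 0 contribute nothing)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(p || q) = sum p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # illustrative target distribution
q = [0.5, 0.3, 0.2]   # illustrative model distribution

print(kl(p, q), cross_entropy(p, q) - entropy(p))  # same number, two ways
print(kl(q, p))                                    # differs: KL is asymmetric
print(kl(p, p))                                    # zero when distributions match
```

Because H(p) does not depend on the model, minimizing cross-entropy against a fixed target is the same as minimizing KL(p‖q).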

Example use — VAE

The VAE loss has two terms: a reconstruction term (usually MSE or BCE between input and output) and a KL term that pulls the encoder's posterior toward a standard normal prior: KL(q(z|x) ‖ N(0, I)). The KL term is what gives the VAE its smooth, continuous latent space — without it, the encoder would just memorize.

Merits

The natural measure of distribution difference. Has a clean information-theoretic interpretation (extra bits needed to encode samples from p using a code optimized for q).

Foundational for many modern training paradigms — VAEs, distillation, RLHF — all build on KL.

Demerits

Asymmetric. KL(p‖q) penalizes q being small where p is large, but not the reverse. "Reverse KL" KL(q‖p) is sometimes used for different behavior — choosing one over the other has real consequences for what the model learns.

Undefined when q has zero probability where p has positive probability. Numerical care required.

§ 22Label smoothing

Label smoothing isn't a new loss function; it's a modification of the targets used inside cross-entropy. Instead of using one-hot labels (1 for the true class, 0 for all others), it mixes the one-hot vector with a uniform distribution over all K classes: the true class gets 1 − ε + ε/K and every other class gets ε/K. The model is no longer asked to be 100% confident in any prediction.

target_smoothed = (1 − ε) · onehot + ε · uniform

Typical ε is 0.1. Effectively penalizes overconfidence and improves calibration.

Used in: Inception-v3 / ImageNet · Transformer (Vaswani et al.) · Most modern image classifiers · Machine translation
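Applying the target_smoothed formula to a concrete case makes the resulting numbers explicit. This sketch assumes "uniform" means 1/K over all K classes, including the true one; the function name and the five-class example are illustrative.

```python
def smooth_labels(true_idx, num_classes, eps=0.1):
    """(1 - eps) * onehot + eps * uniform, with uniform = 1 / num_classes."""
    uniform = 1.0 / num_classes
    return [
        (1.0 - eps) * (1.0 if i == true_idx else 0.0) + eps * uniform
        for i in range(num_classes)
    ]

targets = smooth_labels(true_idx=2, num_classes=5, eps=0.1)
print(targets)
print(sum(targets))
```

With K = 5 and ε = 0.1, the true class gets 0.92 and each other class gets 0.02, and the targets still sum to 1.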

Merits

Improves model calibration — predicted probabilities better match true frequencies. Useful when downstream decisions depend on confidence, not just argmax.

Acts as a regularizer, often improving validation accuracy. Standard in transformer training since the original "Attention Is All You Need" paper.

Demerits

Caps the target probability below 1, so the loss no longer rewards pushing any class probability all the way to 1.0. Slightly harms top-1 accuracy in rare cases.

Less effective when combined with knowledge distillation, which already provides soft targets.

Group C

Embedding & contrastive losses

When the goal is to learn similarity, not classification. The training signal compares pairs (or sets) of examples rather than examples to labels.

§ 23Triplet loss

The classic embedding-learning loss. A triplet consists of an anchor, a positive example (similar to the anchor) and a negative (dissimilar). The loss pushes the anchor-positive distance below the anchor-negative distance by at least a margin α. After training, similar items cluster together in embedding space and dissimilar ones are far apart.

L = max(0, d(a, p) − d(a, n) + α)

d is usually Euclidean distance. α is the margin (typically 0.2 for normalized embeddings).

Used in: FaceNet face recognition · Image retrieval · Person re-identification · Speaker verification
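A few hand-picked 2-D points show when a triplet contributes to the loss and when it does not. The embeddings below are invented for illustration, and plain Euclidean distance is assumed.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor = [0.0, 0.0]
positive = [0.3, 0.4]        # distance 0.5 from the anchor
easy_negative = [3.0, 4.0]   # distance 5.0: margin already satisfied
hard_negative = [0.36, 0.48] # distance 0.6: inside the margin

print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
```

The easy triplet yields zero loss and therefore zero gradient, while the hard one yields a loss of about 0.1. Most of the engineering effort in triplet training goes into finding triplets of the second kind.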

Example use — face recognition

FaceNet trains a 128-dimensional embedding such that two photos of the same person are closer in embedding space than photos of different people. After training, identifying a person reduces to a nearest-neighbor lookup in the embedding database — no classifier needed, and new identities can be added without retraining.

Merits

Produces general-purpose embeddings — once trained, the same network can score similarity for any pair, including for classes never seen during training (zero-shot recognition).

The margin parameter gives intuitive control over how strongly to separate classes.

Demerits

Triplet selection is critical and difficult. Easy triplets (where the model already satisfies the margin) contribute zero gradient. Hard negative mining — finding triplets that violate the margin — is a research problem in itself.

Quadratic or cubic explosion in the number of possible triplets. Has been largely superseded by batch-level losses like InfoNCE.

§ 24Contrastive loss (pairwise)

The predecessor to triplet loss. Operates on pairs labeled as "similar" or "dissimilar". For similar pairs, minimize distance directly. For dissimilar pairs, push apart up to a margin — beyond that, contribute no gradient.

L = y · d² + (1 − y) · max(0, m − d)²

y = 1 for similar pair, 0 for dissimilar. m is the margin.
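A minimal sketch of the pairwise formula, assuming a margin of 1.0 (an illustrative choice, not a value from the text) and a precomputed distance d:

```python
def contrastive_pair_loss(d, y, margin=1.0):
    """y * d^2 + (1 - y) * max(0, margin - d)^2 for one pair at distance d."""
    return y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2

similar_apart = contrastive_pair_loss(d=0.5, y=1)     # 0.25: pulled together
dissimilar_near = contrastive_pair_loss(d=0.5, y=0)   # (1.0 - 0.5)^2 = 0.25: pushed apart
dissimilar_far = contrastive_pair_loss(d=1.5, y=0)    # 0.0: already beyond the margin
```

The last case shows the "no gradient beyond the margin" behavior: once a dissimilar pair is farther apart than m, it stops contributing.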

Used in: Siamese networks · Signature verification · Early metric learning

Merits

Conceptually simple — just pairs. Easier to construct training batches than triplets, especially in semi-supervised settings.

Demerits

Doesn't enforce a margin between classes — only an absolute distance threshold for dissimilar pairs. Triplet loss provides better gradients for cluster separation.

Largely superseded by triplet and InfoNCE losses.

§ 25InfoNCE — the modern contrastive loss

InfoNCE (named after Noise Contrastive Estimation) is the loss behind the contrastive learning boom of 2020. Instead of working with single triplets, it treats one batch as a giant classification problem: given an anchor, identify the correct positive among N − 1 negatives. The "loss" is just multi-class cross-entropy where the classes are the items in the batch and the logits are similarities.

L_NCE = −log(exp(sim(a, p)/τ) / Σⱼ exp(sim(a, xⱼ)/τ))

sim is usually cosine similarity. τ is a temperature controlling sharpness. Negatives come from the rest of the batch.

Used in: SimCLR (image self-supervised learning) · MoCo · CLIP (image-text) · Sentence embedding models (SBERT) · Retrieval-augmented generation embeddings

Example use — CLIP

CLIP trains on 400M image-text pairs scraped from the internet. Each batch of N pairs becomes an N × N similarity matrix: row i, column j is the cosine similarity between image i's embedding and text j's embedding. The loss is cross-entropy with the diagonal as the correct match — image i should pair with text i, not any of the N − 1 other texts in the batch. This single objective produces an embedding space where text and image align without any direct supervision of what they mean.
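The N × N picture can be sketched in plain Python. The similarity matrices here are invented for illustration, the temperature of 0.07 is an assumed value, and averaging the image-to-text and text-to-image directions follows the symmetric form CLIP uses. Treat this as a sketch of the idea, not CLIP's actual training code.

```python
import math

def cross_entropy_row(logits, target):
    """-log softmax(logits)[target], computed with the max subtracted for stability."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def clip_style_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over an N x N similarity matrix.

    sim[i][j] is the cosine similarity of image i and text j;
    the correct match for row i is column i (the diagonal).
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    image_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Hypothetical batches of 3 pairs: one well aligned, one with pairs 0 and 1 swapped.
aligned = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
swapped = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.2], [0.0, 0.2, 0.9]]

loss_aligned = clip_style_loss(aligned)
loss_swapped = clip_style_loss(swapped)
```

A matrix with no structure at all (every entry equal) gives a loss of log N, which is the value training starts near before the embeddings carry any signal.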

Merits

Batch-level: every other example in the batch automatically becomes a negative, eliminating triplet-selection problems. Larger batches → harder negatives → better learning.

Works completely self-supervised — augmentations of the same image are "positives", everything else is negative. SimCLR showed this rivals supervised pretraining.

Foundation of modern multimodal models (CLIP) and text embedding models (E5, BGE, OpenAI embeddings).

Demerits

Requires very large batch sizes (typically 1024-32768) for effective negative sampling. Memory-hungry; plain gradient accumulation does not add negatives, so scaling up needs tricks such as memory banks, momentum queues (MoCo), or special infrastructure.

Sensitive to temperature τ. Sensitive to negative composition — false negatives (semantically similar items treated as dissimilar) hurt learning.

Group D

Sequence & LLM training losses

From pretraining a transformer to aligning a chatbot. Each stage of LLM training has its own loss function.

§ 26Next-token cross-entropy — the pretraining loss

The objective that powers every autoregressive language model. At each position in a sequence, the model predicts a probability distribution over the vocabulary for what comes next. The loss is cross-entropy between that distribution and the actual next token. Average across the entire sequence, average across the entire dataset, and you have the pretraining loss for GPT-2, GPT-3, GPT-4, LLaMA, Mistral, and every other autoregressive LLM.

L = −(1/T) · Σ_{t=1}^{T} log p_θ(x_t | x_<t)

Average negative log-likelihood of each token given all previous tokens.

This is just categorical cross-entropy applied autoregressively. What makes it powerful is the scale: train on trillions of tokens with this single objective, and the model learns grammar, facts, reasoning patterns, code, and arguably much more, all from predicting the next token.

Used in: GPT family (1-4+) · LLaMA family · Mistral, Mixtral · Claude, Gemini pretraining · All autoregressive LLMs

Example use

Given the sequence "The capital of France is", a pretrained model outputs a distribution where "Paris" has high probability, "Lyon" has lower probability, and "banana" has near-zero probability. The loss is −log P(Paris | context). Across 13T training tokens, billions of such updates carve out the model's knowledge of geography, syntax, and everything else.

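# In practice: shift labels by one position
import torch.nn.functional as F

logits = model(input_ids)                      # [B, T, V]
shift_logits = logits[:, :-1, :].contiguous()  # predictions made at positions 0..T-2
shift_labels = input_ids[:, 1:].contiguous()   # targets: the tokens at positions 1..T-1
loss = F.cross_entropy(
    shift_logits.view(-1, shift_logits.size(-1)),  # [B*(T-1), V]
    shift_labels.view(-1),                         # [B*(T-1)]
)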

Merits

Self-supervised — no labels required. Any text becomes training data, which is why LLMs can scale to trillions of tokens of web text.

A single simple objective produces extraordinarily general capabilities. The bitter lesson (that scale plus a simple objective beats sophisticated handcrafted methods) has one of its clearest demonstrations here.

Maximum likelihood estimation, statistically principled.

Demerits

Teacher forcing during training — at every position the model conditions on the true previous tokens, not its own predictions. This creates an "exposure bias" between training and inference where errors compound.

Loss treats all tokens equally, including filler tokens and stop words that carry little information. Better importance weighting is an open research area.

Doesn't directly optimize what we care about (helpful, harmless responses) — only the likelihood of the training distribution.

§ 27Masked language modeling loss (BERT)

BERT took a different approach. Instead of predicting the next token, it randomly masks 15% of tokens and asks the model to recover them, conditioning on the rest of the sequence — both left and right context. This is what makes BERT bidirectional, and what makes it good at understanding tasks rather than generation tasks.

L_MLM = −Σ_{i ∈ masked} log p_θ(x_i | x_{∖masked})

x_{∖masked} is the sequence with the masked positions hidden. Only masked positions contribute to the loss. Unmasked positions just provide context.

The 15% masking ratio was chosen empirically. Of those 15%, the original paper replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged — a hack to reduce train-test mismatch since [MASK] never appears at inference.
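The 80/10/10 rule is easy to misread, so here is a sketch of the corruption step. It assumes each token is selected independently with probability 0.15. The function name, toy vocabulary, and seed are illustrative, and the original implementation may sample positions differently.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted, labels); labels[i] is None where no loss is taken."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            corrupted.append(tok)     # not selected: serves as context only
            labels.append(None)
            continue
        labels.append(tok)            # selected: the model must recover tok
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            corrupted.append(tok)                 # 10%: leave unchanged
    return corrupted, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 200
corrupted, labels = mask_tokens(tokens, vocab)
selected_fraction = sum(label is not None for label in labels) / len(labels)
```

Note that the loss is taken on every selected position, including the 10% left unchanged, so the model cannot assume that an unmasked token is always correct.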

Used in: BERT · RoBERTa · DeBERTa · DistilBERT · Most encoder-only models

Merits

Bidirectional context — every position sees both left and right context, unlike autoregressive models. Better for understanding tasks (classification, NER, QA).

Self-supervised like autoregressive pretraining; any text is training data.

Demerits

Only 15% of positions produce loss signal per batch, so each sequence yields roughly 6-7× fewer training targets than autoregressive training (1 / 0.15 ≈ 6.7). Matching GPT-style total signal takes correspondingly more passes over the data.

Can't directly generate text without specialized decoding. Largely displaced by decoder-only autoregressive models for generation and general-purpose use since 2022, though encoders remain common for classification and retrieval.

§ 28Perplexity

Perplexity isn't a training loss. It's the most common evaluation metric for language models, and it's just the exponential of the cross-entropy loss. If cross-entropy is "average surprise per token" (in nats when computed with the natural log, in bits with log base 2), perplexity is "effective vocabulary size the model is choosing among at each step." Lower is better.

PPL = exp(L_CE) = exp(−(1/T) · Σ log p(x_t | x_<t))

A model with PPL = 10 is on average choosing among an effective 10 tokens per position. Uniform over 50k vocab would give PPL = 50,000.
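Both numbers in that note can be reproduced in a few lines. The probability lists below are made up; each entry stands for the probability the model assigned to the token that actually occurred.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

vocab_size = 50_000
uniform = perplexity([1 / vocab_size] * 20)   # a uniform guesser: PPL equals vocab size
confident = perplexity([0.1] * 20)            # always 10% on the right token: PPL of 10
mixed = perplexity([0.5, 0.05, 0.2, 0.01])    # geometric mean of 1/p over the tokens
```

Because perplexity is a geometric mean of inverse probabilities, a single very unlikely token (the 0.01 above) raises it more than an arithmetic average would suggest.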

Used in: LM evaluation · Comparing model checkpoints · Reporting in LM papers

Merits

Intuitive interpretation: "effective vocabulary size" gives a direct way to compare models. A drop from PPL 20 to PPL 10 means the model has halved the effective number of candidates it is choosing among at each step.

Standardized across LM literature, making cross-paper comparison straightforward when models share a tokenizer and test set.

Demerits

Doesn't measure what users care about. Lower perplexity correlates with helpfulness, factuality, and reasoning, but the correlation is loose at the top end. Two models with identical perplexity can have very different chat quality.

Sensitive to tokenization. Two models with different tokenizers can't be compared directly via perplexity.

§ 29Knowledge distillation loss

Distillation transfers knowledge from a large "teacher" model to a smaller "student" model. The student learns to match the teacher's softmax distribution, not just the hard target labels. The teacher's "dark knowledge" — the relative probabilities it assigns to non-target classes — turns out to be more informative than the labels alone.

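L = α · L_CE(student, true labels) + (1 − α) · T² · KL(teacher_T ‖ student_T)

T is a temperature applied to both softmaxes (typically 2-10). Higher T reveals more of the teacher's "soft" structure. The T² factor keeps the soft-target gradients on the same scale as the hard-label term as T changes.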
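A sketch of the combined loss for a single example. It computes the KL term from the teacher's softened distribution to the student's, which is the direction of the usual soft-target formulation. The logits, α = 0.5 and T = 2.0 are illustrative assumptions.

```python
import math

def softmax(logits, T=1.0):
    """Softmax of logits / T, with the max subtracted for stability."""
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, T=2.0):
    hard = -math.log(softmax(student_logits)[label])  # CE on the true label, T = 1
    soft = kl(softmax(teacher_logits, T), softmax(student_logits, T))
    return alpha * hard + (1 - alpha) * T ** 2 * soft

teacher = [2.0, 1.0, -1.0]   # hypothetical teacher logits
student = [1.5, 0.2, -0.3]   # hypothetical student logits
loss = distillation_loss(student, teacher, label=0)
matched = distillation_loss(teacher, teacher, label=0, alpha=0.0)  # KL term vanishes
```

Raising T flattens both distributions, which is what exposes the teacher's relative ranking of the non-target classes.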

Used in: DistilBERT (40% smaller, 97% of BERT) · TinyBERT · MobileBERT · Smaller LLM variants (e.g., Llama distillations) · Gemma teacher-student training

Example use

DistilBERT is trained to match BERT's output distribution at temperature 2.0. The idea is easiest to see on a classifier: for a sentence the teacher scores as 70% positive, 25% neutral, 5% negative, the student is trained to produce the same distribution, not just to predict "positive". The student learns that positive and neutral are similar (because the teacher said so), even though the hard label loses this information.

Merits

Produces smaller, faster models with much of the teacher's performance. DistilBERT retains about 97% of BERT's GLUE score while being 40% smaller.

Soft targets carry more information than hard labels — the teacher's probability distribution acts as a richer supervision signal.

Demerits

Requires a strong teacher already trained. Two hyperparameters (α, T) to tune. The student inherits the teacher's biases and errors.

Distillation gains diminish as the student approaches the teacher's capacity — you can't make an arbitrary student match an arbitrary teacher.

§ 30SFT — supervised fine-tuning

After pretraining produces a base language model that knows how to predict text, supervised fine-tuning teaches it to follow instructions. The data is curated: humans write or select example prompts and ideal responses. The loss is the same next-token cross-entropy as pretraining, but applied only to the response tokens, not the prompt.

L_SFT = −(1/|response|) · Σ_{t ∈ response} log p_θ(x_t | x_<t)

Cross-entropy masked to response tokens only. Loss on the prompt is set to zero.
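The masking is the only difference from pretraining, and a sketch makes that concrete. The per-token log-probabilities and the boolean mask below are hypothetical inputs; in a real pipeline they would come from the model and the chat template.

```python
import math

def sft_loss(token_log_probs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs[t] is log p(x_t | x_<t); is_response[t] marks
    whether position t belongs to the response.
    """
    picked = [lp for lp, keep in zip(token_log_probs, is_response) if keep]
    return -sum(picked) / len(picked)

# Hypothetical sequence: 3 prompt tokens followed by 2 response tokens.
log_probs = [math.log(0.01), math.log(0.02), math.log(0.05),
             math.log(0.5), math.log(0.25)]
mask = [False, False, False, True, True]

loss = sft_loss(log_probs, mask)   # -(log 0.5 + log 0.25) / 2
```

In PyTorch the same effect is commonly obtained by setting the prompt positions of the label tensor to the `ignore_index` of `F.cross_entropy` (default -100), so the shifted-label code from the pretraining section carries over unchanged.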

Used in: InstructGPT · ChatGPT, Claude, Gemini fine-tuning · LLaMA-Chat, Mistral-Instruct · All instruction-tuned LLMs

Example use

A pretrained base model continues "What is the capital of France?" with something like "What is the capital of Germany? What is..." (continuing the pattern). After SFT on instruction-response pairs, it answers "The capital of France is Paris." The loss is computed only on the answer; the model already knows how to read prompts.

Merits

The simplest and most effective way to turn a base model into an instruction-follower. A few thousand high-quality examples can dramatically change model behavior.

Same loss as pretraining — no new training infrastructure needed.

Demerits

Only as good as the demonstrations. The model learns to mimic the surface style of the responses, not the underlying preferences that produced them.

Doesn't capture preferences over multiple possible responses — humans rarely write the single "best" answer, just a reasonable one.

§ 31Reward model loss (Bradley-Terry)

To go beyond SFT, modern alignment uses human preferences. Annotators are shown two model responses to the same prompt and asked which is better. The reward model is trained to assign higher scalar scores to preferred responses than to dispreferred ones. The loss is derived from the Bradley-Terry preference model — under which the probability that response y₁ is preferred over y₂ is the sigmoid of their reward difference.

L_RM = −log σ(r_θ(x, y_w) − r_θ(x, y_l))

y_w is the preferred ("winning") response, y_l is the dispreferred ("losing") response.

Used in: InstructGPT reward model · RLHF for ChatGPT, Claude, Gemini · Preference modeling research · LLaMA 2-Chat preference models

Example use

Given a prompt and two responses, the reward model outputs scalar scores 0.82 and 0.31. The training signal: σ(0.82 − 0.31) = σ(0.51) ≈ 0.625. Since the human said y₁ was preferred, the loss is −log(0.625) ≈ 0.47, pulling reward(y₁) up and reward(y₂) down.
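The arithmetic in this example can be reproduced directly. The function names are illustrative; the scores 0.82 and 0.31 are the ones from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bradley_terry_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_w - r_l) for one preference pair."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))

p_prefer = sigmoid(0.82 - 0.31)            # about 0.625
loss = bradley_terry_loss(0.82, 0.31)      # about 0.47
flipped = bradley_terry_loss(0.31, 0.82)   # larger: the pair is ranked the wrong way
```

Only the difference between the two scores matters, so adding a constant to every reward leaves the loss unchanged. This is why reward model outputs have no meaningful absolute scale.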

Merits

Captures relative preferences without requiring annotators to produce absolute scores — much easier to collect reliably.

Once trained, the reward model becomes a stand-in for human judgment at scale, enabling RL fine-tuning on millions of generated responses.

Demerits

Reward hacking — the policy can find ways to score high reward that don't actually align with what humans want. The reward model is an imperfect proxy.

Annotator disagreement is high, and the Bradley-Terry assumption (transitive, scalar utility) often doesn't hold.

§ 32PPO — Proximal Policy Optimization (RLHF)

With a reward model in hand, the next step is to fine-tune the language model to produce responses that score high under the reward model — without drifting too far from the original SFT model and losing fluency. PPO is the policy gradient algorithm that achieves this. The objective combines the expected reward with a KL penalty against a frozen reference (the SFT model) to prevent reward hacking.

L_PPO = 𝔼[min(r_t(θ)·A_t, clip(r_t(θ), 1−ε, 1+ε)·A_t)] − β · KL[π_θ ‖ π_ref]

An objective to maximize. r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t) is the policy ratio, A_t is the advantage, clip prevents huge updates, β controls the KL penalty.

The clipping is what puts the "proximal" in PPO: by clipping the policy ratio to [1 − ε, 1 + ε], it prevents catastrophic policy updates that would take the model far from where it can still produce coherent text. The KL term on top of that ensures the policy stays in a neighborhood of the SFT model.
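How the min and the clip interact is easiest to see on single samples. This sketch covers only the clipped surrogate term (the KL penalty is left out), and ε = 0.2 is an assumed, commonly used value.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for one sample."""
    clipped_ratio = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the objective stops growing once the ratio passes 1 + eps.
inside = clipped_surrogate(ratio=1.1, advantage=2.0)      # 1.1 * 2.0 = 2.2
capped = clipped_surrogate(ratio=1.5, advantage=2.0)      # 1.2 * 2.0 = 2.4

# Negative advantage: a ratio that moved the wrong way keeps its full penalty,
# while a ratio that already dropped below 1 - eps earns no further credit.
penalized = clipped_surrogate(ratio=1.5, advantage=-2.0)  # 1.5 * -2.0 = -3.0
floored = clipped_surrogate(ratio=0.5, advantage=-2.0)    # 0.8 * -2.0 = -1.6
```

Taking the min makes the bound pessimistic: the policy never gains from moving the ratio outside the clip range, but it is still charged in full when a move makes things worse.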

Used in: InstructGPT · ChatGPT (original alignment) · Claude (Anthropic's RLAIF variant) · LLaMA 2-Chat · Most pre-2024 RLHF pipelines

Merits

The proven workhorse of LLM alignment from 2022-2023. Empirically effective at improving helpfulness and harmlessness above what SFT achieves alone.

The KL penalty against the reference policy prevents the model from degenerating into reward-hacking gibberish.

Demerits

Complex pipeline: requires a trained reward model, a frozen reference policy, an actor model, and often a value/critic model — four model instances at once. Memory and compute hungry.

Many hyperparameters to tune (KL coefficient, clip range, learning rate, number of PPO epochs). Unstable training, sensitive to reward model quality.

Largely being replaced by DPO and related methods that achieve similar results without the RL machinery.

§ 33DPO — Direct Preference Optimization

The breakthrough paper of 2023. DPO showed that PPO's two-stage process (train reward model → run PPO with KL penalty against reference) optimizes the same objective as a single supervised loss directly on preference pairs. The reward model is implicit, encoded in the policy itself. No reward training, no rollouts, no PPO: just a single-stage loss almost as simple as SFT.

L_DPO = −log σ(β · log(π_θ(y_w|x) / π_ref(y_w|x)) − β · log(π_θ(y_l|x) / π_ref(y_l|x)))

π_θ is the trainable policy, π_ref the frozen SFT model. β controls deviation from reference.

The intuition: every preference pair (chosen, rejected) provides a signal. The loss increases the relative likelihood the policy assigns to chosen responses and decreases it for rejected ones — but always weighted against what the reference model thought, so the policy can't drift too far.

Used in: Zephyr-7B · Tülu 2 · Mistral preference variants · LLaMA 3 preference fine-tuning · Most open-source preference-tuned models since 2024

Example use

A preference dataset has 60k pairs of chat responses. The DPO loop is essentially: take a batch of pairs, compute log-probabilities of chosen and rejected responses under both the current policy and the frozen reference, plug into the formula above, backprop. No reward model, no PPO, no rollouts. Only two models sit in memory (policy and reference), and the reference log-probabilities can be precomputed once, so DPO fits on far more modest hardware than a PPO run.
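Once the four log-probabilities are available, the loop described above reduces to a few lines. The values below are hypothetical summed log-probabilities of whole responses, and β = 0.1 is an assumed setting.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference.
    """
    chosen_ratio = policy_chosen_lp - ref_chosen_lp
    rejected_ratio = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log1p(math.exp(-margin))   # identical to -log(sigmoid(margin))

# Policy equal to reference: both log-ratios are 0, so the loss is log 2.
at_init = dpo_loss(-40.0, -45.0, -40.0, -45.0)

# Policy raised the chosen response by 2 nats and lowered the rejected one by 5:
# margin = 0.1 * (2 - (-5)) = 0.7.
improved = dpo_loss(-38.0, -50.0, -40.0, -45.0)
```

The reference terms act as a baseline: a response the reference already found likely earns the policy no credit until the policy rates it even higher.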

Merits

Dramatically simpler than PPO — one model to train, no rollouts, no reward model. Approximately matches PPO's quality while being orders of magnitude cheaper.

Stable training without RL's sensitivity to hyperparameters. Standard cross-entropy-flavored loss that any researcher can debug.

Has rapidly become the default preference-tuning method in open-source.

Demerits

The reference model is fixed during training, which can over-constrain the policy. Several extensions (DPOP, KTO) address this.

Can produce models that are overly confident on training preferences but degrade on out-of-distribution prompts. The implicit reward is not separately validated.

The β hyperparameter is critical and dataset-dependent.

§ 34KTO, ORPO, IPO — the DPO family

DPO inspired an entire family of follow-ups, each addressing some weakness of the original.

KTO — Kahneman-Tversky Optimization

DPO requires preference pairs (a chosen and rejected response for the same prompt). KTO works with single labeled examples — each response is just marked "good" or "bad", no pair required. It uses prospect theory (Kahneman-Tversky) to model human utility asymmetrically: losses hurt more than equivalent gains. This is more data-efficient when you have abundant binary feedback rather than pairwise preferences.

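L_KTO = w · (1 − σ(β · (log(π_θ(y|x) / π_ref(y|x)) − z₀)))

Shown for a desirable y; for an undesirable y the argument of σ flips sign, giving σ(β · (z₀ − log(π_θ(y|x) / π_ref(y|x)))). The weight w depends on whether y is desirable or undesirable. z₀ is a reference value (a running estimate of the KL between policy and reference).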
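A per-example sketch, assuming the sign inside the sigmoid flips for undesirable responses (so that lowering their likelihood reduces the loss). It uses a fixed z₀ = 0 and equal weights; these defaults and the parameter names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kto_loss(log_ratio, desirable, z0=0.0, beta=0.1, w_good=1.0, w_bad=1.0):
    """Per-example KTO-style loss.

    log_ratio is log(pi_theta(y|x) / pi_ref(y|x)); z0 is the reference point.
    """
    if desirable:
        return w_good * (1 - sigmoid(beta * (log_ratio - z0)))
    return w_bad * (1 - sigmoid(beta * (z0 - log_ratio)))

good_raised = kto_loss(log_ratio=5.0, desirable=True)      # below 0.5
good_lowered = kto_loss(log_ratio=-5.0, desirable=True)    # above 0.5
bad_lowered = kto_loss(log_ratio=-5.0, desirable=False)    # below 0.5
```

Setting `w_bad` larger than `w_good` is one way to express the "losses hurt more than gains" asymmetry, or to rebalance a dataset with far more good labels than bad ones.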

ORPO — Odds Ratio Preference Optimization

ORPO combines SFT and preference optimization into a single stage. The loss is SFT plus a small odds-ratio penalty that pushes the model away from dispreferred responses. By doing both at once, ORPO skips the SFT-then-DPO two-stage pipeline entirely.

L_ORPO = L_SFT(y_w) − λ · log σ(log(p(y_w)/(1−p(y_w))) − log(p(y_l)/(1−p(y_l))))

Combines next-token CE on the winner with an odds-ratio penalty against the loser.
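A sketch of the two terms for one pair, treating p as a per-response probability strictly between 0 and 1 (how p is obtained from token probabilities is left out here). The probabilities and λ = 0.1 are illustrative assumptions.

```python
import math

def log_odds(p):
    """log(p / (1 - p)) for a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))

def orpo_loss(p_chosen, p_rejected, sft_nll, lam=0.1):
    """sft_nll - lam * log sigmoid(log_odds(p_chosen) - log_odds(p_rejected))."""
    gap = log_odds(p_chosen) - log_odds(p_rejected)
    penalty = math.log1p(math.exp(-gap))   # identical to -log(sigmoid(gap))
    return sft_nll + lam * penalty

# Hypothetical response probabilities; the SFT term is the NLL of the winner.
preferred = orpo_loss(p_chosen=0.6, p_rejected=0.3, sft_nll=-math.log(0.6))
reversed_pair = orpo_loss(p_chosen=0.3, p_rejected=0.6, sft_nll=-math.log(0.3))
```

No reference model appears anywhere in the computation, which is what lets ORPO run as a single stage on top of a base model.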

IPO — Identity Preference Optimization

DPO's loss has been shown to overfit on preference pairs — it can push toward extreme probability assignments. IPO replaces the log-sigmoid with a squared difference, producing a "smoother" loss that's more resistant to overfitting on noisy preference data.

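L_IPO = (log(π_θ(y_w|x)/π_ref(y_w|x)) − log(π_θ(y_l|x)/π_ref(y_l|x)) − 1/(2β))²

Quadratic penalty instead of log-sigmoid. The log-ratio gap is regressed toward the finite target 1/(2β), so the loss stops rewarding ever-larger gaps and preference fitting cannot blow up.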
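The contrast with DPO shows up as soon as the gap overshoots its target. In this sketch the inputs are the two log-ratios (policy over reference) for the chosen and rejected responses, and β = 0.1 is an assumed value that puts the target gap at 5.

```python
def ipo_loss(chosen_log_ratio, rejected_log_ratio, beta=0.1):
    """(h - 1 / (2 * beta))^2, with h the gap between the two log-ratios."""
    h = chosen_log_ratio - rejected_log_ratio
    return (h - 1 / (2 * beta)) ** 2

no_gap = ipo_loss(0.0, 0.0)       # h = 0:  (0 - 5)^2  = 25
at_target = ipo_loss(3.0, -2.0)   # h = 5:  (5 - 5)^2  = 0
overshoot = ipo_loss(8.0, -2.0)   # h = 10: (10 - 5)^2 = 25
```

Under the log-sigmoid loss the overshoot case would score better than the target case. Here it is penalized as much as having no gap at all.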

Collective strengths

The family offers a spectrum of trade-offs: KTO for binary feedback, ORPO for unified training, IPO for noisy preferences. All inherit DPO's simplicity over PPO.

Easier to ablate and compare than RL methods, accelerating empirical progress in alignment.

Collective weaknesses

No single winner has emerged. Empirical performance varies by dataset, base model, and evaluation method.

All inherit DPO's fundamental limitation: a frozen reference policy. Long training runs can hit a wall where the reference is too far from where the policy needs to go.

§ 39Loss function reference

Name	Formula	Use case
MSE / L2	(1/n) Σ(y − ŷ)²	Default regression; Gaussian noise assumption
MAE / L1	(1/n) Σ|y − ŷ|	Robust regression; outlier-prone targets
Huber	quadratic for |r|≤δ, linear else	Bounding-box regression; RL value targets
Quantile	max(τr, (τ−1)r)	Prediction intervals; probabilistic forecasts
Binary CE	−[y log ŷ + (1−y) log(1−ŷ)]	Binary classification; multi-label
Categorical CE	−log ŷ_true	Multi-class classification; default deep classification
Focal	−(1−pₜ)^γ log pₜ	Severely imbalanced classification; dense detection
Hinge	max(0, 1 − y·ŷ)	SVMs; margin-based classification
KL divergence	Σ p log(p/q)	VAE; distillation; RL policy regularization
Label smoothing	CE with target = (1−ε) onehot + ε/K	Image classification; transformer training
Triplet	max(0, d(a,p) − d(a,n) + α)	Face recognition; embedding learning
Contrastive (pair)	y·d² + (1−y)·max(0, m−d)²	Siamese networks; signature verification
InfoNCE	−log(exp(sim₊/τ) / Σexp(sim/τ))	CLIP; SimCLR; sentence embeddings; modern retrieval
Next-token CE	−Σ log p(x_t|x_<t)	LLM pretraining; autoregressive generation
Masked LM	−Σ_masked log p(x_i|context)	BERT-family encoder pretraining
Distillation	α·CE + (1−α)·T²·KL	DistilBERT; smaller LLM variants; mobile models
SFT	next-token CE on response tokens only	Instruction tuning; first stage of alignment
Reward model (BT)	−log σ(r(y_w) − r(y_l))	Training reward models for RLHF
PPO	clipped policy gradient + KL penalty	RLHF alignment (ChatGPT, LLaMA-Chat era)
DPO	−log σ(β·log π/π_ref ratio difference)	Modern alignment; preference tuning without RL
KTO	prospect-theoretic single-label preference	Binary good/bad feedback; unbalanced label data
ORPO	SFT + odds-ratio penalty	Single-stage instruction + preference tuning
IPO	squared margin (smoother DPO)	Noisy preference data; long training runs

A final structural observation. The history of losses is a history of asking better questions. MSE asks "how far off?" Cross-entropy asks "how confident, in which direction?" Triplet asks "is this closer than that?" InfoNCE asks "which one among N is the match?" DPO asks "which response do you prefer?" Each shift unlocked new capabilities — robust regression, calibrated classification, learned embeddings, multimodal alignment, instruction following. Pick the loss that asks the question you actually want answered.