Deep Learning · Companion notes

Activation & loss functions,
visualized.

A working tour of the two function families that drive deep learning — from sigmoid to SwiGLU, from MSE to DPO. What each one is, why it exists, what it's used for, and where it breaks down.

30+ functions, 7 interactive demos · Approx. 40 min read
Part I
Activation functions
The element-wise nonlinearities applied after every linear layer. Without them, a neural network of any depth would be equivalent to a single linear layer.

§ 01 Why activations exist

A linear layer computes y = Wx + b. Stack two of them: y = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂). Two linear layers collapse into one. Three collapse into one. A thousand collapse into one. No matter how deep you make a purely linear network, it can only represent linear functions.
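The collapse is easy to check numerically. The sketch below uses small, made-up weights (W1, b1, W2, b2 and x are illustrative values, not taken from any real model) and plain Python lists, and compares the stacked computation with the single merged layer.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A (m x k) and B (k x n), both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

# Illustrative two-layer linear network: 3 inputs -> 2 hidden -> 2 outputs.
W1 = [[1.0, -2.0, 0.5], [0.0, 3.0, 1.0]]
b1 = [0.5, -1.0]
W2 = [[2.0, 1.0], [-1.0, 4.0]]
b2 = [0.0, 2.0]
x = [1.0, 2.0, 3.0]

# y = W2 (W1 x + b1) + b2, computed layer by layer.
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same map as one layer: W_eff = W2 W1, b_eff = W2 b1 + b2.
W_eff = matmul(W2, W1)
b_eff = add(matvec(W2, b1), b2)
one_layer = add(matvec(W_eff, x), b_eff)

print(two_layers, one_layer)
```

Both computations give the same vector for any input, which is the whole argument: depth without a nonlinearity buys nothing.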

The job of an activation function is to break linearity. A single element-wise nonlinearity inserted between layers is enough to make a network a universal approximator: given enough width and a non-polynomial activation, it can approximate any continuous function on a bounded domain to arbitrary accuracy.

That's the conceptual role. The practical role is harder: an activation must be cheap to compute, easy to differentiate, and produce well-behaved gradients during backpropagation. The history of deep learning is largely the history of finding activations that satisfy all three. Sigmoid saturates and kills gradients. ReLU is fast but kills neurons. GELU and Swish are smooth, gradient-friendly, and now dominate large models.

The demo below plots every common activation on the same axes. Pick a function from the dropdown to see its shape, its derivative, and its key properties at a glance.

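Demo 01 · Activation function explorer
[Interactive plot: activation f(x) and derivative f'(x) for the function chosen in the dropdown, with the line y = 0 for reference. Property panel (the values shown match ReLU): range [0, ∞), unbounded above; smooth: no, kink at x = 0; zero-centered: no, outputs always ≥ 0; cost: cheap, a single max op.]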

§ 02 Sigmoid (logistic)

The first nonlinearity that mattered. Sigmoid squashes any real number into the open interval (0, 1), making its output interpretable as a probability. This is exactly what early neural networks needed for binary classification — the output of the final neuron is the model's belief that the input belongs to the positive class.

σ(x) = 1 / (1 + e^(−x))  ·  σ'(x) = σ(x)(1 − σ(x))
Output range (0, 1). Derivative peaks at 0.25 when x = 0.
Used in: Binary classification output · Logistic regression · Old MLPs · LSTM/GRU gates · Attention masks
Example use
The final layer of a sentiment classifier outputs z = 1.4. Applying sigmoid gives σ(1.4) ≈ 0.80 — interpreted as 80% confidence that the review is positive. Pair with binary cross-entropy for training.
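A minimal sketch of the formula and the worked example above, using only the standard library. The two-branch form is one common way to avoid overflow in the exponential; the function names are illustrative, not from any library.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, branching on the sign of x so exp() never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0 here, so exp(x) is at most 1
    return e / (1.0 + e)

def sigmoid_grad(x):
    """Derivative expressed through the output: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

confidence = sigmoid(1.4)         # the sentiment example: about 0.80
peak_gradient = sigmoid_grad(0.0)  # the maximum of the derivative

# Ignoring the weights, a chain of 10 sigmoids scales a gradient
# by at most peak_gradient ** 10.
bound_10_layers = peak_gradient ** 10

print(round(confidence, 2), peak_gradient, bound_10_layers)
```

The last line hints at why the small peak derivative matters: even in the best case, ten stacked sigmoids shrink a gradient by roughly six orders of magnitude.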
Merits

Produces a clean probability interpretation, making it the natural choice for binary output layers and gating mechanisms (which need values in [0, 1] to act as "soft switches").

Smooth everywhere with a derivative that can be expressed in terms of the output itself — convenient for efficient backpropagation.

Demerits

Vanishing gradients. The derivative is at most 0.25, and approaches zero in the saturated tails. Stack a few layers and gradients shrink exponentially — training stalls.

Not zero-centered. All outputs are positive, which biases the gradient direction during training and slows convergence.

Exponential is relatively expensive compared to ReLU's max(0, x).

§ 03 Tanh: hyperbolic tangent

Tanh is sigmoid's zero-centered cousin. It maps real numbers to (−1, 1) instead of (0, 1), preserving sign information and centering outputs on zero. For decades this was the default activation in hidden layers, because the centered output produces more balanced gradients and faster convergence than sigmoid.

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))  ·  tanh'(x) = 1 − tanh²(x)
Output range (−1, 1). Derivative peaks at 1 when x = 0, four times sigmoid's peak.
Used in: RNN hidden states · LSTM cell outputs · Old CNN classifiers · VAE encoder outputs (sometimes)
Example use
An LSTM cell uses tanh twice: once to squash the candidate cell state to (−1, 1), and again to squash the cell state before the sigmoid output gate scales it into the hidden state. The bounded range keeps each candidate update and the emitted hidden state within (−1, 1).
Merits

Zero-centered output dramatically helps gradient flow compared to sigmoid. The peak derivative is 1.0, four times sigmoid's peak.

Still bounded, which keeps activations stable and is critical for recurrent networks where activations are reused across many time steps.

Demerits

Still saturates in both tails — large positive or negative inputs produce near-zero gradients. Vanishing gradient problem only slightly mitigated, not solved.

Two exponentials per evaluation. Replaced almost everywhere by ReLU and friends in feedforward layers.

§ 04 ReLU: rectified linear unit

The activation that made deep learning practical. ReLU is shockingly simple — output the input if positive, zero otherwise — but this simplicity changed everything. For the positive half, the derivative is exactly 1, so gradients flow through deep stacks without attenuation. Training networks with 50, 100, 500 layers became feasible.

ReLU(x) = max(0, x)  ·  ReLU'(x) = 1 if x > 0 else 0
Output range [0, ∞). The constant gradient of 1 for positive inputs largely removes activation-induced vanishing gradients in feedforward nets.
Used in: AlexNet, VGG, ResNet hidden layers · Most CNNs · Most MLPs · Transformer FFN (original)
Example use
In ResNet-50, every convolutional layer is followed by batch normalization and ReLU. A typical activation map after a conv layer has roughly half its values clipped to zero — this sparsity is part of why ReLU networks work well.
Merits

Trivially cheap to compute: a single comparison per element, far cheaper than the exponentials in sigmoid and tanh.

Non-saturating for positive inputs. A gradient of exactly 1 through every active unit means the activation itself does not shrink gradients during backpropagation, however deep the stack.

Induces sparse activations — roughly half of all neurons output zero for any given input, providing a form of natural regularization and computational efficiency.

Demerits

Dying ReLU problem. A neuron whose pre-activation is negative for every input outputs zero and receives zero gradient, so gradient descent cannot revive it. Large learning rates can kill 40%+ of neurons.

Not differentiable at x = 0 (subgradient used in practice). Outputs are unbounded above, which can lead to exploding activations without normalization.

Not zero-centered.

§ 05 Leaky ReLU and PReLU

Leaky ReLU was the first attempt to address dying neurons. Instead of zeroing negative inputs entirely, it lets a small fraction through — typically 0.01. The negative side still has a gradient, so dead neurons can come back to life. PReLU (Parametric ReLU) takes this one step further by making the slope a learnable parameter, optimized jointly with the network weights.

LeakyReLU(x) = x if x > 0 else αx  ·  α typically 0.01
PReLU: α is a learned per-channel parameter rather than fixed.
Used in: GANs (especially the discriminator) · Image generation networks · PReLU in the ImageNet models of He et al. (2015), where it was introduced
Example use
In a GAN discriminator, dying ReLU is catastrophic — a dead discriminator can no longer distinguish real from fake images. Leaky ReLU with α = 0.2 is the standard choice. The gradient through negative inputs keeps the discriminator alive throughout training.
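The difference between the two functions shows up entirely in the gradient for negative inputs. A small sketch, with illustrative function names and the subgradient convention of 0 at x = 0:

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Subgradient convention: treat the kink at x == 0 as gradient 0.
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha

pre_activation = -3.0  # a unit that is currently "off"

# ReLU passes no gradient, so this unit cannot move back toward positive.
print(relu(pre_activation), relu_grad(pre_activation))

# Leaky ReLU with the GAN-style slope 0.2 still passes 20% of the gradient.
print(leaky_relu(pre_activation, 0.2), leaky_relu_grad(pre_activation, 0.2))
```

PReLU would use the same forward pass but treat alpha as a trainable parameter, which this sketch does not attempt to show.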
Merits

Eliminates the dying neuron problem at essentially zero additional compute. The small slope on the negative side keeps gradients flowing.

PReLU lets the network learn the optimal slope per channel, sometimes giving small but consistent accuracy improvements.

Demerits

The leak coefficient α is arbitrary — 0.01 is conventional but not principled. PReLU partially fixes this but adds learnable parameters and complicates regularization.

Empirical improvements over ReLU are inconsistent. Many practitioners find no meaningful gain on standard image classification benchmarks.

§ 06 ELU and SELU

ELU (Exponential Linear Unit) uses an exponential curve on the negative side, smoothly approaching −α as x → −∞. The negative side is therefore bounded, unlike Leaky ReLU's, and with α = 1 the derivative is continuous at zero. SELU (Scaled ELU) chooses α and a scaling factor λ so that activations passing through deep fully connected networks self-normalize toward zero mean and unit variance, removing the need for batch normalization.

ELU(x) = x if x > 0 else α(e^x − 1)
SELU(x) = λ · ELU(x) with α ≈ 1.6733, λ ≈ 1.0507
SELU's constants are not arbitrary: they are derived from a fixed-point analysis of mean and variance through deep nets.
Used in: Self-Normalizing Networks (SNNs) · MLPs without BatchNorm · Sparse data scenarios
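Both formulas fit in a few lines. This sketch uses the rounded constants quoted above (the published values carry more digits) and `math.expm1` for accuracy near zero; the names are illustrative.

```python
import math

SELU_ALPHA = 1.6733   # rounded constant from the formula above
SELU_LAMBDA = 1.0507  # rounded constant from the formula above

def elu(x, alpha=1.0):
    """x for positive inputs, alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)

def selu(x):
    """Scaled ELU: lambda * ELU(x) with the fixed SELU alpha."""
    return SELU_LAMBDA * elu(x, SELU_ALPHA)

# The negative side saturates: ELU approaches -alpha, SELU approaches -lambda * alpha.
print(elu(-50.0), selu(-50.0))
print(elu(2.0), selu(2.0))
```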
Merits

ELU's negative saturation makes outputs robust to noise. Smooth at zero, so optimization landscapes are slightly nicer than ReLU's.

SELU enables training very deep networks without batch normalization, when paired with proper weight initialization (LeCun normal) and alpha-dropout.

Demerits

Exponential is more expensive than ReLU's comparison. In practice, ELU is rarely chosen over ReLU on speed-sensitive workloads.

SELU's self-normalization only works under very specific conditions — particular initialization, particular dropout variant, no skip connections. Brittle and largely superseded by LayerNorm/BatchNorm in practice.

§ 07 GELU: Gaussian Error Linear Unit

GELU is where activation functions enter the era of large language models. Introduced in 2016 and adopted by BERT, GPT-2, GPT-3, and most transformer encoders, it is a smooth nonlinearity that scales each input by the standard normal CDF evaluated at that input. The intuition is that GELU multiplies x by Φ(x), the probability that a standard Gaussian draw falls below x: a probabilistic version of ReLU's hard gate.

GELU(x) = x · Φ(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 x³)))
Φ is the standard normal CDF. The right-hand form is a fast approximation used in practice.
Used in: BERT · GPT-2, GPT-3 · RoBERTa · Vision Transformer (ViT) · T5 (in some variants)
Example use
Every transformer block in BERT-base contains a two-layer feedforward network: FFN(x) = GELU(xW₁ + b₁)W₂ + b₂. With hidden size 3072 between the two linear layers (four times the model width of 768), these FFNs hold roughly half of BERT-base's parameters, more than the attention layers, and are where GELU does its work.
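To see how close the two forms of GELU are, the sketch below implements both with the standard library: the exact one via `math.erf`, using Φ(x) = ½(1 + erf(x/√2)), and the tanh approximation from the formula above. Function names are illustrative.

```python
import math

def gelu_exact(x):
    """x * Phi(x), with Phi written in terms of the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """The tanh-based approximation with the 0.044715 cubic term."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Largest gap between the two forms on a grid from -5 to 5.
grid = [i / 10 for i in range(-50, 51)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)

print(max_gap)
print(gelu_exact(-0.75))  # close to the bottom of the negative dip
```

On this grid the two curves differ by well under one percent of a unit, which is why the approximation is considered interchangeable in practice.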
Merits

Smooth everywhere (including at x = 0), which optimization theory mildly prefers. The non-monotonic dip near x ≈ −0.75 lets it represent slightly more complex behaviors than ReLU.

Empirically outperforms ReLU on transformer architectures, especially at scale. Became the default for transformer feedforward layers throughout 2018-2021.

Demerits

Considerably more expensive than ReLU — the exact form needs an erf evaluation, the approximate form needs tanh plus a cubic.

The advantage over ReLU shrinks with very large models and may not justify the compute cost. Newer architectures favor SwiGLU instead.

§ 08 Swish (SiLU)

Found by an automated search over candidate activation functions at Google in 2017, Swish is even simpler than GELU but behaves similarly. It is just x · σ(x): the input gated by its own sigmoid. PyTorch calls the same function SiLU (Sigmoid Linear Unit); Swish is sometimes generalized with a learnable parameter β, as x · σ(βx).

Swish(x) = x · σ(x) = x / (1 + e^(−x))
Smooth, non-monotonic, unbounded above, bounded below by approximately −0.28.
Used in: EfficientNet · MobileNetV3 (as the hard-swish approximation) · LLaMA family (in SwiGLU) · PaLM · YOLO variants
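The lower bound quoted above can be located with a simple grid search. This is a rough numerical sketch, not a derivation; the grid spacing of 0.001 and the function names are arbitrary choices.

```python
import math

def sigmoid(x):
    """Overflow-safe logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def swish(x, beta=1.0):
    """x * sigmoid(beta * x); beta = 1 gives SiLU."""
    return x * sigmoid(beta * x)

# Scan the negative side for the bottom of the dip.
grid = [-5.0 + i * 0.001 for i in range(5001)]  # -5.0 .. 0.0
min_x = min(grid, key=swish)
min_val = swish(min_x)

print(round(min_x, 3), round(min_val, 4))
```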
Merits

Slightly cheaper than GELU's approximation (just sigmoid times input). Empirically matches or beats GELU on most large-model benchmarks.

The non-monotonic dip allows a tiny amount of negative output for moderately negative inputs, providing more expressive gradients than ReLU.

Demerits

Still more expensive than ReLU. The gains are real but small (often < 1% accuracy) and may not justify the cost in inference-sensitive applications.

Sigmoid evaluation is required at every forward pass, which on mobile hardware can be a meaningful bottleneck.

§ 09 Mish

Mish takes the Swish idea further by replacing the sigmoid gate with a softplus-and-tanh composition. The result is a smooth, non-monotonic curve similar to Swish but with slightly different behavior in the negative tail. Mish gained attention through the YOLOv4 object detection paper, where replacing Leaky ReLU with Mish gave consistent improvements.

Mish(x) = x · tanh(softplus(x)) = x · tanh(ln(1 + e^x))
Smooth, non-monotonic, with a slightly deeper negative dip than Swish.
Used in: YOLOv4 · Some image classification SOTA
Merits

Reported empirical gains over Swish and GELU on computer vision benchmarks, especially in object detection where YOLOv4 became its flagship use.

Demerits

More expensive than both Swish and GELU. The composition of softplus and tanh is computationally heavier without consistent gains elsewhere.

Not widely adopted outside vision. Transformer-based architectures have largely settled on GELU and SwiGLU.

§ 10 Softmax

Softmax is the only activation on this page that operates over a vector rather than element-wise. It takes a vector of arbitrary real numbers ("logits") and produces a probability distribution — non-negative entries that sum to one. This is what turns a neural network's raw output into a categorical distribution over classes, words, or actions.

softmax(z)ᵢ = e^(zᵢ/T) / Σⱼ e^(zⱼ/T)
T is the temperature. Higher T → flatter distribution. T → 0 → argmax (one-hot).

The temperature parameter T controls the sharpness of the resulting distribution. At T = 1 it's the standard softmax. At T → 0 the distribution collapses onto the argmax. At T → ∞ it approaches uniform. This is exactly the "temperature" you tune when generating text from an LLM.
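The three temperature regimes can be reproduced in a few lines. The sketch below is a plain-Python softmax that subtracts the largest scaled logit before exponentiating, a standard guard against overflow; the logits are illustrative values.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with the maximum subtracted for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.2, 2.1, 1.5, 0.8, -0.3]  # illustrative logits

standard = softmax(logits)                 # T = 1
sharp = softmax(logits, temperature=0.1)   # close to one-hot on the argmax
flat = softmax(logits, temperature=100.0)  # close to uniform

for name, probs in [("T=1", standard), ("T=0.1", sharp), ("T=100", flat)]:
    print(name, [round(p, 3) for p in probs])
```

Without the subtraction, a logit of 1000 would overflow `math.exp`; with it, `softmax([1000.0, 1000.0])` returns two equal probabilities.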

Used in: Multi-class classification output · Attention weights (transformer) · Mixture-of-experts routing · Policy gradient action distributions · LLM token sampling
Demo 02 · Softmax with temperature
[Interactive bar chart: temperature slider, shown at T = 1.00 (low → sharp, high → flat), applied to logits [3.2, 2.1, 1.5, 0.8, −0.3].]
Example use — attention
In transformer self-attention, the attention scores QKᵀ/√d are passed through softmax row-wise to produce attention weights. These weights are then used to compute a weighted sum of values: Attention(Q,K,V) = softmax(QKᵀ/√d)V. Softmax here is what makes attention "soft" — every token attends to every other token with some weight, rather than picking one.
Merits

Produces a valid probability distribution, which makes it the natural output for any classification task and the natural way to convert logits into a categorical for sampling.

Differentiable everywhere, with a clean derivative that combines elegantly with cross-entropy loss (the two together produce simple gradients: softmax(z) − y).

Temperature parameter gives smooth control over output sharpness, useful for distillation, exploration, and sampling diversity.

Demerits

Computing all e^(zᵢ) values can be numerically unstable: large logits overflow. The "subtract max" trick is mandatory in any real implementation.

Cost grows linearly with vocabulary size, which becomes a bottleneck for LLMs with vocabularies of 32k-256k tokens. Various approximations (sampled softmax, adaptive softmax) exist.

Produces dense outputs — every class gets nonzero probability, which can be a problem for hard-decision tasks.

§ 11 GLU, SwiGLU, GeGLU: gated linear units

These aren't activation functions in the strict sense — they're entire layer architectures. A Gated Linear Unit splits a linear projection into two halves, applies an activation to one half, and element-wise multiplies them together. The "gate" decides how much of the other half to let through. SwiGLU (Swish-GLU) and GeGLU (GELU-GLU) are the most successful variants and now dominate state-of-the-art LLM feedforward layers.

GLU(x, W, V) = (xW) ⊙ σ(xV)
SwiGLU(x) = (xW) ⊙ Swish(xV)  ·  GeGLU(x) = (xW) ⊙ GELU(xV)
⊙ is the element-wise product. Note: at equal hidden size this uses 50% more parameters than a vanilla FFN, so the hidden size is typically narrowed to compensate.
Used in: LLaMA, LLaMA 2, LLaMA 3 · PaLM · Mistral · Gemma · Qwen · Most modern open-weights LLMs
Example use — LLaMA FFN
Every transformer block in LLaMA uses a SwiGLU feedforward network: FFN(x) = (Swish(xW₁) ⊙ xW₂)W₃. There are three weight matrices instead of two, so the hidden dimension is reduced to 2/3 of the usual 4× width (about 8d/3 for model width d) to keep the parameter count comparable to a GELU FFN. At that matched size, the swap consistently improves perplexity at roughly the same inference cost.
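The parameter bookkeeping is simple arithmetic, sketched here alongside a toy forward pass. The model width 768 is an arbitrary illustrative choice, biases are ignored, and the function names are hypothetical; narrowing the hidden width to two thirds of 4d is the compensation assumed.

```python
import math

def swish(x):
    """x * sigmoid(x), written to avoid overflow for very negative x."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)

def vecmat(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out, list of rows)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def swiglu_ffn(x, W1, W2, W3):
    """FFN(x) = (Swish(x W1) * (x W2)) W3, with an element-wise product."""
    gate = [swish(v) for v in vecmat(x, W1)]
    value = vecmat(x, W2)
    hidden = [g * v for g, v in zip(gate, value)]
    return vecmat(hidden, W3)

def ffn_params(d_model, d_hidden):
    """Standard FFN: one up-projection and one down-projection."""
    return 2 * d_model * d_hidden

def swiglu_params(d_model, d_hidden):
    """Gated FFN: two up-projections and one down-projection."""
    return 3 * d_model * d_hidden

d_model = 768  # illustrative width
standard = ffn_params(d_model, 4 * d_model)
gated_naive = swiglu_params(d_model, 4 * d_model)              # 1.5x the standard count
gated_scaled = swiglu_params(d_model, (2 * 4 * d_model) // 3)  # hidden width 8d/3

print(standard, gated_naive, gated_scaled)
```

With the narrowed hidden width the gated layer lands on the same parameter count as the standard one, which is the sense in which the comparison is fair.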
Merits

Consistently improves quality on language modeling benchmarks compared to standard FFN layers using ReLU or GELU. The gating mechanism is more expressive.

The element-wise multiplication is essentially free; the main cost is the extra weight matrix, which can be offset by narrowing the hidden dimension.

Has become the de facto choice in open-source large language models since 2023.

Demerits

Three weight matrices instead of two complicates the implementation — naive code uses 50% more parameters before compensating.

Slightly higher memory pressure during training due to the extra activations needed for the gate.

The theoretical justification for why GLU variants work better remains thin — empirical preference, not derivation.

§ 12 Activation reference

Name | Formula | Range | Typical use
Sigmoid | 1 / (1 + e^(−x)) | (0, 1) | Binary output, gating, attention masks
Tanh | (e^x − e^(−x)) / (e^x + e^(−x)) | (−1, 1) | RNN/LSTM hidden states
ReLU | max(0, x) | [0, ∞) | CNN/MLP hidden layers; default in vision
Leaky ReLU | x if x > 0 else 0.01x | (−∞, ∞) | GAN discriminators; cases where dying ReLU is a risk
PReLU | x if x > 0 else αx (learned) | (−∞, ∞) | Image classification with learnable slope per channel
ELU | x if x > 0 else α(e^x − 1) | (−α, ∞) | Deeper networks; smoother gradient landscape
SELU | λ · ELU(x), specific α, λ | (−λα, ∞) | Self-normalizing networks without BN
GELU | x · Φ(x) | ≈ (−0.17, ∞) | BERT, GPT-2/3, ViT transformer FFN
Swish / SiLU | x · σ(x) | ≈ (−0.28, ∞) | EfficientNet, MobileNetV3; inside SwiGLU
Mish | x · tanh(softplus(x)) | ≈ (−0.31, ∞) | YOLOv4, vision SOTA
Softplus | ln(1 + e^x) | (0, ∞) | Smooth ReLU; positivity constraints (e.g. predicted variance)
Softmax | e^(zᵢ) / Σⱼ e^(zⱼ) | (0, 1), summing to 1 | Classification output, attention weights, action policies
SwiGLU | (xW) ⊙ Swish(xV) | unbounded | LLaMA, PaLM, Mistral, Gemma FFN
GeGLU | (xW) ⊙ GELU(xV) | unbounded | T5 v1.1, some modern LLMs
Part II
Loss functions
The scalar objective every neural network minimizes. Different loss functions encode different beliefs about what counts as a good prediction.
Group A
Regression losses
When the target is a continuous number, the question is "how far off."

§ 13 Mean squared error (MSE / L2)

The default loss for regression. MSE penalizes the squared distance between prediction and target, then averages across the dataset. The squaring has two effects: it makes the loss non-negative, and it punishes large errors disproportionately. An error of 4 contributes 16 to the loss, while an error of 1 contributes 1.

MSE = (1/n) · Σᵢ (yᵢ − ŷᵢ)²
Also called L2 loss. The squared form makes the gradient proportional to the error.
Used in: Linear regression · House price prediction · Image reconstruction (autoencoders) · VAE reconstruction term · Q-learning value targets
Example use
A house price model predicts $420k for a house that actually sold for $400k. Working in thousands of dollars, the squared error is (420 − 400)² = 400. If another house was off by $40k, its squared error is 1600: four times the loss and, because the gradient is linear in the residual, twice the gradient signal, pulling training harder toward fixing the bigger miss.
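# PyTorch
import torch
import torch.nn.functional as F

pred = torch.tensor([420.0, 360.0])    # illustrative predictions, in $k
target = torch.tensor([400.0, 400.0])  # illustrative sale prices, in $k
loss = F.mse_loss(pred, target)
# Equivalent:
loss = ((pred - target) ** 2).mean()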
Merits

Smooth and differentiable everywhere, with a clean closed-form solution for linear models (OLS). Gradient is linear in the residual, making optimization well-behaved.

Statistically corresponds to maximum likelihood estimation under Gaussian noise — the right choice when residuals are approximately normal.

Demerits

Extremely sensitive to outliers. A single mislabeled data point with a huge residual dominates the gradient and can derail training.

Loss units are the square of the target's units, making the value harder to interpret. RMSE (square root of MSE) is often reported instead.

§ 14 Mean absolute error (MAE / L1)

MAE is the natural alternative to MSE. Instead of squaring residuals, it takes their absolute value. This makes every error contribute linearly to the loss: an error of 4 contributes 4, not 16. The practical consequence is robustness: a single outlier contributes a gradient of the same magnitude as any other point, instead of dominating the training signal.

MAE = (1/n) · Σᵢ |yᵢ − ŷᵢ|
Also called L1 loss. The gradient has constant magnitude (the sign of the error), so it does not grow with error size.
Used in: Robust regression · Time series with outliers · Forecasting (alongside MAPE) · Image-to-image with sharp targets
Example use
In image super-resolution, MSE produces blurry outputs: its minimizer is the mean of all plausible targets, so the network averages possible details to reduce error. MAE tends to produce sharper images because its minimizer is the median, which stays closer to one plausible detail instead of blending them all.
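A one-dimensional analogue makes the difference concrete. In the sketch below, a single constant prediction is fitted to five made-up targets, one of them an outlier, by brute-force search over a grid; the data and names are illustrative only.

```python
import statistics

def mse(c, ys):
    """Mean squared error of the constant prediction c."""
    return sum((y - c) ** 2 for y in ys) / len(ys)

def mae(c, ys):
    """Mean absolute error of the constant prediction c."""
    return sum(abs(y - c) for y in ys) / len(ys)

ys = [1.0, 2.0, 3.0, 4.0, 100.0]             # illustrative targets with one outlier
candidates = [i / 10 for i in range(1001)]   # 0.0, 0.1, ..., 100.0

best_mse = min(candidates, key=lambda c: mse(c, ys))
best_mae = min(candidates, key=lambda c: mae(c, ys))

print(best_mse, statistics.mean(ys))    # MSE settles on the mean
print(best_mae, statistics.median(ys))  # MAE settles on the median
```

The outlier drags the MSE-optimal prediction far from the bulk of the data, while the MAE-optimal prediction stays with the typical values.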
Merits

Robust to outliers — single extreme errors don't dominate the gradient.

Loss value is in the same units as the target, making it directly interpretable ("on average we're off by 12.3 dollars").

Demerits

Not differentiable at zero (the absolute value has a kink). Subgradient methods work but optimization can be slightly less smooth than with MSE.

Constant gradient magnitude means the loss doesn't shrink the gradient as predictions improve — convergence to high precision is slower than MSE near the optimum.

§ 15 Huber loss (smooth L1)

Huber is the compromise. It's quadratic for small errors (like MSE) and linear for large errors (like MAE). A threshold parameter δ controls where the transition happens. The result is a loss that's smooth at zero, has bounded gradient magnitude for outliers, and behaves like MSE near the optimum — getting the best of both worlds.

L_δ(y, ŷ) = ½(y − ŷ)² if |y − ŷ| ≤ δ
L_δ(y, ŷ) = δ(|y − ŷ| − ½δ) otherwise
PyTorch's smooth L1 loss with its default β = 1 coincides with Huber at δ = 1.
Used in: Faster R-CNN bounding box regression · Robust regression · DQN reinforcement learning targets · Object detection (most modern detectors)
Demo 03 · MSE vs MAE vs Huber
[Interactive plot of MSE (½r²), MAE (|r|), and Huber (δ-dependent) against the residual r. Controls: Huber δ, the transition point between quadratic and linear, shown at 1.00; residual range, shown at ±5.]
Watch how Huber stays smooth at zero like MSE but bounds the gradient like MAE for large residuals.
Example use — bounding boxes
Faster R-CNN regresses bounding box coordinates with smooth L1 (Huber with δ = 1). If MSE were used, a single hard example with a wildly wrong predicted box would dominate the gradient. With smooth L1, large errors contribute linearly, keeping training stable on diverse images.
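The piecewise definition and its gradient can be sketched directly on a scalar residual r = y − ŷ. The default δ = 1 matches the smooth L1 case discussed above; the function names are illustrative.

```python
def huber(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r
    return delta * (a - 0.5 * delta)

def huber_grad(r, delta=1.0):
    """Derivative with respect to r: the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, r))

for r in [-10.0, -0.3, 0.5, 1.0, 3.0]:
    print(r, huber(r), huber_grad(r))
```

The clipped gradient is the practical point: a wildly wrong box contributes a gradient of magnitude δ, no more than a moderately wrong one.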
Merits

Combines MSE's smoothness near zero with MAE's outlier robustness. Smooth at every point including zero (unlike MAE).

The transition parameter δ gives explicit control over how aggressively to treat outliers as outliers.

Demerits

Requires choosing δ — too small and it behaves like MAE everywhere; too large and it's just MSE.

Slightly more expensive than either MSE or MAE alone, and the piecewise definition complicates analytical derivations.

§ 16 Quantile (pinball) loss

The losses above target the mean (MSE), the median (MAE), or a blend of the two (Huber). Quantile loss generalizes this: train with quantile level τ and the network learns to predict the τ-th quantile of the conditional distribution. Set τ = 0.5 and you recover MAE up to a factor of ½ (median regression); set τ = 0.9 and you get the 90th percentile. This is essential for probabilistic forecasting.

L_τ(y, ŷ) = max(τ(y − ŷ), (τ − 1)(y − ŷ))
Asymmetric: penalizes under-predictions more if τ > 0.5, over-predictions more if τ < 0.5.
Used in: Demand forecasting · Prediction intervals · Risk-aware regression · Probabilistic time series (DeepAR, MQ-CNN)
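That minimizing this loss really recovers a quantile can be checked by brute force. The sketch fits one constant to the made-up targets 1, 2, ..., 100 and searches integer candidates; names and data are illustrative.

```python
def pinball(y, y_hat, tau):
    """Quantile loss for a single target/prediction pair."""
    diff = y - y_hat
    return max(tau * diff, (tau - 1.0) * diff)

ys = [float(v) for v in range(1, 101)]       # illustrative targets 1..100
candidates = [float(v) for v in range(102)]  # constant predictions 0..101

def mean_pinball(c, tau):
    return sum(pinball(y, c, tau) for y in ys) / len(ys)

best_90 = min(candidates, key=lambda c: mean_pinball(c, 0.9))
best_50 = min(candidates, key=lambda c: mean_pinball(c, 0.5))

# Under-prediction by 2 costs far more than over-prediction by 2 when tau = 0.9.
under = pinball(10.0, 8.0, 0.9)
over = pinball(10.0, 12.0, 0.9)

print(best_90, best_50, under, over)
```

On this data the loss is flat between two neighbouring targets, so the search may return either end of that interval; both are valid 90th-percentile (or median) estimates.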
Merits

Models distributions, not just point estimates. Train multiple heads at different quantiles (0.1, 0.5, 0.9) to get a complete prediction interval.

Asymmetric penalty allows tuning for asymmetric costs — e.g., under-stocking a warehouse may be worse than over-stocking, motivating τ > 0.5.

Demerits

Not differentiable at zero. Different quantile heads can produce inconsistent predictions (e.g., the 0.9 quantile prediction below the 0.5 quantile), requiring monotone constraints.

Less intuitive than MSE/MAE and harder to communicate to non-specialists.

Group B
Classification losses
When the target is a discrete label, the question becomes "how confident, in which direction."

§ 17 Binary cross-entropy

The loss function that powers logistic regression and every binary neural classifier. BCE measures the negative log-likelihood of the true label under the predicted probability. If you're confident and right, the loss is near zero. If you're confident and wrong, the loss explodes — that's the asymmetry that makes it so effective.

BCE = −[y · log(ŷ) + (1 − y) · log(1 − ŷ)]
ŷ ∈ (0, 1) is the predicted probability. Loss → ∞ if you predict 0 for a true 1.

Numerical stability matters. Computing the log of a sigmoid output directly breaks down when the sigmoid saturates: in floating point the output rounds to exactly 0 or 1, and the log becomes −∞. The standard implementation, BCE-with-logits, takes raw logits and combines sigmoid and log into one numerically safe operation using the log-sum-exp trick.
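One common way to write the fused operation is max(z, 0) − z·y + log(1 + e^(−|z|)), which is algebraically equal to BCE applied to σ(z). The sketch below compares it with the naive two-step version in plain Python; it illustrates the idea and is not claimed to be the exact code any framework uses.

```python
import math

def bce_with_logits(z, y):
    """Stable BCE for a raw logit z and a label y in {0, 1}."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def bce_naive(z, y):
    """Sigmoid first, then log: fails once the sigmoid saturates."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def naive_fails(z, y):
    """True if the naive version raises instead of returning a loss."""
    try:
        bce_naive(z, y)
    except (ValueError, OverflowError):
        return True
    return False

print(bce_with_logits(1.4, 1.0), bce_naive(1.4, 1.0))      # agree for moderate logits
print(bce_with_logits(40.0, 0.0), naive_fails(40.0, 0.0))  # naive hits log(0)
```

For a logit of 40 with label 0, the sigmoid rounds to exactly 1.0 in double precision, so the naive version attempts log(0), while the fused form returns a loss of about 40.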

Used in: Binary classification · Logistic regression · Multi-label classification (per-class BCE) · GAN discriminator training · Click-through rate prediction
Demo 04 · Cross-entropy loss landscape
[Plot of loss against predicted probability: one curve for true label = 1, one for true label = 0, with a reference line at 50% confidence.]
Notice the asymmetric penalty. Predicting 0.99 when the true label is 1 costs 0.010. Predicting 0.01 when the true label is 1 costs 4.6 — 460× more. This unbounded growth in the wrong direction is what forces neural networks to be honest about uncertainty.
Example use
A spam classifier outputs a sigmoid probability for each email. For a true spam email (y = 1), if the model predicts ŷ = 0.92, the loss is −log(0.92) ≈ 0.083. If it incorrectly predicts ŷ = 0.05, the loss is −log(0.05) ≈ 3.0, about 36× larger. The gradient with respect to the logit, ŷ − y, grows from −0.08 to −0.95, roughly 12× larger in magnitude, which pushes the model to fix the confidently wrong prediction quickly.
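import torch
import torch.nn.functional as F

logits = torch.tensor([2.4, -3.0])  # illustrative raw scores
target = torch.tensor([1.0, 0.0])   # labels as floats
# Use the with_logits version for numerical stability
loss = F.binary_cross_entropy_with_logits(logits, target)
# Equivalent to, but more stable than:
loss = F.binary_cross_entropy(torch.sigmoid(logits), target)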
Merits

Maximum likelihood estimator for the Bernoulli distribution — statistically principled. Unbounded penalty for confident-wrong predictions provides strong learning signal exactly when it's needed.

Combines cleanly with sigmoid: the gradient through the combined sigmoid-BCE is simply ŷ − y. No vanishing gradient when paired this way.

Demerits

Sensitive to class imbalance. A "predict no" model on 99% negative data has very low BCE despite being useless. Pair with class weights or focal loss.

Numerical issues near probabilities of 0 and 1 require careful implementation (use the with-logits variant).

§ 18 Categorical cross-entropy

The multi-class generalization of BCE. The model produces a softmax distribution over K classes; cross-entropy measures the negative log-probability assigned to the true class. Only the probability assigned to the correct class matters — the rest of the distribution doesn't affect the loss directly (though it affects it indirectly through the softmax normalization).

CE = −Σᵢ yᵢ · log(ŷᵢ) = −log(ŷ_true)
For one-hot targets, only the true class contributes, but softmax couples all logits together.
Used in: ImageNet classification · Multi-class classification everywhere · Language modeling (next-token prediction) · Machine translation · Reinforcement learning policies
Example use
An ImageNet classifier sees a photo of a cat. The true label is class 281 out of 1000. The softmax produces ŷ₂₈₁ = 0.7, so the loss is −log(0.7) ≈ 0.36. The gradient updates push the cat-logit up and pull all other logits down — even though only the cat logit appears in the formula explicitly, every logit is updated because of softmax's denominator.
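The claim that every logit moves, even though only one appears in the formula, can be verified on a toy example. The sketch below computes the loss through a stable log-softmax, forms the analytic gradient softmax(z) − y, and checks it against finite differences; the three logits and the function names are made up.

```python
import math

def log_softmax(logits):
    """log of softmax, computed with the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy(logits, true_idx):
    """Negative log-probability of the true class."""
    return -log_softmax(logits)[true_idx]

def cross_entropy_grad(logits, true_idx):
    """Analytic gradient with respect to the logits: softmax(z) - onehot."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return [p - (1.0 if i == true_idx else 0.0) for i, p in enumerate(probs)]

def numeric_grad(logits, true_idx, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grads.append((cross_entropy(up, true_idx) - cross_entropy(down, true_idx)) / (2 * eps))
    return grads

logits = [2.0, 1.0, 0.1]  # illustrative logits; class 0 is the true class
analytic = cross_entropy_grad(logits, 0)
numeric = numeric_grad(logits, 0)

print(analytic)
print(numeric)
```

The true-class entry of the gradient is negative (gradient descent raises that logit) and every other entry is positive (those logits are lowered), with the entries summing to zero.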
Merits

The natural maximum likelihood loss for categorical distributions. Pairs perfectly with softmax — the gradient is the difference between predicted and target distributions.

Works for any number of classes, from binary up to 50k+ vocabularies in language models.

Demerits

Treats all wrong classes identically. Predicting "cat" instead of "dog" gets the same penalty as predicting "cat" instead of "airplane" — even though one error is much more reasonable than the other.

Encourages overconfidence. The loss only goes to zero as the true class probability goes to 1, which can produce miscalibrated models. Label smoothing addresses this.

Sensitive to mislabeled data. A confidently predicted example with a wrong label incurs an unbounded loss and a near-maximal gradient pulling the model away from the truth.

§ 19 Focal loss

Focal loss was introduced for dense object detection, where the foreground/background imbalance is extreme, roughly 1:1000. Standard cross-entropy was failing because the gradient signal was dominated by the vast number of easy background examples. Focal loss multiplies cross-entropy by a factor (1 − pₜ)^γ that down-weights easy examples (where the model is already correct) relative to hard examples (where the model is wrong or unsure).

FL(pₜ) = −(1 − pₜ)^γ · log(pₜ)
pₜ is the predicted probability of the true class. γ ≥ 0 controls focusing strength. γ = 0 recovers cross-entropy.
Used in: RetinaNet · Dense object detection · Severely imbalanced classification · Medical imaging (rare disease detection)
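The effect of the focusing factor is easiest to see as a ratio against plain cross-entropy. A minimal sketch of the formula above, without the optional class-balancing weight that some implementations add; names are illustrative.

```python
import math

def cross_entropy(p_t):
    """Plain cross-entropy for true-class probability p_t."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    """Cross-entropy scaled by the focusing factor (1 - p_t) ** gamma."""
    return (1.0 - p_t) ** gamma * cross_entropy(p_t)

easy, hard = 0.9, 0.1  # illustrative true-class probabilities

easy_ratio = cross_entropy(easy) / focal_loss(easy)  # how much the easy example shrinks
hard_ratio = cross_entropy(hard) / focal_loss(hard)  # the hard example barely changes

print(easy_ratio, hard_ratio)
```

With γ = 2 the easy example's loss is cut by a factor of about 100, while the hard example keeps 81% of its cross-entropy.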
Demo 05 · Focal loss focusing parameter
[Interactive plot of cross-entropy (γ = 0) against focal loss at the current γ, with the easy-example region highlighted. Control: γ (gamma), shown at 2.0; higher γ → more focus on hard examples.]
γ = 0 recovers standard cross-entropy. At γ = 2 (the RetinaNet default), an easy example with p = 0.9 contributes 100× less than under CE.
Merits

Dramatically improves training on severely imbalanced data without needing to subsample or oversample. The gradient flow naturally focuses on hard examples.

The γ parameter gives explicit control over how aggressively to focus, and reduces to standard CE when γ = 0 for backwards compatibility.

Demerits

Extra hyperparameter γ to tune (typically 1-5, with 2 being a common default). Adds slight compute overhead per training step.

Can over-focus on hard examples that are actually mislabeled, increasing the influence of label noise.

§ 20 Hinge loss

The loss that defines support vector machines. Hinge loss treats classification as a margin-maximization problem rather than a probability problem. In the binary form below, the loss is zero as long as the signed score y · ŷ is at least 1; in the multi-class form, the true class's score must beat every other class's score by at least 1. When the margin is violated, the loss grows linearly with the size of the violation.

Hinge(y, ŷ) = max(0, 1 − y · ŷ)
For binary y ∈ {−1, +1} and raw score ŷ. Zero loss once the margin is satisfied; linear penalty otherwise.
Used in: Support Vector Machines · Margin-based ranking · Energy-based models · Some retrieval models
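A short sketch of the binary form and its subgradient. The scores are illustrative raw model outputs, not probabilities, and the function names are hypothetical.

```python
def hinge(y, score):
    """Binary hinge loss for a label y in {-1, +1} and a raw score."""
    return max(0.0, 1.0 - y * score)

def hinge_grad(y, score):
    """Subgradient with respect to the score: zero once the margin is met."""
    return -float(y) if y * score < 1.0 else 0.0

print(hinge(+1, 2.5), hinge_grad(+1, 2.5))  # margin satisfied: no loss, no gradient
print(hinge(+1, 0.3), hinge_grad(+1, 0.3))  # correct side, but inside the margin
print(hinge(-1, 0.3), hinge_grad(-1, 0.3))  # wrong side of the boundary
```

The first case shows the property that separates hinge from cross-entropy: a confidently correct example contributes nothing at all.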
Merits

Margin-based — once the model has classified an example correctly with enough confidence, that example stops contributing to the loss. Focuses learning on the hard, near-boundary cases.

Comes with margin-based generalization bounds from statistical learning theory, the framework in which SVMs were developed.

Demerits

Not differentiable at the kink. Doesn't produce probability outputs — calibration requires post-hoc methods like Platt scaling.

Largely superseded by cross-entropy for deep learning. Cross-entropy continues to provide gradient signal even for confident-correct predictions, which empirically converges better with neural networks.

§ 21 KL divergence

Cross-entropy measures the loss between a prediction and a hard label. KL divergence generalizes this to measure the "distance" between two full probability distributions. It is not symmetric, since KL(p‖q) generally differs from KL(q‖p), but it is zero exactly when the two distributions match and grows as they diverge.

KL(p ‖ q) = Σᵢ pᵢ · log(pᵢ / qᵢ)
Equals cross-entropy minus the entropy of the target: KL(p‖q) = H(p, q) − H(p).
Used in: VAE prior matching (KL term in ELBO) · Knowledge distillation (between teacher and student) · PPO (KL penalty or clipped trust region) · RLHF policy regularization (KL penalty toward the reference model) · DPO objective
Example use — VAE
The VAE loss has two terms: a reconstruction term (usually MSE or BCE between input and output) and a KL term that pulls the encoder's posterior toward a standard normal prior: KL(q(z|x) ‖ N(0, I)). The KL term is what gives the VAE its smooth, continuous latent space — without it, the encoder would just memorize.
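For a diagonal Gaussian posterior, the KL term against N(0, I) has a well-known closed form. The sketch below assumes the encoder outputs a mean and a log-variance per latent dimension (a common parameterization, not something stated above); `kl_to_standard_normal` is a hypothetical helper name.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

# A posterior that already equals the prior costs nothing.
at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Shifting the mean away from zero is penalized.
shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
# Collapsing the variance toward zero (memorizing a point) is penalized more.
collapsed = kl_to_standard_normal([0.0, 0.0], [-4.0, -4.0])

print(at_prior, shifted, collapsed)
```

The penalty on collapsed variance is what discourages the encoder from mapping each input to an isolated point.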
Merits

The natural measure of distribution difference. Has a clean information-theoretic interpretation (extra bits needed to encode samples from p using a code optimized for q).

Foundational for many modern training paradigms — VAEs, distillation, RLHF — all build on KL.

Demerits

Asymmetric. KL(p‖q) penalizes q being small where p is large, but not the reverse. "Reverse KL" KL(q‖p) is sometimes used for different behavior — choosing one over the other has real consequences for what the model learns.

Undefined when q has zero probability where p has positive probability. Numerical care required.
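Both demerits are easy to see numerically. The sketch below is a plain-Python illustration with made-up distributions; returning infinity for the zero-probability case is one convention chosen for this example.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
forward = kl_divergence(p, q)                   # KL(p || q)
reverse = kl_divergence(q, p)                   # KL(q || p), a different number
blown_up = kl_divergence(p, [0.8, 0.2, 0.0])    # q misses an outcome p allows

print(forward, reverse, blown_up)
```

In practice, implementations avoid the infinite case by working with log-probabilities from a softmax, which are never exactly zero.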

§ 22Label smoothing

Label smoothing isn't a new loss function; it's a modification of the targets used inside cross-entropy. Instead of one-hot labels (1 for the true class, 0 for all others), it mixes the one-hot vector with a uniform distribution over the K classes: the true class gets 1 − ε + ε/K and every other class gets ε/K. The model is no longer asked to be 100% confident in any prediction.

target_smoothed = (1 − ε) · onehot + ε · uniform Typical ε is 0.1. Effectively penalizes overconfidence, improves calibration.
Used in Inception-v3 / ImageNet Transformer (Vaswani et al.) Most modern image classifiers Machine translation
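A small sketch of the target construction, following the formula above with `uniform` taken to mean 1/K for every class. `smooth_targets` is a hypothetical helper written for illustration.

```python
def smooth_targets(true_class: int, num_classes: int, eps: float = 0.1):
    """Mix a one-hot target with the uniform distribution over all classes."""
    uniform = 1.0 / num_classes
    return [
        (1.0 - eps) * (1.0 if k == true_class else 0.0) + eps * uniform
        for k in range(num_classes)
    ]

targets = smooth_targets(true_class=2, num_classes=5, eps=0.1)
print(targets)  # still sums to 1, but no entry is exactly 0 or 1
```

With K = 5 and ε = 0.1, the true class receives 0.9 + 0.02 and each other class receives 0.02.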
Merits

Improves model calibration — predicted probabilities better match true frequencies. Useful when downstream decisions depend on confidence, not just argmax.

Acts as a regularizer, often improving validation accuracy. Standard in transformer training since the original "Attention Is All You Need" paper.

Demerits

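Caps the confidence the loss rewards: the loss is minimized at the smoothed target rather than at probability 1.0, so the model is discouraged from near-certain predictions even when they are warranted. Slightly harms top-1 accuracy in rare cases.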

Less effective when combined with knowledge distillation, which already provides soft targets.

Group C
Embedding & contrastive losses
When the goal is to learn similarity, not classification. The training signal compares pairs (or sets) of examples rather than examples to labels.

§ 23Triplet loss

The classic embedding-learning loss. A triplet consists of an anchor, a positive example (similar to the anchor) and a negative (dissimilar). The loss pushes the anchor-positive distance below the anchor-negative distance by at least a margin α. After training, similar items cluster together in embedding space and dissimilar ones are far apart.

L = max(0, d(a, p) − d(a, n) + α) d is usually Euclidean distance. α is the margin (typically 0.2 for normalized embeddings).
Used in FaceNet face recognition Image retrieval Person re-identification Speaker verification
Example use — face recognition
FaceNet trains a 128-dimensional embedding such that two photos of the same person are closer in embedding space than photos of different people. After training, identifying a person reduces to a nearest-neighbor lookup in the embedding database — no classifier needed, and new identities can be added without retraining.
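The formula is short enough to sketch directly. The 2-D vectors below are toy stand-ins for real embeddings, and the margin of 0.2 is the typical value quoted above.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p) - d(a, n) + margin) with Euclidean distance."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor, positive = [0.0, 0.0], [0.1, 0.0]
# Easy triplet: the negative is already far away, so the loss (and gradient) is zero.
easy = triplet_loss(anchor, positive, negative=[1.0, 0.0])
# Hard triplet: the negative sits within the margin, so the loss is positive.
hard = triplet_loss(anchor, positive, negative=[0.2, 0.0])

print(easy, hard)
```

The zero loss on the easy triplet is why triplet selection matters: a batch of easy triplets teaches the model nothing.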
Merits

Produces general-purpose embeddings — once trained, the same network can score similarity for any pair, including for classes never seen during training (zero-shot recognition).

The margin parameter gives intuitive control over how strongly to separate classes.

Demerits

Triplet selection is critical and difficult. Easy triplets (where the model already satisfies the margin) contribute zero gradient. Hard negative mining — finding triplets that violate the margin — is a research problem in itself.

Quadratic or cubic explosion in the number of possible triplets. Has been largely superseded by batch-level losses like InfoNCE.

§ 24Contrastive loss (pairwise)

The predecessor to triplet loss. Operates on pairs labeled as "similar" or "dissimilar". For similar pairs, minimize distance directly. For dissimilar pairs, push apart up to a margin — beyond that, contribute no gradient.

L = y · d² + (1 − y) · max(0, m − d)² y = 1 for similar pair, 0 for dissimilar. m is the margin.
Used in Siamese networks Signature verification Early metric learning
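A direct transcription of the pairwise formula into plain Python, as a sketch. The distances and the margin of 1.0 are illustrative values.

```python
def contrastive_loss(distance: float, similar: bool, margin: float = 1.0) -> float:
    """y * d^2 + (1 - y) * max(0, m - d)^2 for a single pair."""
    if similar:
        return distance ** 2                  # pull similar pairs together
    return max(0.0, margin - distance) ** 2   # push dissimilar pairs out to the margin

similar_far = contrastive_loss(0.5, similar=True)       # penalized for being apart
dissimilar_near = contrastive_loss(0.5, similar=False)  # penalized for being close
dissimilar_far = contrastive_loss(1.5, similar=False)   # beyond the margin, no loss

print(similar_far, dissimilar_near, dissimilar_far)
```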
Merits

Conceptually simple — just pairs. Easier to construct training batches than triplets, especially in semi-supervised settings.

Demerits

Doesn't enforce a margin between classes — only an absolute distance threshold for dissimilar pairs. Triplet loss provides better gradients for cluster separation.

Largely superseded by triplet and InfoNCE losses.

§ 25InfoNCE — the modern contrastive loss

InfoNCE, a loss built on noise-contrastive estimation (NCE) and introduced with Contrastive Predictive Coding in 2018, is what made contrastive learning explode in 2020. Instead of working with single triplets, it treats one batch as a giant classification problem: given an anchor, pick out the one correct positive from a set that also contains N − 1 negatives. The "loss" is just multi-class cross-entropy where the classes are the items in the batch and the logits are similarities.

L_NCE = −log( exp(sim(a, p)/τ) / Σⱼ exp(sim(a, xⱼ)/τ) ) sim is usually cosine similarity. τ is a temperature controlling sharpness. Negatives come from the rest of the batch.
Used in SimCLR (image self-supervised learning) MoCo CLIP (image-text) DINO Sentence embedding models (SBERT) Retrieval-augmented generation embeddings
Example use — CLIP
CLIP trains on 400M image-text pairs scraped from the internet. Each batch of N pairs becomes an N × N similarity matrix: row i, column j is the cosine similarity between image i's embedding and text j's embedding. The loss is cross-entropy with the diagonal as the correct match, applied in both directions and averaged: image i should pair with text i rather than any of the N − 1 other texts in the batch, and text i with image i rather than any other image. This single objective produces an embedding space where text and image align without any direct supervision of what they mean.
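The similarity-matrix view can be sketched in a few lines of plain Python. This is an illustration, not CLIP's implementation: the 3 × 3 matrices are invented, the function names are hypothetical, and the temperature of 0.07 is just a commonly used starting value. Averaging the row-wise and column-wise losses mirrors CLIP's published pseudocode.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_norm for v in row]

def info_nce(sim, temperature=0.07):
    """Mean cross-entropy over rows; row i's correct class is column i."""
    losses = [-log_softmax([s / temperature for s in row])[i]
              for i, row in enumerate(sim)]
    return sum(losses) / len(losses)

def symmetric_info_nce(sim, temperature=0.07):
    """Average of the row-wise (image->text) and column-wise (text->image) losses."""
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (info_nce(sim, temperature) + info_nce(transposed, temperature))

# Toy cosine similarities: strong diagonal vs. no signal at all.
aligned = [[1.0, 0.1, 0.0], [0.2, 0.9, 0.1], [0.0, 0.1, 0.8]]
uninformative = [[0.5] * 3 for _ in range(3)]

loss_aligned = symmetric_info_nce(aligned)
loss_uninformative = symmetric_info_nce(uninformative)  # equals log(N)
print(loss_aligned, loss_uninformative)
```

A matrix with no signal gives a loss of log N, the cross-entropy of a uniform guess over the batch, which is a handy sanity check at the start of training.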
Merits

Batch-level: every other example in the batch automatically becomes a negative, eliminating triplet-selection problems. Larger batches → harder negatives → better learning.

Works completely self-supervised — augmentations of the same image are "positives", everything else is negative. SimCLR showed this rivals supervised pretraining.

Foundation of modern multimodal models (CLIP) and text embedding models (E5, BGE, OpenAI embeddings).

Demerits

Requires very large batch sizes (typically 1024-32768) for effective negative sampling. Memory-hungry; needs gradient accumulation tricks or special infrastructure.

Sensitive to temperature τ. Sensitive to negative composition — false negatives (semantically similar items treated as dissimilar) hurt learning.

Group D
Sequence & LLM training losses
From pretraining a transformer to aligning a chatbot. Each stage of LLM training has its own loss function.

§ 26Next-token cross-entropy — the pretraining loss

The objective that powers every autoregressive language model. At each position in a sequence, the model predicts a probability distribution over the vocabulary for what comes next. The loss is cross-entropy between that distribution and the actual next token. Average across the entire sequence, average across the entire dataset, and you have the pretraining loss for GPT-2, GPT-3, GPT-4, LLaMA, Mistral, and every other autoregressive LLM.

L = −(1/T) · Σ_{t=1}^{T} log p_θ(x_t | x_{<t}) Average negative log-likelihood of each token given all previous tokens.

This is just categorical cross-entropy applied auto-regressively. What makes it powerful is the scale — train on trillions of tokens with this single objective, and the model learns grammar, facts, reasoning patterns, code, and arguably much more, all from predicting the next token.

Used in GPT family (1-4+) LLaMA family Mistral, Mixtral Claude, Gemini pretraining All autoregressive LLMs
Example use
Given the sequence "The capital of France is", a pretrained model outputs a distribution where "Paris" has high probability, "Lyon" has lower probability, and "banana" has near-zero probability. The loss is −log P(Paris | context). Across 13T training tokens, billions of such updates carve out the model's knowledge of geography, syntax, and everything else.
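    # In practice: shift labels by one position
    import torch.nn.functional as F

    logits = model(input_ids)                         # [B, T, V]
    shift_logits = logits[:, :-1, :].contiguous()     # predictions at positions 0..T-2
    shift_labels = input_ids[:, 1:].contiguous()      # targets are the following tokens
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)), # [B*(T-1), V]
        shift_labels.view(-1),                        # [B*(T-1)]
    )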
Merits

Self-supervised — no labels required. Any text becomes training data, which is why LLMs can scale to trillions of tokens of web text.

Single simple objective produces extraordinarily general capabilities. Next-token pretraining is one of the clearest illustrations of the "bitter lesson": scale plus a simple, general objective tends to beat sophisticated handcrafted methods.

Maximum likelihood estimation, statistically principled.

Demerits

Teacher forcing during training — at every position the model conditions on the true previous tokens, not its own predictions. This creates an "exposure bias" between training and inference where errors compound.

Loss treats all tokens equally, including filler tokens and stop words that carry little information. Better importance weighting is an open research area.

Doesn't directly optimize what we care about (helpful, harmless responses) — only the likelihood of the training distribution.

§ 27Masked language modeling loss (BERT)

BERT took a different approach. Instead of predicting the next token, it randomly masks 15% of tokens and asks the model to recover them, conditioning on the rest of the sequence — both left and right context. This is what makes BERT bidirectional, and what makes it good at understanding tasks rather than generation tasks.

L_MLM = −Σ_{i ∈ masked} log p_θ(x_i | x_{\masked}) Only masked positions contribute to the loss. Unmasked positions just provide context.

The 15% masking ratio was chosen empirically. Of those 15%, the original paper replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged — a hack to reduce train-test mismatch since [MASK] never appears at inference.

Used in BERT RoBERTa DeBERTa DistilBERT Most encoder-only models
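The 80/10/10 corruption rule described above can be sketched as follows. This is a simplified illustration: it selects each position independently with probability 0.15 rather than reproducing BERT's exact sampling, and `mask_tokens`, the toy vocabulary, and the seed are all assumptions made for the example.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Return (corrupted inputs, labels); labels are None where no loss is computed."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                         # position not selected: context only
        labels[i] = tok                      # selected: model must recover this token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_token           # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)    # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
inputs, labels = mask_tokens(tokens, vocab)
print(inputs, labels)
```

Only positions with a non-None label enter the loss, which is the source of the sample-efficiency gap discussed in the demerits.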
Merits

Bidirectional context — every position sees both left and right context, unlike autoregressive models. Better for understanding tasks (classification, NER, QA).

Self-supervised like autoregressive pretraining; any text is training data.

Demerits

Only 15% of positions produce loss signal per batch, so each sequence yields roughly 6-7× fewer training targets than autoregressive training, where every position is predicted. By that crude count, MLM needs several times more compute to see the same number of targets.

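Can't directly generate text without specialized decoding. For generative tasks, encoder-only models have been largely displaced by decoder-only autoregressive models since about 2022, though encoders remain common for classification and retrieval.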

§ 28Perplexity

Perplexity isn't a training loss; it's the most common evaluation metric for language models, and it's just the exponential of the cross-entropy loss. If cross-entropy is the average surprise per token (in nats with the natural log, in bits with log base 2), perplexity is the effective number of equally likely tokens the model is choosing among at each step. Lower is better.

PPL = exp(L_CE) = exp(−(1/T) · Σ_t log p(x_t | x_{<t})) A model with PPL = 10 is on average choosing among an effective 10 tokens per position. Uniform over 50k vocab would give PPL = 50,000.
Used in LM evaluation Comparing model checkpoints Reporting in LM papers
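The two reference points in the formula caption can be checked with a short sketch. The probabilities are hypothetical per-token values assigned to the correct next token; `perplexity` is an illustrative helper.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the true tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# A model that always gives the right token probability 0.1.
ppl_ten = perplexity([0.1] * 8)
# A uniform guess over a 50,000-token vocabulary.
ppl_uniform = perplexity([1 / 50_000] * 4)
# Mixed confidence: perplexity is the geometric mean of 1/p.
ppl_mixed = perplexity([0.5, 0.25, 0.125])

print(ppl_ten, ppl_uniform, ppl_mixed)
```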
Merits

Intuitive interpretation: the "effective number of choices" reading gives a direct way to compare models. A drop from PPL 20 to PPL 10 halves the effective number of candidates per token, which is one bit less uncertainty per token.

Reported throughout the LM literature, which makes comparison straightforward when models share a tokenizer and test set.

Demerits

Doesn't measure what users care about. Lower perplexity correlates with helpfulness, factuality, and reasoning, but the correlation is loose at the top end. Two models with identical perplexity can have very different chat quality.

Sensitive to tokenization. Two models with different tokenizers can't be compared directly via perplexity.

§ 29Knowledge distillation loss

Distillation transfers knowledge from a large "teacher" model to a smaller "student" model. The student learns to match the teacher's softmax distribution, not just the hard target labels. The teacher's "dark knowledge" — the relative probabilities it assigns to non-target classes — turns out to be more informative than the labels alone.

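L = α · L_CE(student, true labels) + (1 − α) · T² · KL(teacher_T ‖ student_T) T is a temperature applied to both softmaxes (typically 2-10). Higher T reveals more of the teacher's "soft" structure; the T² factor keeps the soft-target gradients on a comparable scale as T changes. The teacher's distribution is the target, so it goes in the first argument of the KL.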
Used in DistilBERT (40% smaller, 97% of BERT) TinyBERT MobileBERT Smaller LLM variants (e.g., Llama distillations) Gemma teacher-student training
Example use
DistilBERT is trained to match BERT's output distribution at temperature 2.0. The same idea in a classification setting: for a sentence the teacher classifies as 70% positive, 25% neutral, 5% negative, the student is trained to produce the same distribution, not just to predict "positive". The student learns that positive and neutral are similar for this input (because the teacher said so), information that the hard label discards.
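A plain-Python sketch of the combined loss for a single example. It takes the KL term with the teacher as the target distribution, KL(teacher ‖ student), which is the direction used in Hinton et al.'s formulation; the logits, α = 0.5, and T = 2 are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_class,
                      alpha=0.5, temperature=2.0):
    # Hard-label term: ordinary cross-entropy at temperature 1.
    hard = -math.log(softmax(student_logits)[true_class])
    # Soft-label term: KL(teacher || student), both softened by the temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft

teacher = [2.0, 1.0, 0.1]   # hypothetical logits: positive, neutral, negative
matched = distillation_loss([2.0, 1.0, 0.1], teacher, true_class=0)
mismatched = distillation_loss([0.1, 1.0, 2.0], teacher, true_class=0)
print(matched, mismatched)
```

When the student reproduces the teacher's logits the KL term vanishes and only the hard-label term remains.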
Merits

Produces smaller, faster models with much of the teacher's performance. DistilBERT keeps about 97% of BERT's GLUE score with 40% fewer parameters.

Soft targets carry more information than hard labels — the teacher's probability distribution acts as a richer supervision signal.

Demerits

Requires a strong teacher already trained. Two hyperparameters (α, T) to tune. The student inherits the teacher's biases and errors.

Distillation gains diminish as the student approaches the teacher's capacity — you can't make an arbitrary student match an arbitrary teacher.

§ 30SFT — supervised fine-tuning

After pretraining produces a base language model that knows how to predict text, supervised fine-tuning teaches it to follow instructions. The data is curated: humans write (or curate) example prompts and ideal responses. The loss is the same next-token cross-entropy as pretraining, but applied only to the response tokens, not the prompt.

L_SFT = −(1/|response|) · Σ_{t ∈ response} log p_θ(x_t | x_{<t}) Cross-entropy masked to response tokens only. Loss on the prompt is set to zero.
Used in InstructGPT ChatGPT, Claude, Gemini fine-tuning LLaMA-Chat, Mistral-Instruct All instruction-tuned LLMs
Example use
A pretrained base model continues "What is the capital of France?" with something like "What is the capital of Germany? What is..." (continuing the pattern). After SFT on instruction-response pairs, it answers "The capital of France is Paris." The loss is computed only on the answer; the model already knows how to read prompts.
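The masking itself is simple, as this sketch shows. The per-token log-probabilities are invented for illustration, and `sft_loss` is a hypothetical helper rather than a library function.

```python
def sft_loss(token_log_probs, is_response):
    """Mean negative log-likelihood over response tokens only."""
    picked = [lp for lp, keep in zip(token_log_probs, is_response) if keep]
    return -sum(picked) / len(picked)

# Three prompt tokens followed by two response tokens (hypothetical values).
log_probs = [-5.0, -4.0, -6.0, -0.5, -1.5]
mask = [False, False, False, True, True]
loss = sft_loss(log_probs, mask)

# Changing how well the model predicts the prompt does not change the loss.
loss_other_prompt = sft_loss([-0.1, -0.1, -0.1, -0.5, -1.5], mask)
print(loss, loss_other_prompt)
```

In PyTorch the same effect is commonly obtained by setting prompt positions in the label tensor to -100, the default `ignore_index` of `F.cross_entropy`.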
Merits

The simplest and most effective way to turn a base model into an instruction-follower. A few thousand high-quality examples can dramatically change model behavior.

Same loss as pretraining — no new training infrastructure needed.

Demerits

Only as good as the demonstrations. The model learns to mimic the surface style of the responses, not the underlying preferences that produced them.

Doesn't capture preferences over multiple possible responses — humans rarely write the single "best" answer, just a reasonable one.

§ 31Reward model loss (Bradley-Terry)

To go beyond SFT, modern alignment uses human preferences. Annotators are shown two model responses to the same prompt and asked which is better. The reward model is trained to assign higher scalar scores to preferred responses than to dispreferred ones. The loss is derived from the Bradley-Terry preference model — under which the probability that response y₁ is preferred over y₂ is the sigmoid of their reward difference.

L_RM = −log σ(r_θ(x, y_w) − r_θ(x, y_l)) y_w is the preferred ("winning") response, y_l is the dispreferred ("losing") response.
Used in InstructGPT reward model RLHF for ChatGPT, Claude, Gemini Preference modeling research LLaMA 2-Chat preference models
Example use
Given a prompt and two responses, the reward model outputs scalar scores 0.82 and 0.31. The training signal: σ(0.82 − 0.31) = σ(0.51) ≈ 0.625. Since the human said y₁ was preferred, the loss is −log(0.625) ≈ 0.47, pulling reward(y₁) up and reward(y₂) down.
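The arithmetic in this example can be reproduced with a short sketch. The function name is illustrative; the loss is written in its softplus form, which is algebraically the same as −log σ(·) but avoids overflow for large score gaps.

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected), computed stably."""
    d = r_chosen - r_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

loss = reward_model_loss(0.82, 0.31)         # the worked example above
prob_chosen = math.exp(-loss)                # sigmoid(0.51), the modeled preference
tie = reward_model_loss(0.5, 0.5)            # equal scores: a coin flip, loss = log 2
wrong_order = reward_model_loss(0.31, 0.82)  # ranking reversed: larger loss

print(round(prob_chosen, 3), round(loss, 2), tie, wrong_order)
```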
Merits

Captures relative preferences without requiring annotators to produce absolute scores — much easier to collect reliably.

Once trained, the reward model becomes a stand-in for human judgment at scale, enabling RL fine-tuning on millions of generated responses.

Demerits

Reward hacking — the policy can find ways to score high reward that don't actually align with what humans want. The reward model is an imperfect proxy.

Annotator disagreement is high, and the Bradley-Terry assumption (transitive, scalar utility) often doesn't hold.

§ 32PPO — Proximal Policy Optimization (RLHF)

With a reward model in hand, the next step is to fine-tune the language model to produce responses that score high under the reward model — without drifting too far from the original SFT model and losing fluency. PPO is the policy gradient algorithm that achieves this. The objective combines the expected reward with a KL penalty against a frozen reference (the SFT model) to prevent reward hacking.

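L_PPO = 𝔼[min(r_t(θ)·A_t, clip(r_t(θ), 1−ε, 1+ε)·A_t)] − β · KL[π_θ ‖ π_ref] r_t(θ) is the policy ratio π_θ / π_old, A_t is the advantage, clip prevents huge updates, β controls the KL penalty. Written this way it is an objective to maximize; implementations minimize its negative.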

The clipping is what makes PPO "proximal": by clipping the policy ratio to [1 − ε, 1 + ε], it removes any incentive to move far from the policy that generated the samples, which prevents catastrophic updates that would take the model away from where it can still produce coherent text. The KL term on top of that ensures the policy stays in a neighborhood of the SFT model.

Used in InstructGPT ChatGPT (original alignment) Claude (Anthropic's RLAIF variant) LLaMA 2-Chat Most pre-2024 RLHF pipelines
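The clipped term of the objective, for a single sample, can be sketched as below. This covers only the min/clip part (not the KL penalty or advantage estimation), and ε = 0.2 is a commonly used value assumed here for illustration.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for one sample."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action, policy already moved a lot toward it: the gain is capped.
capped_gain = clipped_surrogate(ratio=1.5, advantage=1.0)
# Bad action, policy already moved away from it: no extra credit past the clip.
capped_avoidance = clipped_surrogate(ratio=0.5, advantage=-1.0)
# Bad action that the policy made MORE likely: the full penalty applies, unclipped.
full_penalty = clipped_surrogate(ratio=1.5, advantage=-1.0)

print(capped_gain, capped_avoidance, full_penalty)
```

Taking the minimum makes the bound pessimistic: clipping only ever removes incentive to move further, it never hides a penalty.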
Merits

The proven workhorse of LLM alignment from 2022-2023. Empirically effective at improving helpfulness and harmlessness above what SFT achieves alone.

The KL penalty against the reference policy prevents the model from degenerating into reward-hacking gibberish.

Demerits

Complex pipeline: requires a trained reward model, a frozen reference policy, an actor model, and often a value/critic model — four model instances at once. Memory and compute hungry.

Many hyperparameters to tune (KL coefficient, clip range, learning rate, number of PPO epochs). Unstable training, sensitive to reward model quality.

Largely being replaced by DPO and related methods that achieve similar results without the RL machinery.

§ 33DPO — Direct Preference Optimization

The breakthrough paper of 2023. DPO showed that the RLHF pipeline after SFT (train a reward model → run PPO with a KL penalty against the reference) optimizes an objective whose solution can be reached with a single supervised loss directly on preference pairs. The reward model is implicit, encoded in the policy itself. No reward training, no rollouts, no PPO: just a single-stage loss almost as simple as SFT.

L_DPO = −log σ( β · log(π_θ(y_w|x) / π_ref(y_w|x)) − β · log(π_θ(y_l|x) / π_ref(y_l|x)) ) π_θ is the trainable policy, π_ref the frozen SFT model. β controls deviation from reference.

The intuition: every preference pair (chosen, rejected) provides a signal. The loss increases the relative likelihood the policy assigns to chosen responses and decreases it for rejected ones — but always weighted against what the reference model thought, so the policy can't drift too far.

Used in Zephyr-7B Tülu 2 Mistral preference variants LLaMA 3 preference fine-tuning Most open-source preference-tuned models since 2024
Example use
A preference dataset has 60k pairs of chat responses. The DPO loop is essentially: take a batch of pairs, compute log-probabilities of chosen and rejected responses under both the current policy and the frozen reference, plug into the formula above, backprop. No reward model, no PPO, no rollouts. The memory footprint is roughly that of SFT plus a frozen reference model, whereas PPO keeps up to four models in memory and generates rollouts during training.
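For one preference pair, the loop body reduces to a few lines. The sketch below assumes the four sequence log-probabilities (summed over response tokens) have already been computed; the numbers and the function name are illustrative.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * [(chosen log-ratio) - (rejected log-ratio)])."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # softplus(-margin) == -log(sigmoid(margin)), written to avoid overflow
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# At initialization the policy equals the reference: margin 0, loss log 2.
at_init = dpo_loss(-42.0, -40.0, ref_chosen=-42.0, ref_rejected=-40.0)
# After training, the policy has raised the chosen response relative to the reference
# and lowered the rejected one.
after = dpo_loss(-38.0, -45.0, ref_chosen=-42.0, ref_rejected=-40.0)

print(at_init, after)
```

Only differences from the reference enter the loss, so a response the reference already considers likely earns no credit for staying likely.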
Merits

Dramatically simpler than PPO: one model to train, no rollouts, no reward model. Approximately matches PPO's quality in many reported comparisons while being substantially cheaper.

Stable training without RL's sensitivity to hyperparameters. Standard cross-entropy-flavored loss that any researcher can debug.

Has rapidly become the default preference-tuning method in open-source.

Demerits

The reference model is fixed during training, which can over-constrain the policy. Several extensions (DPOP, KTO) address this.

Can produce models that are overly confident on training preferences but degrade on out-of-distribution prompts. The implicit reward is not separately validated.

The β hyperparameter is critical and dataset-dependent.

§ 34KTO, ORPO, IPO — the DPO family

DPO inspired an entire family of follow-ups, each addressing some weakness of the original.

KTO — Kahneman-Tversky Optimization

DPO requires preference pairs (a chosen and rejected response for the same prompt). KTO works with single labeled examples — each response is just marked "good" or "bad", no pair required. It uses prospect theory (Kahneman-Tversky) to model human utility asymmetrically: losses hurt more than equivalent gains. This is more data-efficient when you have abundant binary feedback rather than pairwise preferences.

L_KTO = w · (1 − σ(β · (log(π_θ(y|x)/π_ref(y|x)) − z₀))) for a desirable y; for an undesirable y the argument of σ flips sign. w depends on whether y is desirable or undesirable. z₀ is a reference value (the running KL).

ORPO — Odds Ratio Preference Optimization

ORPO combines SFT and preference optimization into a single stage. The loss is SFT plus a small odds-ratio penalty that pushes the model away from dispreferred responses. By doing both at once, ORPO skips the SFT-then-DPO two-stage pipeline entirely.

L_ORPO = L_SFT(y_w) − λ · log σ( log(p(y_w)/(1−p(y_w))) − log(p(y_l)/(1−p(y_l))) ) Combines next-token CE on the winner with an odds-ratio penalty against the loser.

IPO — Identity Preference Optimization

DPO's loss has been shown to overfit on preference pairs — it can push toward extreme probability assignments. IPO replaces the log-sigmoid with a squared difference, producing a "smoother" loss that's more resistant to overfitting on noisy preference data.

L_IPO = ( log(π_θ(y_w)/π_ref(y_w)) − log(π_θ(y_l)/π_ref(y_l)) − 1/(2β) )² Quadratic penalty instead of log-sigmoid; the finite target margin 1/(2β) stops the log-ratio gap from being pushed toward infinity.
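The difference between the two loss shapes shows up when both are written as functions of the same quantity h, the gap between the chosen and rejected log-ratios. This is a sketch with β = 0.1 chosen for illustration; the function names are hypothetical.

```python
import math

def dpo_loss_from_gap(h: float, beta: float = 0.1) -> float:
    """-log sigmoid(beta * h): keeps decreasing as the gap h grows."""
    m = beta * h
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))

def ipo_loss_from_gap(h: float, beta: float = 0.1) -> float:
    """(h - 1/(2*beta))^2: minimized at a finite gap, grows on either side."""
    return (h - 1.0 / (2.0 * beta)) ** 2

gaps = [0.0, 5.0, 50.0]
dpo_values = [dpo_loss_from_gap(h) for h in gaps]
ipo_values = [ipo_loss_from_gap(h) for h in gaps]
print(dpo_values)  # strictly decreasing: always rewards a larger gap
print(ipo_values)  # smallest at h = 1/(2*beta) = 5, penalizes overshooting
```

DPO always has some incentive to widen the gap, which is where overfitting on noisy pairs comes from; IPO stops pushing once the target margin is reached.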
Collective strengths

The family offers a spectrum of trade-offs: KTO for binary feedback, ORPO for unified training, IPO for noisy preferences. All inherit DPO's simplicity over PPO.

Easier to ablate and compare than RL methods, accelerating empirical progress in alignment.

Collective weaknesses

No single winner has emerged. Empirical performance varies by dataset, base model, and evaluation method.

All inherit DPO's fundamental limitation: a frozen reference policy. Long training runs can hit a wall where the reference is too far from where the policy needs to go.

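Demo 06 · DPO vs IPO loss shape
Interactive plot: DPO loss and IPO loss against the implicit reward margin (zero margin marked), with a variant selector and a β slider (default 0.1) that controls how aggressively to depart from the reference.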
Group E
Task recipes
Picking the right activation + loss combination for the task at hand. The four recipes most supervised projects come back to.

§ 35Binary classification

When the target is a single 0/1 label — spam or not spam, fraud or not, click or no click, malignant or benign — the standard recipe is one output logit, a sigmoid activation, and binary cross-entropy loss. The model produces a real number; sigmoid squashes it into (0, 1); BCE penalizes the gap between that probability and the true label.

Output: 1 logit  ·  Activation: σ(z)  ·  Loss: BCE(ŷ, y) Gradient through sigmoid + BCE simplifies to (ŷ − y) — clean, non-vanishing.
Examples Spam detection Fraud / anomaly CTR prediction Medical diagnosis (single condition) Churn prediction
Why this combination works
The gradient of BCE with respect to the logit is exactly ŷ − y. No saturation, no vanishing — even when the model is wildly wrong, the gradient is bounded and points in the right direction. This pairing is so robust that it's the default for every binary classifier in modern deep learning.
    # PyTorch: use the with-logits version for numerical stability
    import torch
    import torch.nn.functional as F

    logits = model(x)                       # shape [B, 1] or [B]; must match target's shape
    loss = F.binary_cross_entropy_with_logits(logits, target.float())

    # For class imbalance, use pos_weight (a one-element tensor here):
    pos_weight = torch.tensor([num_neg / num_pos])
    loss = F.binary_cross_entropy_with_logits(
        logits, target.float(), pos_weight=pos_weight
    )
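The claim that the gradient with respect to the logit is exactly ŷ − y can be checked numerically. The sketch below is plain Python with a central finite difference; the test points are arbitrary and the helper names are illustrative.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logit(z: float, y: float) -> float:
    """BCE computed from the logit: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def numeric_grad(z: float, y: float, h: float = 1e-6) -> float:
    """Central finite-difference estimate of d(BCE)/dz."""
    return (bce_with_logit(z + h, y) - bce_with_logit(z - h, y)) / (2 * h)

# (logit, label, numeric gradient, analytic gradient sigmoid(z) - y)
checks = [(z, y, numeric_grad(z, y), sigmoid(z) - y)
          for z in (-6.0, -0.5, 0.0, 2.0, 6.0) for y in (0.0, 1.0)]
for z, y, numeric, analytic in checks:
    print(f"z={z:+.1f} y={y:.0f} numeric={numeric:+.6f} analytic={analytic:+.6f}")
```

Even at z = −6 with y = 1, where the sigmoid itself is nearly flat, the gradient stays close to −1: the log in BCE cancels the sigmoid's saturation.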
Variations

Class imbalance: pass pos_weight to upweight rare positives. For severe imbalance (> 100:1), switch to focal loss instead.

Calibration: after training, fit a temperature scalar on a validation set to correct over- or under-confidence.

Common pitfalls

Don't use softmax over 2 classes. It doubles the output parameters for no benefit and produces identical predictions.

Don't apply sigmoid before BCE manually; use the fused _with_logits version. A manual sigmoid followed by the log inside BCE loses precision once the sigmoid saturates to exactly 0 or 1 in floating point.

§ 36Multi-class classification (one label per example)

When each example has exactly one label out of K mutually exclusive options — digit recognition out of 10, ImageNet out of 1,000, sentiment out of 5 — the recipe is K output logits, a softmax activation, and categorical cross-entropy. The softmax forces the K probabilities to sum to 1, so the model is implicitly told "exactly one of these is true."

Output: K logits  ·  Activation: softmax(z)  ·  Loss: −log ŷ_true Gradient through softmax + CE is softmax(z) − onehot(y): clean for every K.
Examples MNIST, ImageNet Sentiment (5-class) Next-token prediction Language identification Action selection in RL
Why this combination works
Softmax couples the K logits together — pushing one up automatically pulls the others down through the normalization. Combined with cross-entropy, the gradient is the difference between predicted and target distributions: clean, well-scaled, and stable across vocabulary sizes from 2 to 100,000.
    # PyTorch's cross_entropy fuses log-softmax + NLL for stability.
    # Pass raw logits and INTEGER class indices (not one-hot).
    import torch
    import torch.nn.functional as F

    logits = model(x)                                  # shape [B, K]
    loss = F.cross_entropy(logits, target_class_idx)   # target shape [B], dtype long

    # With label smoothing (built into PyTorch >= 1.10):
    loss = F.cross_entropy(logits, target_class_idx, label_smoothing=0.1)

    # With class weights for imbalance (one weight per class, here K = 5):
    weights = torch.tensor([1.0, 2.5, 1.0, 3.0, 1.0])
    loss = F.cross_entropy(logits, target_class_idx, weight=weights)
Variations

Label smoothing (ε = 0.1 is typical) improves calibration and acts as a mild regularizer; its effect on top-1 accuracy is small and often slightly positive. Standard in transformer training.

Mixup / CutMix blend pairs of inputs and their targets; soft targets become natural and the model learns smoother decision boundaries.

Distillation replaces hard labels with a teacher model's soft probability distribution — see §29.

Common pitfalls

Don't use this when labels aren't exclusive. If a single example can have multiple labels (image with both "cat" and "dog"), softmax forces them to compete. You want multi-label (§37) instead.

Don't apply softmax before cross_entropy manually. Use raw logits — PyTorch's cross_entropy applies log-softmax internally with numerical safety.

§ 37Multi-label classification

When each example can carry any subset of K labels — a photo simultaneously tagged "beach", "sunset", and "palm tree"; a paper that is both "ML" and "optimization"; a movie spanning multiple Netflix genres — the recipe is K independent logits, a sigmoid per class, and binary cross-entropy summed across classes. The crucial difference from multi-class is that labels do not compete: the model can be fully confident in three labels at once without that confidence having to come from somewhere.

Output: K independent logits  ·  Activation: σ(z_k) per class
Loss: Σ_k BCE(ŷ_k, y_k)  [averaged across K and batch] This is K independent binary classifiers sharing a feature extractor. NOT softmax.
Examples Image tagging (Flickr, Pinterest) Document categorization Genre prediction Audio event detection Attribute prediction (CelebA)
Why sigmoid and not softmax
If a photo really shows both a beach and a sunset, softmax would force the model to split its confidence between the two — being more confident about "beach" would mean being less confident about "sunset". With independent sigmoids, the model can freely assert both. This is the single most important distinction between multi-class and multi-label.
    # Target is a multi-hot vector, e.g. [1, 0, 1, 1, 0] for K=5
    import torch
    import torch.nn.functional as F

    logits = model(x)                                        # shape [B, K]
    loss = F.binary_cross_entropy_with_logits(logits, target_multihot.float())

    # With per-class imbalance (each label can be rare independently):
    pos_weight = torch.tensor([3.0, 1.0, 12.0, 2.5, 1.0])    # [K]
    loss = F.binary_cross_entropy_with_logits(
        logits, target_multihot.float(), pos_weight=pos_weight
    )

    # At inference, threshold each sigmoid output independently:
    preds = (torch.sigmoid(logits) > 0.5).long()             # shape [B, K] of 0/1
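The competition effect is easy to see on a single set of logits. In this plain-Python sketch the three logits are invented: the first two labels ("beach", "sunset") both have strong evidence, the third does not.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 4.0, -4.0]   # hypothetical scores: beach, sunset, snow

as_multiclass = softmax(logits)                 # forced to share one unit of probability
as_multilabel = [sigmoid(z) for z in logits]    # each label judged on its own

print([round(p, 3) for p in as_multiclass])
print([round(p, 3) for p in as_multilabel])
```

Softmax can give neither of the two present labels more than about half the probability mass, while the independent sigmoids assert both with high confidence.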
Variations

Asymmetric loss (ASL): different exponents for positive and negative class terms — handles severe per-label imbalance better than vanilla BCE. Used in ImageNet-21k multi-label setup.

Focal loss per class: apply the focal modulation (§19) to each sigmoid output independently.

Hierarchical labels: if the K labels form a tree (genre → sub-genre), use hierarchical softmax instead of flat BCE.

Common pitfalls

The #1 multi-label mistake: using softmax + categorical CE. It silently makes labels compete, hurts performance noticeably, and the bug is invisible until you look at the math.

Per-class thresholding: using 0.5 for every class is rarely optimal. Tune one threshold per class on a validation set, especially under imbalance.

Metrics: accuracy is meaningless here. Use macro/micro F1, mean average precision, or per-class AUC.

§ 38Regression

When the target is a continuous real number — house price, temperature, age estimation, predicted return — the recipe is one scalar output (or D scalars for multi-target regression), no activation at all on the output (just a linear layer), and one of the regression losses from §13–§16. The choice between MSE, MAE, and Huber comes down to how outlier-prone the target distribution is.

Output: 1 (or D) scalar  ·  Activation: none (linear / identity)
Loss: MSE (default)  |  MAE (robust)  |  Huber (compromise)  |  Quantile (distributional) Pick the loss based on the outlier characteristics of your target distribution.
Examples House price prediction Age from photo Demand forecasting Bounding box coordinates Reward prediction in RL
Choosing among regression losses
MSE when residuals are roughly Gaussian and outliers aren't a concern — it's the maximum likelihood estimator under that assumption. MAE when outliers are present and you want the model to predict the conditional median rather than the mean. Huber (smooth L1) when you want quadratic behavior near zero (clean gradients) but robust linear behavior for outliers. Quantile when you need prediction intervals, not point estimates.
    # Standard regression: no activation, MSE loss
    import torch.nn.functional as F

    pred = model(x)                                   # shape [B] or [B, D]
    loss = F.mse_loss(pred, target)

    # Robust variants
    loss = F.l1_loss(pred, target)                    # MAE
    loss = F.smooth_l1_loss(pred, target)             # Huber with δ=1
    loss = F.huber_loss(pred, target, delta=1.5)      # Huber with custom δ

    # Critical: normalize targets before training, denormalize at inference
    y_mean, y_std = train_targets.mean(), train_targets.std()
    target_norm = (target - y_mean) / y_std
    # ... train on target_norm ...
    pred_real = pred * y_std + y_mean                 # denormalize for evaluation
Variations

Bounded target ∈ [0, 1]: apply sigmoid to the output and use BCE with the soft target. The prediction is guaranteed to stay in range, and in practice this works as well as linear + MSE and often better.

Strictly positive target: predict log(target) with linear output and MSE, then exponentiate at inference. Or use softplus output activation.

Heteroscedastic noise: predict (μ, log σ) pair, optimize the Gaussian negative log-likelihood. The model learns to be uncertain where the data is noisy.

Quantile regression: train K parallel heads at different quantiles to get full prediction intervals — see §16.
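The heteroscedastic variant is the least obvious of these, so here is a per-example sketch of the Gaussian negative log-likelihood with a predicted (μ, log σ) pair. The numbers are illustrative and `gaussian_nll` is a hypothetical helper.

```python
import math

def gaussian_nll(target: float, mu: float, log_sigma: float) -> float:
    """-log N(target | mu, sigma^2) with sigma = exp(log_sigma)."""
    sigma = math.exp(log_sigma)
    return log_sigma + 0.5 * ((target - mu) / sigma) ** 2 + 0.5 * math.log(2 * math.pi)

# Same prediction error of 3.0, two different claimed uncertainties.
overconfident = gaussian_nll(target=3.0, mu=0.0, log_sigma=0.0)           # sigma = 1
honest = gaussian_nll(target=3.0, mu=0.0, log_sigma=math.log(3.0))       # sigma = 3
# With no error, claiming a large sigma costs more than claiming a small one.
needlessly_vague = gaussian_nll(target=0.0, mu=0.0, log_sigma=math.log(3.0))
exact = gaussian_nll(target=0.0, mu=0.0, log_sigma=0.0)

print(overconfident, honest, needlessly_vague, exact)
```

The log σ term penalizes blanket uncertainty while the squared term penalizes confident misses, so the model is pushed to report a σ that matches its actual error.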

Common pitfalls

Not normalizing targets. With MSE, gradient scale follows the unit of the target: the same prices expressed in dollars instead of thousands of dollars give residuals and gradients 1,000× larger and a loss 10⁶× larger. Always standardize the target.

Using MSE with heavy-tailed targets. A few extreme outliers can dominate training; switch to MAE or Huber, or apply a log/Box-Cox transform first.

Forgetting that R² can go negative. R² measures improvement over a constant predictor, so on held-out data a model that does worse than predicting the mean of the evaluation targets has R² < 0. Track MAE/RMSE as well for an unambiguous error measure.
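The R² pitfall is easy to reproduce by hand. The sketch below uses invented numbers and two hypothetical helpers, `r_squared` and `rmse`, computed from their textbook definitions. It is meant only to show that R² is zero for a constant mean prediction and negative for anything worse.

```python
import math

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical held-out targets
mean_pred = [3.0] * 5                        # constant baseline: the mean
bad_pred = [5.0, 4.0, 3.0, 2.0, 1.0]         # anti-correlated predictions

print("baseline R²:", r_squared(y_true, mean_pred))
print("bad model R²:", r_squared(y_true, bad_pred), "RMSE:", rmse(y_true, bad_pred))
```

Here the anti-correlated predictions have a sum of squared residuals four times the total sum of squares, so R² comes out at −3, while RMSE still reports the error in the target's own units.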

The decision table

The full recipe at a glance: the four core tasks that cover the vast majority of supervised deep learning, followed by three common extensions.

Task | Target shape | Output activation | Loss function
Binary classification | scalar in {0, 1} | σ(z) (sigmoid) | BCE = −[y log ŷ + (1−y) log(1−ŷ)]
Multi-class (one label) | int in {0, ..., K−1} or one-hot | softmax(z) ∈ Δ^(K−1) | Categorical CE = −log ŷ_true
Multi-label | multi-hot vector in {0, 1}^K | σ(z_k) per class (independent, NOT softmax) | Σ_k BCE(ŷ_k, y_k)
Regression | real number(s) in ℝ (or constrained range) | none (linear); σ if bounded [0, 1]; softplus if > 0 | MSE / MAE / Huber / Quantile
Imbalanced classification | as above, but rare class | σ or softmax as appropriate | Focal loss, or weighted BCE/CE
Sequence (next token) | int sequence over V tokens | softmax over V | Σ_t categorical CE, masked to response
Embedding learning | pairs / triplets / batches | L2-normalize embedding | Triplet, contrastive, or InfoNCE
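In code, the activation and loss columns of the table are usually fused into a single function that takes raw logits, which avoids overflow in exp and log(0). PyTorch's `F.binary_cross_entropy_with_logits` and `F.cross_entropy` both expect logits for this reason. The following standard-library sketch shows the idea for the three classification rows. The function names are illustrative assumptions, not library APIs.

```python
import math

def bce_with_logits(z, y):
    """Sigmoid + BCE on a raw logit z, rewritten as softplus(z) - y*z for stability."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def cross_entropy_from_logits(logits, target_index):
    """Softmax + categorical CE via the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_index]

def multilabel_loss(logits, targets):
    """Multi-label recipe: one independent sigmoid + BCE term per class."""
    return sum(bce_with_logits(z, y) for z, y in zip(logits, targets))

print(bce_with_logits(0.0, 1.0))                        # uninformed binary guess
print(cross_entropy_from_logits([0.0, 0.0, 0.0], 1))    # uniform over 3 classes
print(multilabel_loss([4.0, -4.0, 0.0], [1.0, 0.0, 1.0]))
print(bce_with_logits(1000.0, 0.0))                     # extreme logit, no overflow
```

A practical consequence: when a framework loss already takes logits, applying sigmoid or softmax in the model as well applies the activation twice.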

One last principle. The output activation and the loss are a matched pair: they are designed so that the gradient with respect to the logits z simplifies. Sigmoid + BCE gives ∂L/∂z = ŷ − y. Softmax + categorical CE gives ∂L/∂z = softmax(z) − onehot(y). Mixing components from different recipes (e.g., softmax + MSE, or sigmoid + categorical CE) almost always either silently slows training or encodes the wrong objective. Pick the recipe that matches your task, then resist the urge to deviate.
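The matched-pair gradients can be verified with finite differences. This is a standard-library sketch with arbitrary example logits. It compares a numerical derivative of each loss with the closed forms quoted above, then evaluates the analytic gradient of a mismatched pair (sigmoid + MSE) to show what "silently slows training" means.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)
    exps = [math.exp(v - m) for v in zs]
    total = sum(exps)
    return [e / total for e in exps]

def bce(z, y):
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def cce(zs, k):
    return -math.log(softmax(zs)[k])

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Sigmoid + BCE: gradient w.r.t. the logit should equal y_hat - y.
z, y = 0.7, 1.0
grad_bce = numeric_grad(lambda v: bce(v, y), z)
expected_bce = sigmoid(z) - y

# Softmax + categorical CE: gradient should equal softmax(z) - onehot(y).
logits, true_class = [2.0, -1.0, 0.5], 2

def cce_with_logit(i, value):
    """CE loss after replacing logit i by value."""
    return cce(logits[:i] + [value] + logits[i + 1:], true_class)

grad_cce = [numeric_grad(lambda v, i=i: cce_with_logit(i, v), logits[i])
            for i in range(len(logits))]
expected_cce = [p - (1.0 if i == true_class else 0.0)
                for i, p in enumerate(softmax(logits))]

# Mismatched pair (sigmoid + MSE): the chain rule adds a y_hat * (1 - y_hat)
# factor, so a confidently wrong prediction receives almost no gradient.
def mse_sigmoid_grad(z, y):
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

print(grad_bce, expected_bce)
print(grad_cce, expected_cce)
print(sigmoid(-10.0) - 1.0, mse_sigmoid_grad(-10.0, 1.0))
```

At z = −10 with y = 1, the BCE gradient has magnitude close to 1 while the MSE gradient is on the order of 10⁻⁴: the prediction is badly wrong in both cases, but only the matched pair produces a useful learning signal.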

§ 39 Loss function reference

Name | Formula | Use case
MSE / L2 | (1/n) Σ(y − ŷ)² | Default regression; Gaussian noise assumption
MAE / L1 | (1/n) Σ|y − ŷ| | Robust regression; outlier-prone targets
Huber | quadratic for |r| ≤ δ, linear else | Bounding-box regression; RL value targets
Quantile | max(τr, (τ−1)r) | Prediction intervals; probabilistic forecasts
Binary CE | −[y log ŷ + (1−y) log(1−ŷ)] | Binary classification; multi-label
Categorical CE | −log ŷ_true | Multi-class classification; default deep classification
Focal | −(1−pₜ)^γ log pₜ | Severely imbalanced classification; dense detection
Hinge | max(0, 1 − y·ŷ) | SVMs; margin-based classification
KL divergence | Σ p log(p/q) | VAE; distillation; RL policy regularization
Label smoothing | CE with target = (1−ε) onehot + ε/K | Image classification; transformer training
Triplet | max(0, d(a,p) − d(a,n) + α) | Face recognition; embedding learning
Contrastive (pair) | y·d² + (1−y)·max(0, m−d)² | Siamese networks; signature verification
InfoNCE | −log(exp(sim⁺/τ) / Σ exp(sim/τ)) | CLIP; SimCLR; sentence embeddings; modern retrieval
Next-token CE | −Σ_t log p(x_t|x_{<t}) | LLM pretraining; autoregressive generation
Masked LM | −Σ_{i ∈ masked} log p(x_i|context) | BERT-family encoder pretraining
Distillation | α·CE + (1−α)·T²·KL | DistilBERT; smaller LLM variants; mobile models
SFT | next-token CE on response tokens only | Instruction tuning; first stage of alignment
Reward model (BT) | −log σ(r(y_w) − r(y_l)) | Training reward models for RLHF
PPO | clipped policy gradient + KL penalty | RLHF alignment (ChatGPT, LLaMA-Chat era)
DPO | −log σ(β·[log(π(y_w)/π_ref(y_w)) − log(π(y_l)/π_ref(y_l))]) | Modern alignment; preference tuning without RL
KTO | prospect-theoretic single-label preference | Binary good/bad feedback; unbalanced label data
ORPO | SFT + odds-ratio penalty | Single-stage instruction + preference tuning
IPO | squared margin (smoother DPO) | Noisy preference data; long training runs
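Three of the more compact rows above (Quantile, Reward model, DPO) are easier to read as code. The sketch below is a scalar, standard-library illustration under stated assumptions: function and argument names are hypothetical, r is taken as y − ŷ, and the DPO inputs are assumed to be summed sequence log-probabilities of the chosen (y_w) and rejected (y_l) responses under the policy and the frozen reference model.

```python
import math

def log_sigmoid(x):
    """Numerically stable log σ(x), computed as -softplus(-x)."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))

def pinball_loss(y, pred, tau):
    """Quantile loss max(τ·r, (τ−1)·r) with r = y − pred."""
    r = y - pred
    return max(tau * r, (tau - 1.0) * r)

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: −log σ(r(y_w) − r(y_l))."""
    return -log_sigmoid(r_chosen - r_rejected)

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: −log σ(β · (chosen log-ratio − rejected log-ratio))."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # log π(y_w) / π_ref(y_w)
    rejected_ratio = logp_rejected - ref_logp_rejected  # log π(y_l) / π_ref(y_l)
    return -log_sigmoid(beta * (chosen_ratio - rejected_ratio))

# At τ = 0.9, under-prediction costs nine times more than over-prediction.
print(pinball_loss(10.0, 8.0, 0.9), pinball_loss(8.0, 10.0, 0.9))

# A policy identical to the reference gives the same loss as a coin flip.
print(dpo_loss(-11.0, -11.0, -11.0, -11.0))

# Raising the chosen response and lowering the rejected one reduces the loss.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

The reward-model and DPO losses share the same −log σ(·) form. The difference is what goes inside: an explicit learned reward in the first case, and an implicit reward β·log(π/π_ref) in the second.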

A final structural observation. The history of losses is a history of asking better questions. MSE asks "how far off?" Cross-entropy asks "how confident, in which direction?" Triplet asks "is this closer than that?" InfoNCE asks "which one among N is the match?" DPO asks "which response do you prefer?" Each shift unlocked new capabilities — robust regression, calibrated classification, learned embeddings, multimodal alignment, instruction following. Pick the loss that asks the question you actually want answered.