Activation & Loss Functions, Visualized

Notebook excerpts

01
§ 01 Why activations exist
A linear layer computes y = Wx + b. Stack two of them: y = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂). Two linear layers collapse into one. Three collapse into one. A thousand collapse into one. No matter how deep you make a purely linear network, it can only represent linear (strictly, affine) functions. A nonlinear activation between layers is what prevents this collapse.
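The collapse can be checked numerically. The sketch below uses small hand-picked matrices (W1, b1, W2, b2 and x are illustrative values, not from the note) and plain-Python helpers to confirm that two stacked linear layers give the same output as one layer with weights W₂W₁ and bias W₂b₁ + b₂.

```python
def matvec(W, x):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    # Matrix-matrix product A @ B for list-of-rows matrices.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Illustrative values only.
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [-1.0, 0.0]
x = [3.0, -2.0]

# Two layers applied one after the other.
stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# One layer with the merged weights and bias.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
collapsed = add(matvec(W, x), b)

print(stacked, collapsed)  # both are [-2.0, 8.5]
```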
02
§ 03 Tanh — hyperbolic tangent
Tanh is sigmoid's zero-centered cousin. It maps real numbers to (−1, 1) instead of (0, 1), preserving sign information and centering outputs on zero. For decades this was the default activation in hidden layers, because the centered output produces more balanced gradients and faster convergence than sigmoid.
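The "cousin" relationship is exact: tanh(x) = 2σ(2x) − 1, so tanh is a sigmoid stretched to (−1, 1) and shifted to be centered on zero. A minimal check, with helper names chosen here for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    # Rescale sigmoid's (0, 1) range to (-1, 1).
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, tanh_via_sigmoid(x), math.tanh(x))
```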
03
§ 04 ReLU — rectified linear unit
The activation that made deep learning practical. ReLU is shockingly simple (output the input if positive, zero otherwise), but this simplicity changed everything. For the positive half, the derivative is exactly 1, so the activation itself does not attenuate gradients as they flow back through deep stacks. Combined with later techniques such as residual connections and normalization layers, this made training networks with 50, 100, or even 500 layers feasible.
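A rough way to see the gradient argument is to multiply activation derivatives along a chain of layers. The sketch below ignores weights entirely (a deliberate simplification) and uses a hypothetical depth of 20: on the positive side ReLU contributes a factor of exactly 1 per layer, while sigmoid contributes at most 0.25.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Derivative is exactly 1 for positive inputs and 0 otherwise.
    return 1.0 if x > 0 else 0.0

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

depth = 20  # hypothetical number of stacked layers

relu_chain = relu_grad(2.0) ** depth        # stays at 1.0
sigmoid_chain = sigmoid_grad(0.0) ** depth  # sigmoid's best case: 0.25 ** 20

print(relu_chain, sigmoid_chain)
```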
04
§ 05 Leaky ReLU and PReLU
Leaky ReLU was the first attempt to address dying neurons: units whose inputs stay negative, so that ReLU outputs zero and passes back zero gradient. Instead of zeroing negative inputs entirely, it scales them by a small slope, typically 0.01. The negative side still has a gradient, so dead neurons can come back to life. PReLU (Parametric ReLU) takes this one step further by making the slope a learnable parameter, optimized jointly with the network weights.
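A minimal scalar sketch of both ideas follows. The function names are chosen here for illustration; `prelu_slope_grad` is the derivative of the output with respect to the slope, which is what lets PReLU learn it.

```python
def leaky_relu(x, slope=0.01):
    # Pass positive inputs through; scale negative inputs by a small slope.
    return x if x > 0 else slope * x

def leaky_relu_grad(x, slope=0.01):
    # The gradient never reaches zero, so a unit cannot get permanently stuck.
    return 1.0 if x > 0 else slope

def prelu_slope_grad(x):
    # d(output)/d(slope): nonzero only where the slope is actually used.
    return 0.0 if x > 0 else x

print(leaky_relu(-2.0), leaky_relu_grad(-2.0), prelu_slope_grad(-2.0))
```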
05
§ 06 ELU and SELU
ELU (Exponential Linear Unit) uses an exponential curve on the negative side, smoothly approaching −α as x → −∞. Its negative side is therefore bounded, unlike Leaky ReLU's, and with α = 1 the function is smooth at zero. SELU (Scaled ELU) carefully chooses α and a scaling factor λ such that activations through deep networks self-normalize toward zero mean and unit variance, which can remove the need for batch normalization.
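Written out, ELU is x for positive inputs and α(eˣ − 1) otherwise, and SELU is λ times ELU with fixed constants. The sketch below uses the commonly quoted SELU constants (α ≈ 1.6733, λ ≈ 1.0507); treat the exact digits as an assumption to check against your framework.

```python
import math

# Commonly quoted SELU constants (assumed here, not taken from the note).
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

def elu(x, alpha=1.0):
    # expm1(x) computes exp(x) - 1 accurately for small x.
    return x if x > 0 else alpha * math.expm1(x)

def selu(x):
    return SELU_LAMBDA * elu(x, SELU_ALPHA)

print(elu(-50.0))   # very close to -alpha = -1.0
print(selu(1.0))    # equals SELU_LAMBDA
print(selu(-50.0))  # very close to -SELU_LAMBDA * SELU_ALPHA
```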
06
§ 07 GELU — Gaussian Error Linear Unit
GELU is where activation functions enter the era of large language models. Introduced in 2016 and adopted by BERT, GPT-2, GPT-3, and most transformer encoders, it's a smooth nonlinearity that weights inputs by their value under the standard normal CDF: GELU(x) = x · Φ(x). The intuition is that GELU multiplies x by the probability that a standard Gaussian draw falls below x, a probabilistic version of ReLU's hard gate.
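The exact form GELU(x) = x · Φ(x), with Φ the standard normal CDF, can be written with the error function from the standard library. This is a scalar sketch; frameworks often also offer a tanh-based approximation, which is not shown here.

```python
import math

def normal_cdf(x):
    # Probability that a standard normal draw is <= x.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    return x * normal_cdf(x)

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, gelu(x))
```

For large positive x the gate approaches 1 and GELU behaves like the identity; for large negative x the gate approaches 0, matching ReLU's behavior at the extremes.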
07
§ 08 Swish (SiLU)
Discovered by an automated search over candidate activation functions at Google in 2017, Swish is even simpler than GELU but behaves similarly. It's just x · σ(x), the input gated by its own sigmoid. PyTorch calls the same function SiLU (Sigmoid Linear Unit). The two names refer to the same function, with Swish sometimes generalized to include a scaling parameter β, fixed or learnable: x · σ(βx).
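A scalar sketch of x · σ(βx) follows, with β defaulting to 1 (the SiLU case). The large-β calls illustrate a known limiting behavior: as β grows, the sigmoid gate sharpens and the function approaches ReLU.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x, beta=1.0):
    # beta = 1 gives SiLU; a larger beta makes the gate steeper.
    return x * sigmoid(beta * x)

print(swish(1.0))              # about 0.731
print(swish(2.0, beta=50.0))   # close to 2.0, like ReLU
print(swish(-2.0, beta=50.0))  # close to 0.0, like ReLU
```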
08
§ 09 Mish
Mish takes the Swish idea further by replacing the sigmoid gate with a softplus-and-tanh composition. The result is a smooth, non-monotonic curve similar to Swish but with slightly different behavior in the negative tail. Mish gained attention through the YOLOv4 object detection paper, where replacing Leaky ReLU with Mish gave consistent improvements.
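The softplus-and-tanh composition is Mish(x) = x · tanh(softplus(x)), with softplus(x) = ln(1 + eˣ). The scalar sketch below is not numerically hardened (exp overflows for very large x) and is meant only to show the shape, including the dip that makes the curve non-monotonic.

```python
import math

def softplus(x):
    # ln(1 + e^x); fine for moderate x, overflows for x above roughly 700.
    return math.log1p(math.exp(x))

def mish(x):
    return x * math.tanh(softplus(x))

for x in (-5.0, -1.0, -0.1, 0.0, 1.0):
    print(x, mish(x))
```

The value at x = −1 is lower than at both x = −0.1 and x = −5, which is the non-monotonic dip in the negative region.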
09
§ 11 GLU, SwiGLU, GeGLU — gated linear units
These aren't activation functions in the strict sense — they're entire layer architectures. A Gated Linear Unit splits a linear projection into two halves, applies an activation to one half, and element-wise multiplies them together. The "gate" decides how much of the other half to let through. SwiGLU (Swish-GLU) and GeGLU (GELU-GLU) are the most successful variants and now dominate state-of-the-art LLM feedforward layers.
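As a sketch of the split-and-gate idea, the function below applies one linear projection, splits the result into a gate half and a value half, and multiplies SiLU(gate) by value. The names `swiglu`, `W` and `b` are illustrative; real implementations often use two separate weight matrices, which is equivalent to one projection that is later split.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def swiglu(x, W, b):
    # One linear projection producing an even number of outputs.
    h = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    half = len(h) // 2
    gate, value = h[:half], h[half:]
    # The activated gate decides how much of each value passes through.
    return [silu(g) * v for g, v in zip(gate, value)]

# Illustrative values: the projection is the identity, so gate = 1, value = 2.
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
out = swiglu(x, W, b)
print(out)  # one output: silu(1.0) * 2.0
```

Swapping `silu` for a GELU function would give the GeGLU variant under the same structure.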
10
§ 13 Mean squared error (MSE / L2)
The default loss for regression. MSE penalizes the squared distance between prediction and target, then averages across the dataset. The squaring has two effects: it makes the loss non-negative, and it punishes large errors disproportionately. An error of 4 contributes 16 to the sum, while an error of 1 contributes 1.
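The 16-to-1 ratio is easy to reproduce. The sketch below uses a two-point dataset with illustrative values where only the first prediction is wrong.

```python
def mse(preds, targets):
    # Mean of squared residuals.
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

targets = [0.0, 0.0]
small = mse([1.0, 0.0], targets)  # one error of 1 -> 1 / 2 = 0.5
large = mse([4.0, 0.0], targets)  # one error of 4 -> 16 / 2 = 8.0

print(small, large, large / small)  # 0.5 8.0 16.0
```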
11
§ 14 Mean absolute error (MAE / L1)
MAE is the natural alternative to MSE. Instead of squaring residuals, it takes their absolute value. This makes every error contribute linearly to the loss: an error of 4 contributes 4, not 16. The practical consequence is robustness: the gradient from an outlier has the same magnitude as the gradient from any other point, so a single outlier cannot dominate the training signal.
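One way to see the robustness is to ask which constant prediction minimizes MAE on data with an outlier. In the sketch below (illustrative data), the median gives a lower MAE than the mean, which the outlier drags upward.

```python
import statistics

def mae(preds, targets):
    # Mean of absolute residuals.
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is the outlier

median_pred = statistics.median(data)  # 3.0
mean_pred = statistics.mean(data)      # 22.0

mae_at_median = mae([median_pred] * len(data), data)
mae_at_mean = mae([mean_pred] * len(data), data)

print(mae_at_median, mae_at_mean)
```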
12
§ 15 Huber loss (smooth L1)
Huber is the compromise. It's quadratic for small errors (like MSE) and linear for large errors (like MAE). A threshold parameter δ controls where the transition happens. The result is a loss that's smooth at zero, has bounded gradient magnitude for outliers, and behaves like MSE near the optimum — getting the best of both worlds.
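In its usual form, Huber loss is ½e² when |e| ≤ δ and δ(|e| − ½δ) otherwise, which makes the two pieces meet with matching value and slope at |e| = δ. A scalar sketch with δ = 1 as an assumed default:

```python
def huber(error, delta=1.0):
    a = abs(error)
    if a <= delta:
        return 0.5 * error ** 2        # quadratic zone, like MSE
    return delta * (a - 0.5 * delta)   # linear zone, like MAE

def huber_grad(error, delta=1.0):
    # The gradient magnitude is capped at delta for large errors.
    if abs(error) <= delta:
        return error
    return delta if error > 0 else -delta

print(huber(0.5), huber(4.0))               # 0.125 3.5
print(huber_grad(0.5), huber_grad(100.0))   # 0.5 1.0
```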
13
§ 16 Quantile (pinball) loss
MSE targets the conditional mean and MAE the conditional median. Quantile loss generalizes this: train with quantile level τ and the network learns to predict the τ-th quantile of the conditional distribution. Set τ = 0.5 and you recover MAE up to a constant factor (median regression); set τ = 0.9 and you get the 90th percentile. This is essential for probabilistic forecasting.
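The pinball loss for a single point is max(τ·e, (τ − 1)·e) with e = target − prediction. The sketch below shows the asymmetry at τ = 0.9: under-predicting costs nine times as much as over-predicting by the same amount, which is what pushes the prediction up toward the 90th percentile.

```python
def pinball(pred, target, tau):
    e = target - pred
    # Positive e (under-prediction) is weighted by tau,
    # negative e (over-prediction) by 1 - tau.
    return max(tau * e, (tau - 1.0) * e)

print(pinball(0.0, 2.0, 0.5))  # 1.0, i.e. half the absolute error
print(pinball(0.0, 2.0, 0.9))  # under-prediction by 2: about 1.8
print(pinball(2.0, 0.0, 0.9))  # over-prediction by 2: about 0.2
```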
14
§ 17 Binary cross-entropy
The loss function that powers logistic regression and every binary neural classifier. BCE measures the negative log-likelihood of the true label under the predicted probability. If you're confident and right, the loss is near zero. If you're confident and wrong, the loss explodes — that's the asymmetry that makes it so effective.
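For one example with label y in {0, 1} and predicted probability p, BCE is −[y ln p + (1 − y) ln(1 − p)]. The sketch below clamps p away from 0 and 1 with a small `eps` (an assumption added here to avoid log(0)) and shows how differently confident-right and confident-wrong predictions are penalized.

```python
import math

def bce(y, p, eps=1e-12):
    # Clamp to avoid log(0) when p is exactly 0 or 1.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

print(bce(1, 0.99))  # confident and right: about 0.01
print(bce(1, 0.01))  # confident and wrong: about 4.6
```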
15
§ 18 Categorical cross-entropy
The multi-class generalization of BCE. The model produces a softmax distribution over K classes; cross-entropy measures the negative log-probability assigned to the true class. Only the probability assigned to the correct class matters — the rest of the distribution doesn't affect the loss directly (though it affects it indirectly through the softmax normalization).
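A compact sketch computes the loss from raw logits through a log-softmax, subtracting the maximum logit first for numerical stability. The second pair of calls illustrates the indirect effect mentioned above: raising a wrong class's logit increases the loss through the normalizer.

```python
import math

def log_softmax(logits):
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def cross_entropy(logits, true_class):
    # Negative log-probability assigned to the true class.
    return -log_softmax(logits)[true_class]

print(cross_entropy([0.0, 0.0, 0.0], 0))  # uniform over 3 classes: ln 3
print(cross_entropy([2.0, 1.0, 0.0], 0))
print(cross_entropy([2.0, 3.0, 0.0], 0))  # larger: a rival logit went up
```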
16
§ 19 Focal loss
Focal loss was introduced for dense object detection, where the foreground/background imbalance is extreme, roughly 1000:1. Standard cross-entropy was failing because the gradient signal was dominated by the vast number of easy background examples. Focal loss multiplies cross-entropy by a factor (1 − pₜ)^γ, where pₜ is the probability the model assigns to the true class. The factor down-weights easy examples (where the model is already confidently correct), so hard examples (where the model is wrong or unsure) dominate the loss by comparison.
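For a single example, writing p_t for the probability assigned to the true class, focal loss is −(1 − p_t)^γ · ln p_t. The sketch below omits the optional class-balancing weight and assumes γ = 2 as the default; setting γ = 0 recovers plain cross-entropy.

```python
import math

def focal_loss(p_t, gamma=2.0):
    # (1 - p_t) ** gamma shrinks toward 0 as the model becomes confident.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy = focal_loss(0.9)  # cross-entropy scaled by about 0.01
hard = focal_loss(0.1)  # cross-entropy scaled by about 0.81

print(easy, hard, hard / easy)
```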
17
§ 20 Hinge loss
The loss that defines support vector machines. Hinge loss treats classification as a margin-maximization problem rather than a probability problem. As long as the model's score for the true class beats every other class's score by at least 1, the loss is zero. If a wrong class comes within 1 of the right class, the loss starts to grow linearly.
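The multi-class description above corresponds to max(0, 1 + max_{j≠y} s_j − s_y), one common formulation (others sum over all rival classes instead of taking the strongest one). A sketch under that assumption:

```python
def multiclass_hinge(scores, true_class, margin=1.0):
    # Strongest competitor among the wrong classes.
    rival = max(s for j, s in enumerate(scores) if j != true_class)
    # Zero once the true class leads by at least the margin.
    return max(0.0, margin + rival - scores[true_class])

print(multiclass_hinge([3.0, 1.0, 0.5], 0))  # 0.0, margin satisfied
print(multiclass_hinge([3.0, 2.5, 0.0], 0))  # 0.5, rival is within 1
print(multiclass_hinge([1.0, 3.0, 0.0], 0))  # 3.0, wrong class is ahead
```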
18
§ 21 KL divergence
In classification, cross-entropy is usually computed between a prediction and a hard label. KL divergence measures the "distance" between two full probability distributions. It's not symmetric (KL(P‖Q) is not the same as KL(Q‖P)), so it is not a true distance metric, but it's never negative, it's zero exactly when the two distributions match, and it grows as they diverge.
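For discrete distributions, KL(P‖Q) = Σ pᵢ ln(pᵢ / qᵢ). The sketch below uses two illustrative two-outcome distributions to show both the zero-at-equality property and the asymmetry; it assumes q has no zero entries wherever p is positive.

```python
import math

def kl_divergence(p, q):
    # Terms with p_i = 0 contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]

print(kl_divergence(P, P))  # 0.0
print(kl_divergence(P, Q))  # about 0.511
print(kl_divergence(Q, P))  # about 0.368
```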
19
§ 22 Label smoothing
Label smoothing isn't a new loss function — it's a modification of the targets used inside cross-entropy. Instead of using one-hot labels (1 for the true class, 0 for all others), label smoothing uses 1 − ε for the true class and ε / (K − 1) for each other class. The model is no longer asked to be 100% confident in any prediction.
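The target construction described above can be sketched directly. Note that some libraries use a slightly different convention that spreads ε over all K classes including the true one; the version below follows the 1 − ε and ε / (K − 1) form from the text, with ε = 0.1 as an assumed default.

```python
def smooth_labels(true_class, num_classes, eps=0.1):
    # Spread eps evenly over the K - 1 wrong classes.
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if k == true_class else off_value
            for k in range(num_classes)]

targets = smooth_labels(true_class=2, num_classes=5)
print(targets)       # 0.9 at index 2, about 0.025 elsewhere
print(sum(targets))  # still sums to 1 (up to rounding)
```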
20
§ 23 Triplet loss
The classic embedding-learning loss. A triplet consists of an anchor, a positive example (similar to the anchor) and a negative (dissimilar). The loss pushes the anchor-positive distance below the anchor-negative distance by at least a margin α. After training, similar items cluster together in embedding space and dissimilar ones are far apart.
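With d as a distance function, the loss for one triplet is max(0, d(a, p) − d(a, n) + α). The sketch below uses Euclidean distance and a margin of 0.2, both assumptions for illustration; the parameter `margin` plays the role of α.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Zero once the negative is farther than the positive by the margin.
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)

anchor = [0.0, 0.0]
near, far = [0.0, 1.0], [3.0, 4.0]  # distances 1 and 5 from the anchor

print(triplet_loss(anchor, near, far))  # 0.0, already well separated
print(triplet_loss(anchor, far, near))  # 5 - 1 + 0.2, roles reversed
```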
21
§ 24 Contrastive loss (pairwise)
The predecessor to triplet loss. Operates on pairs labeled as "similar" or "dissimilar". For similar pairs, minimize distance directly. For dissimilar pairs, push apart up to a margin — beyond that, contribute no gradient.
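One common form squares the distance for similar pairs and squares the margin shortfall for dissimilar pairs. The sketch below assumes that form (constant factors such as ½ are omitted) and takes a precomputed distance as input.

```python
def contrastive_loss(distance, similar, margin=1.0):
    if similar:
        # Similar pairs: pull together.
        return distance ** 2
    # Dissimilar pairs: push apart, but only while inside the margin.
    return max(0.0, margin - distance) ** 2

print(contrastive_loss(0.5, similar=True))   # 0.25
print(contrastive_loss(0.5, similar=False))  # 0.25, still inside the margin
print(contrastive_loss(2.0, similar=False))  # 0.0, beyond the margin
```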
22
§ 25 InfoNCE — the modern contrastive loss
InfoNCE (the name comes from Noise Contrastive Estimation) is what made contrastive learning explode in 2020. Instead of working with single triplets, it treats one batch as a giant classification problem: given an anchor, identify the one correct positive among N candidates, the other N − 1 being negatives. The loss is just multi-class cross-entropy where the classes are the items in the batch and the logits are similarities, usually divided by a temperature.
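For one anchor, the computation reduces to a softmax cross-entropy over similarity scores. The sketch below takes precomputed similarities between the anchor and every candidate in the batch; the temperature of 0.1 is an assumed, commonly used value.

```python
import math

def info_nce(similarities, positive_index, temperature=0.1):
    # Similarities act as logits after temperature scaling.
    logits = [s / temperature for s in similarities]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    # Cross-entropy with the positive as the "correct class".
    return log_z - logits[positive_index]

sims = [0.9, 0.1, 0.0, -0.2]  # illustrative anchor-to-candidate similarities

print(info_nce(sims, positive_index=0))  # small: the positive stands out
print(info_nce(sims, positive_index=1))  # large: a negative scores higher
print(info_nce([0.3] * 4, 0))            # no signal at all: ln 4
```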
23
§ 26 Next-token cross-entropy — the pretraining loss
The objective that powers every autoregressive language model. At each position in a sequence, the model predicts a probability distribution over the vocabulary for what comes next. The loss is cross-entropy between that distribution and the actual next token. Average across the entire sequence, average across the entire dataset, and you have the pretraining loss for GPT-2, GPT-3, GPT-4, LLaMA, Mistral, and every other autoregressive LLM.
24
§ 27 Masked language modeling loss (BERT)
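BERT took a different approach. Instead of predicting the next token, it randomly selects 15% of tokens for prediction (most are replaced by a [MASK] token, the rest by a random token or left unchanged) and asks the model to recover the originals, conditioning on the rest of the sequence, both left and right context. The loss is cross-entropy computed only at the selected positions. This is what makes BERT bidirectional, and what makes it better suited to understanding tasks than to generation tasks.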
25
§ 28 Perplexity
Perplexity isn't a training loss. It's the most common evaluation metric for language models, and it's just the exponential of the cross-entropy loss. If cross-entropy is "average surprise per token" (in nats when computed with natural logs, in bits with base-2 logs), perplexity is "the effective number of equally likely tokens the model is choosing among at each step." Lower is better.
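The relationship is a one-liner: average the negative log-probabilities of the observed tokens, then exponentiate. In the sketch below, a model that assigns probability 0.25 to every observed token has perplexity 4, as if it were choosing uniformly among four options.

```python
import math

def perplexity(token_probs):
    # Cross-entropy in nats: mean negative log-probability per token.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([0.25] * 8))       # about 4.0
print(perplexity([1.0, 1.0, 1.0]))  # 1.0, a perfect model
```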
26
§ 29 Knowledge distillation loss
Distillation transfers knowledge from a large "teacher" model to a smaller "student" model. The student learns to match the teacher's softmax distribution, not just the hard target labels. The teacher's "dark knowledge" — the relative probabilities it assigns to non-target classes — turns out to be more informative than the labels alone.
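A common formulation softens both distributions with a temperature T and minimizes the KL divergence from teacher to student, scaled by T². The sketch below assumes that formulation (temperature is not mentioned above) and omits the usual extra term that mixes in hard-label cross-entropy.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    p = softmax(teacher_logits, temperature)  # soft targets
    q = softmax(student_logits, temperature)  # student prediction
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    # The T**2 factor keeps gradient scale comparable across temperatures.
    return temperature ** 2 * kl

teacher = [1.0, 2.0, 3.0]  # illustrative logits
print(distillation_loss([1.0, 2.0, 3.0], teacher))  # 0.0, exact match
print(distillation_loss([3.0, 2.0, 1.0], teacher))  # positive
```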
27
§ 30 SFT — supervised fine-tuning
After pretraining produces a base language model that knows how to predict text, supervised fine-tuning teaches it to follow instructions. The data is curated: humans write (or curate) example prompts and ideal responses. The loss is the same next-token cross-entropy as pretraining, but applied only to the response tokens, not the prompt.
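The response-only masking can be sketched with per-token log-probabilities and a boolean mask. The numbers below are illustrative: the two prompt tokens have poor log-probabilities, but they do not affect the loss because they are masked out.

```python
def sft_loss(token_log_probs, is_response):
    # Keep the negative log-likelihood of response tokens only.
    kept = [-lp for lp, keep in zip(token_log_probs, is_response) if keep]
    return sum(kept) / len(kept)

# Log-probability the model assigned to each actual token (illustrative).
log_probs = [-5.0, -4.0, -1.0, -0.5]
# First two tokens are the prompt, last two are the response.
mask = [False, False, True, True]

loss = sft_loss(log_probs, mask)
print(loss)  # (1.0 + 0.5) / 2 = 0.75
```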
28
§ 31 Reward model loss (Bradley-Terry)
To go beyond SFT, modern alignment uses human preferences. Annotators are shown two model responses to the same prompt and asked which is better. The reward model is trained to assign higher scalar scores to preferred responses than to dispreferred ones. The loss is derived from the Bradley-Terry preference model — under which the probability that response y₁ is preferred over y₂ is the sigmoid of their reward difference.
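Under the Bradley-Terry model, P(y₁ preferred over y₂) = σ(r₁ − r₂), and the training loss is the negative log of that probability for the response the annotator chose. A scalar sketch with illustrative reward values:

```python
import math

def preference_prob(r1, r2):
    # Sigmoid of the reward difference.
    return 1.0 / (1.0 + math.exp(-(r1 - r2)))

def reward_model_loss(r_chosen, r_rejected):
    # Negative log-likelihood of the observed human preference.
    return -math.log(preference_prob(r_chosen, r_rejected))

print(preference_prob(1.0, 1.0))    # 0.5, no preference
print(reward_model_loss(2.0, 0.0))  # small: ranking agrees with the label
print(reward_model_loss(0.0, 2.0))  # large: ranking contradicts the label
```

Only the difference between the two rewards matters, so the absolute scale of the scores is not pinned down by this loss.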
29
§ 32 PPO — Proximal Policy Optimization (RLHF)
With a reward model in hand, the next step is to fine-tune the language model to produce responses that score high under the reward model — without drifting too far from the original SFT model and losing fluency. PPO is the policy gradient algorithm that achieves this. The objective combines the expected reward with a KL penalty against a frozen reference (the SFT model) to prevent reward hacking.
30
§ 33 DPO — Direct Preference Optimization
The breakthrough paper of 2023. DPO showed that the two RLHF stages after SFT (train a reward model, then run PPO with a KL penalty against the reference) can be replaced by a single supervised loss directly on preference pairs, one whose optimum matches that of the KL-constrained reward-maximization objective. The reward model is implicit, encoded in the policy itself. No reward training, no rollouts, no PPO: just a single-stage loss almost as simple as SFT.
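For one preference pair, the DPO loss is −ln σ(β · [(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]), where y_w is the chosen response and y_l the rejected one. The sketch below takes sequence-level log-probabilities as plain numbers; the argument names and β = 0.1 are illustrative assumptions.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit reward: how much more the policy likes a response
    # than the frozen reference does.
    chosen_gain = pi_chosen - ref_chosen
    rejected_gain = pi_rejected - ref_rejected
    margin = chosen_gain - rejected_gain
    # -log(sigmoid(z)) written as log(1 + exp(-z)).
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: no margin, loss is ln 2.
print(dpo_loss(-11.0, -11.0, -11.0, -11.0))
# Policy raised the chosen response and lowered the rejected one.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```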
31
§ 34 KTO, ORPO, IPO — the DPO family
DPO inspired an entire family of follow-ups, each addressing some weakness of the original.
32
§ 39 Loss function reference
A final structural observation. The history of losses is a history of asking better questions. MSE asks "how far off?" Cross-entropy asks "how confident, in which direction?" Triplet asks "is this closer than that?" InfoNCE asks "which one among N is the match?" DPO asks "which response do you prefer?" Each shift unlocked new capabilities — robust regression, calibrated classification, learned embeddings, multimodal alignment, instruction following. Pick the loss that asks the question you actually want answered.