Comprehensive Master Notes: The Transformer

Architecture, Mathematics, and Flow — with numeric vector examples at every operation

Anchor example: "The cat sat" → "El gato se sentó"

Original paper: Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017 (arXiv:1706.03762).

1. Architectural Taxonomy

Type	Attention dynamics	Purpose	Models
Encoder-Only	Bi-directional: every token sees all others	Contextual embedding extraction	BERT, RoBERTa
Decoder-Only	Causal: each token sees only past	Autoregressive text generation	GPT-4, Llama, Claude
Encoder-Decoder	Bi-dir encoder + causal decoder + cross-attn bridge	Sequence-to-sequence mapping	T5, BART, Whisper

2. Text Representation: From Words to Matrices

Step A: Tokenization & Vocabulary Mapping

source string: "The cat sat" → ["The", "cat", "sat"] → token IDs $[42, 108, 311]$
target string: "El gato se sentó" → ["<SOS>", "El", "gato", "se", "sentó"]

Step B: Token Embedding Lookup

Each token ID indexes a row from the learned weight matrix $W_{\text{emb}} \in \mathbb{R}^{|V| \times d_{\text{model}}}$.

operation: $e_t = W_{\text{emb}}[\,\text{id}_t,\,:\,]$
numeric (d=4): $W_{\text{emb}}[42] = [0.21, -0.55, 0.83, 0.12]$ ← "The"
$W_{\text{emb}}[108] = [0.67, 0.33, -0.41, 0.89]$ ← "cat"
$W_{\text{emb}}[311] = [-0.12, 0.71, 0.56, -0.34]$ ← "sat"

why this operation
Integer IDs carry no geometric relationships. The embedding maps them into a continuous space where, after training, semantically similar words tend to cluster, which is what makes smooth gradient-based learning possible.

Step C: Positional Encoding Addition

sine / cosine: $PE_{(pos,\,2i)} = \sin\!\big(pos / 10000^{\,2i/d}\big)$
$PE_{(pos,\,2i+1)} = \cos\!\big(pos / 10000^{\,2i/d}\big)$
position 0: $PE_0 = [0.00, 1.00, 0.00, 1.00]$
position 1: $PE_1 = [0.84, 0.54, 0.01, 1.00]$
position 2: $PE_2 = [0.91, -0.42, 0.02, 1.00]$
final input (add): $X[0] = [0.21,-0.55,0.83,0.12] + [0.00,1.00,0.00,1.00] = [\mathbf{0.21,\,0.45,\,0.83,\,1.12}]$
$X[1] = [0.67,0.33,-0.41,0.89] + [0.84,0.54,0.01,1.00] = [\mathbf{1.51,\,0.87,\,-0.40,\,1.89}]$
$X[2] = [-0.12,0.71,0.56,-0.34] + [0.91,-0.42,0.02,1.00] = [\mathbf{0.79,\,0.29,\,0.58,\,0.66}]$

Each dimension is a sinusoid of a different wavelength (geometric from $2\pi$ to $2\pi\cdot10000$). Early dims wiggle fast, later dims slowly — together a unique multi-scale fingerprint per position.
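The positional vectors and the summed inputs above can be recomputed with a few lines of Python. This is a minimal sketch that assumes the toy setting of these notes ($d_{\text{model}} = 4$ and the three made-up embedding rows from Step B); the function name `positional_encoding` is illustrative, not taken from any library.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

# Toy embeddings from Step B (d_model = 4).
embeddings = [
    [0.21, -0.55, 0.83, 0.12],   # "The"
    [0.67, 0.33, -0.41, 0.89],   # "cat"
    [-0.12, 0.71, 0.56, -0.34],  # "sat"
]

# X[pos] = embedding + positional encoding, element by element.
X = [
    [e + p for e, p in zip(emb, positional_encoding(pos, 4))]
    for pos, emb in enumerate(embeddings)
]

for row in X:
    print([round(v, 2) for v in row])
```

Because the encoding depends only on `pos` and the dimension index, real implementations typically precompute it once for the maximum sequence length and reuse it.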

why addition (not concatenation)
Addition is a geometric shift — it preserves the embedding's semantic content while sliding each token into a unique position-dependent region. Concatenation would double $d_{\text{model}}$, wasting parameters and breaking residual shape constraints.

deeper: why sinusoids — relative position is linear

Let $\omega = 10000^{-2i/d}$ be a frequency. For that dimension pair, shifting the position by a fixed offset $k$ is a fixed rotation, independent of $pos$:

$$\begin{bmatrix}\sin\omega(pos{+}k)\\ \cos\omega(pos{+}k)\end{bmatrix} = \begin{bmatrix}\cos\omega k & \sin\omega k\\ -\sin\omega k & \cos\omega k\end{bmatrix}\begin{bmatrix}\sin\omega\,pos\\ \cos\omega\,pos\end{bmatrix}.$$

So $PE_{pos+k} = R_k\, PE_{pos}$ for a position-independent matrix $R_k$. A linear attention projection can therefore learn to attend "k positions back" with a single weight pattern that works at every absolute position. Vaswani et al. chose sinusoids partly on the hypothesis that this property may let the model extrapolate to sequence lengths longer than those seen in training.
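The rotation identity is easy to check numerically. The sketch below assumes one frequency ($i = 1$, $d = 4$, so $\omega = 0.01$) and an arbitrary offset $k = 3$; the helper names `pe_pair` and `rotate` are hypothetical.

```python
import math

def pe_pair(pos, omega):
    """The (sin, cos) pair of one frequency at position pos."""
    return (math.sin(omega * pos), math.cos(omega * pos))

def rotate(pair, omega, k):
    """Apply R_k = [[cos wk, sin wk], [-sin wk, cos wk]] to a (sin, cos) pair."""
    s, c = pair
    ck, sk = math.cos(omega * k), math.sin(omega * k)
    return (ck * s + sk * c, -sk * s + ck * c)

omega = 10000 ** (-2 * 1 / 4)  # i = 1, d = 4  ->  0.01
k = 3

# The same R_k must map PE(pos) to PE(pos + k) at every absolute position.
max_err = 0.0
for pos in (0, 5, 50, 500):
    shifted = rotate(pe_pair(pos, omega), omega, k)
    direct = pe_pair(pos + k, omega)
    max_err = max(max_err, abs(shifted[0] - direct[0]), abs(shifted[1] - direct[1]))

print(max_err)  # floating-point noise only
```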

This yields the Encoder Input Matrix $X_{\text{enc}} \in \mathbb{R}^{3 \times d_{\text{model}}}$:

$X_{\text{enc}} = \begin{bmatrix} 0.21 & 0.45 & 0.83 & 1.12 \\ 1.51 & 0.87 & -0.40 & 1.89 \\ 0.79 & 0.29 & 0.58 & 0.66 \end{bmatrix}$ rows = "The", "cat", "sat"

3. Deep Dive: Q, K, V Projections

The input matrix $X$ is projected into three distinct semantic spaces through learned weight matrices:

projections: $Q = X W^Q, \qquad K = X W^K, \qquad V = X W^V$
shapes: $X\!: (3{\times}4) \cdot W^Q\!: (4{\times}4) = Q\!: (3{\times}4)$

Numeric Example

$W^Q$ (learned): $\begin{bmatrix} 0.1 & 0.3 & -0.2 & 0.5 \\ 0.4 & -0.1 & 0.6 & 0.2 \\ -0.3 & 0.7 & 0.1 & -0.4 \\ 0.2 & 0.0 & 0.3 & 0.8 \end{bmatrix}$
$Q[0]$ ("The"): $[0.21, 0.45, 0.83, 1.12]\, W^Q = [\mathbf{0.18, 0.60, 0.65, 0.76}]$
$Q[1]$ ("cat"): $[1.51, 0.87, -0.40, 1.89]\, W^Q = [\mathbf{1.00, 0.09, 0.75, 2.60}]$
$Q[2]$ ("sat"): $[0.79, 0.29, 0.58, 0.66]\, W^Q = [\mathbf{0.15, 0.61, 0.27, 0.75}]$

($K$ and $V$ are computed identically with different weight matrices $W^K, W^V$.)
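The projection is an ordinary row-vector-times-matrix product, so the $Q$ rows can be recomputed directly from $X_{\text{enc}}$ and the $W^Q$ printed above. A minimal sketch in plain Python follows; the helper name `row_times_matrix` is illustrative.

```python
def row_times_matrix(x, W):
    """Row vector (d,) times matrix (d, n) -> row vector (n,)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Encoder input rows (embedding + positional encoding), as printed in Section 2.
X = [
    [0.21, 0.45, 0.83, 1.12],   # "The"
    [1.51, 0.87, -0.40, 1.89],  # "cat"
    [0.79, 0.29, 0.58, 0.66],   # "sat"
]

# Toy query projection matrix from the numeric example.
W_Q = [
    [0.1, 0.3, -0.2, 0.5],
    [0.4, -0.1, 0.6, 0.2],
    [-0.3, 0.7, 0.1, -0.4],
    [0.2, 0.0, 0.3, 0.8],
]

Q = [row_times_matrix(x, W_Q) for x in X]

for token, q in zip(["The", "cat", "sat"], Q):
    print(token, [round(v, 2) for v in q])
```

The same function with $W^K$ or $W^V$ in place of `W_Q` would give $K$ and $V$; those matrices are not listed in these notes.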

Matrix	What it encodes	Analogy
Query (Q)	What this token is looking for	A search query: "I need context about…"
Key (K)	What this token advertises to others	A file label: "I contain information about…"
Value (V)	The actual content to retrieve if matched	The file contents returned when the label matches

why three separate projections
A single matrix cannot serve all three roles. The Query encodes what's needed (task-dependent), the Key encodes what's available (content-dependent), the Value encodes what to transmit. Three independently learned matrices decouple "matching" ($QK^\top$) from "content retrieval" ($V$).

4. Scaled Dot-Product Attention (The Full Equation)

equation: $\displaystyle \operatorname{Attention}(Q,K,V) = \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$

Step 1: $Q K^\top$ (Similarity Matrix)

operation: $(3{\times}4) \cdot (4{\times}3) = (3{\times}3)$ raw attention scores
numeric $Q K^\top$ (illustrative values; $K$ is not listed in these notes): $\begin{bmatrix} 2.1 & 3.8 & 1.2 \\ 1.5 & 5.2 & 2.8 \\ 1.9 & 2.4 & 3.6 \end{bmatrix}$ rows = queries "The", "cat", "sat"; columns = keys in the same order

why this operation
The dot product measures geometric alignment. If $Q_{\text{cat}}$ and $K_{\text{sat}}$ point in similar directions in the projected space, their dot product is large, so "cat" attends strongly to "sat". The whole operation is a differentiable similarity search run in parallel across the entire sequence.

Step 2: Divide by $\sqrt{d_k}$ (Scaling)

with $d_k = 4$: $\sqrt{4} = 2.0$
scaled scores: $\begin{bmatrix} 1.05 & 1.90 & 0.60 \\ 0.75 & 2.60 & 1.40 \\ 0.95 & 1.20 & 1.80 \end{bmatrix}$

deeper: where $\sqrt{d_k}$ comes from (variance argument)

Assume each component of $q,k$ is independent with mean $0$ and variance $1$. Then the score is a sum of $d_k$ independent products:

$$q\cdot k = \sum_{l=1}^{d_k} q_l k_l, \qquad \mathbb{E}[q\cdot k]=0, \qquad \operatorname{Var}(q\cdot k) = \sum_{l=1}^{d_k}\operatorname{Var}(q_l k_l) = d_k.$$

So the scores have standard deviation $\sqrt{d_k}$. Dividing by $\sqrt{d_k}$ restores unit variance, keeping softmax out of its saturated tails (where gradients vanish). At $d_k{=}64$ the unscaled scores would have std $8$ — softmax would collapse to a near-one-hot, killing the gradient.
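The variance claim can be checked with a small Monte Carlo experiment. This is a sketch under the same assumption as the derivation (independent standard-normal components); the function name, trial count, and seed are arbitrary choices.

```python
import random
import statistics

def score_variance(d_k, trials=4000, seed=0):
    """Empirical variance of q.k when all components are i.i.d. N(0, 1)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        scores.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pvariance(scores)

raw_var = score_variance(64)
# Dividing every score by sqrt(d_k) divides the variance by d_k.
scaled_var = raw_var / 64

print(round(raw_var, 1), round(scaled_var, 2))
```

The raw variance should land near $d_k = 64$ and the scaled variance near 1, up to sampling noise.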

Step 3: Softmax (Row-wise Normalization)

per row: $\operatorname{softmax}(z)_i = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$
row 0 ("The"): $\exp([1.05,1.90,0.60]) = [2.86, 6.69, 1.82],\ \textstyle\sum = 11.37 \Rightarrow [\mathbf{0.25, 0.59, 0.16}]$
row 1 ("cat"): $\exp([0.75,2.60,1.40]) = [2.12, 13.46, 4.06],\ \textstyle\sum = 19.64 \Rightarrow [\mathbf{0.11, 0.69, 0.21}]$
row 2 ("sat"): $\exp([0.95,1.20,1.80]) = [2.59, 3.32, 6.05],\ \textstyle\sum = 11.96 \Rightarrow [\mathbf{0.22, 0.28, 0.51}]$

Each row sums to 1 (a probability distribution), up to rounding of the displayed digits. "cat" puts 0.69 on itself; "sat" puts 0.51 on itself and spreads the rest over its context.

Step 4: Multiply by Value Matrix $V$

operation: $\text{weights}\,(3{\times}3) \cdot V\,(3{\times}4) = \text{output}\,(3{\times}4)$
$V$ (example): $V[\text{The}]{=}[0.50,-0.30,0.20,0.80],\ V[\text{cat}]{=}[0.90,0.60,-0.10,0.40],\ V[\text{sat}]{=}[-0.20,0.80,0.70,-0.50]$
output row "cat": $0.11\,V_{\text{The}} + 0.69\,V_{\text{cat}} + 0.21\,V_{\text{sat}}$
$= [0.055,-0.033,0.022,0.088] + [0.621,0.414,-0.069,0.276] + [-0.042,0.168,0.147,-0.105]$
$= [\mathbf{0.63, 0.55, 0.10, 0.26}]$

why this operation
Weighted information synthesis: "cat" becomes 69% its own content + 21% "sat" + 11% "The", absorbing context to form "cat-in-the-context-of-sitting". This is how Transformers build contextual representations without recurrence.
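Steps 2 to 4 can be strung together in a few lines. The sketch below starts from the illustrative $QK^\top$ matrix of Step 1 and the example $V$ rows, so it assumes those toy numbers rather than computing $K$; the function names are hypothetical. Because it keeps full precision instead of two-decimal weights, the last digit of a result can differ slightly from the hand calculation.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def attention_from_scores(scores, V, d_k):
    """Scale raw scores by sqrt(d_k), softmax each row, then average the V rows."""
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    output = [
        [sum(w[t] * V[t][j] for t in range(len(V))) for j in range(len(V[0]))]
        for w in weights
    ]
    return weights, output

# Illustrative raw scores Q K^T (rows = queries, columns = keys).
scores = [
    [2.1, 3.8, 1.2],
    [1.5, 5.2, 2.8],
    [1.9, 2.4, 3.6],
]
# Example value vectors for "The", "cat", "sat".
V = [
    [0.50, -0.30, 0.20, 0.80],
    [0.90, 0.60, -0.10, 0.40],
    [-0.20, 0.80, 0.70, -0.50],
]

weights, output = attention_from_scores(scores, V, d_k=4)
print([round(w, 2) for w in weights[1]])  # attention weights of "cat"
print([round(v, 2) for v in output[1]])   # contextualized "cat" vector
```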

deeper: the softmax Jacobian (why this is trainable)

Backprop through attention needs $\partial\,\text{softmax}/\partial z$. For $s = \operatorname{softmax}(z)$:

$$\frac{\partial s_i}{\partial z_j} = s_i\,(\delta_{ij} - s_j), \qquad J = \operatorname{diag}(s) - s\,s^\top,$$

where $\delta_{ij}$ is the Kronecker delta. The Jacobian is largest when probabilities are spread ($s_i \approx 1/n$) and collapses to $0$ when one entry saturates to $1$ — which is precisely why the $\sqrt{d_k}$ scaling above matters: it keeps $s$ away from the one-hot corner where $J \to 0$.
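A quick way to trust the closed form is to compare it with central finite differences, and to look at how large the Jacobian entries are for a spread versus a saturated distribution. This is a minimal sketch; the test vectors and helper names are arbitrary choices for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(s):
    """Closed form J[i][j] = s_i * (delta_ij - s_j)."""
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)] for i in range(n)]

def numeric_jacobian(z, h=1e-6):
    """Central finite differences of softmax with respect to each z_j."""
    n = len(z)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += h
        z_minus[j] -= h
        s_plus, s_minus = softmax(z_plus), softmax(z_minus)
        for i in range(n):
            J[i][j] = (s_plus[i] - s_minus[i]) / (2 * h)
    return J

def max_abs(J):
    return max(abs(v) for row in J for v in row)

z = [0.75, 2.60, 1.40]  # scaled scores of the "cat" row
J = softmax_jacobian(softmax(z))
J_num = numeric_jacobian(z)
fd_error = max(abs(J[i][j] - J_num[i][j]) for i in range(3) for j in range(3))

spread = max_abs(softmax_jacobian(softmax([0.0, 0.0, 0.0])))   # uniform s
peaked = max_abs(softmax_jacobian(softmax([0.0, 20.0, 0.0])))  # near one-hot s

print(fd_error, spread, peaked)
```

For the uniform case the largest entry is $\tfrac13\cdot\tfrac23 = \tfrac29$; for the near one-hot case every entry is vanishingly small.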

5. Multi-Head Attention

Instead of one attention computation, the model runs $h$ parallel attention heads, each with its own $Q,K,V$ projections from $d_{\text{model}}$ down to a smaller per-head width $d_k$.

split: $d_k = d_{\text{model}} / h$ (e.g., $512/8 = 64$ per head)
per head: $\text{head}_i = \operatorname{Attention}(X W^Q_i,\, X W^K_i,\, X W^V_i)$
concat: $\operatorname{MultiHead} = \operatorname{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$
shapes: each head $(T, 64)$ → concat $(T, 512)$ → after $W^O$ $(T, 512)$

The $h$ heads attend independently (and can specialize in different relationship types); their outputs are concatenated and mixed by $W^O$, so the output shape equals the input shape, ready for the residual add.

why multiple heads
A single head produces only one attention distribution per token, so it can emphasize only one kind of relationship at a time (e.g., syntactic agreement). Multiple heads can simultaneously track subject–verb links, adjective–noun pairs, positional proximity, etc. The concat + $W^O$ merges these diverse signals into one representation.

output shape guarantee
MultiHead output shape $= (T, d_{\text{model}})$, exactly matching $X$. Critical: residual addition next requires identical shapes.
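The shape bookkeeping can be illustrated without any attention math. The sketch below assumes the common implementation trick of projecting once to width $d_{\text{model}}$ and then cutting the result into $h$ column blocks of width $d_k$ (equivalent to $h$ separate $d_{\text{model}} \times d_k$ projections); `split_heads` and `concat_heads` are hypothetical helper names.

```python
def split_heads(M, h):
    """Cut a (T, d_model) matrix into h matrices of shape (T, d_k)."""
    d_model = len(M[0])
    if d_model % h != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    d_k = d_model // h
    return [[row[i * d_k:(i + 1) * d_k] for row in M] for i in range(h)]

def concat_heads(heads):
    """Inverse of split_heads: glue h (T, d_k) matrices back into (T, d_model)."""
    T = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(T)]

# Toy projected matrix: T = 2 tokens, d_model = 4, split into h = 2 heads.
M = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
heads = split_heads(M, h=2)
restored = concat_heads(heads)

print(heads)     # two (2, 2) blocks
print(restored)  # same (2, 4) shape as the input
```

In the real layer each head would run attention on its block before the concat, and $W^O$ would then mix the concatenated columns.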

6. The Residual Connection + Layer Normalization

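Around every sub-layer (attention or FFN), the Transformer wraps a skip connection (residual add) and applies Layer Normalization. The original paper normalizes after the add (Post-Norm); Sections 7 and 8 use the Pre-Norm ordering, which normalizes the sub-layer input instead.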

The Residual Addition

equation: $Z = X + \operatorname{MultiHeadAttn}(X)$
$X[1]$ (input "cat"): $[1.51, 0.87, -0.40, 1.89]$
$\text{Attn}[1]$: $[0.63, 0.55, 0.10, 0.26]$
$Z[1] = X[1] + \text{Attn}[1]$: $[\mathbf{2.14, 1.42, -0.30, 2.15}]$ ← residual sum

why this operation

Gradient highway: $\partial Z/\partial X = I + \partial\text{Attn}/\partial X$. The identity $I$ guarantees the gradient passes at full strength even if attention gradients vanish. A 12-layer Transformer has 24 sub-layers — without residuals, gradients decay through 24 nonlinearities.
Easier target: attention only learns the residual correction (what to add), not the full output.
Information preservation: if attention is garbage on some input, the original $X$ passes through unharmed.

Layer Normalization

per token (row): $\mu = \dfrac1d\sum_j Z_{ij}, \qquad \sigma^2 = \dfrac1d\sum_j (Z_{ij}-\mu)^2$
normalize: $\hat Z_{ij} = \dfrac{Z_{ij}-\mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad \text{out}_{ij} = \gamma_j\,\hat Z_{ij} + \beta_j$
numeric $Z[1]$: $[2.14, 1.42, -0.30, 2.15]$
mean: $\mu = (2.14 + 1.42 - 0.30 + 2.15)/4 = 1.3525$
variance: $\sigma^2 = \tfrac14\big[(0.79)^2+(0.07)^2+(-1.65)^2+(0.80)^2\big] \approx 1.00$
normalized: $\hat Z[1] = [\mathbf{0.79,\ 0.07,\ -1.65,\ 0.80}]$ ($\sqrt{1.00}\approx1.00$, taking $\gamma=1,\ \beta=0$; zero-mean, unit-var)

why LayerNorm after residual
The residual sum can take an unpredictable magnitude. LayerNorm re-centers/re-scales each token vector to a stable distribution, preventing activation blow-up through stacked layers. The $\epsilon$ (e.g., $10^{-5}$) inside the square root guards against division by zero when a token's features are nearly constant ($\sigma^2 \to 0$). Learned $\gamma,\beta$ let the network recover any scale it needs — normalization is not a hard constraint.
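The per-token normalization is short enough to write out in full. This sketch assumes $\gamma = 1$ and $\beta = 0$ by default (their untrained starting values) and $\epsilon = 10^{-5}$; the function name `layer_norm` is illustrative and not a library call.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one token vector across its features, then scale and shift."""
    d = len(x)
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d  # biased estimate: divide by d
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mu) * inv_std + b for v, g, b in zip(x, gamma, beta)]

Z1 = [2.14, 1.42, -0.30, 2.15]  # residual sum for "cat"
normalized = layer_norm(Z1)

print([round(v, 2) for v in normalized])
print(round(sum(normalized), 6))  # mean is (numerically) zero
```

Note that the statistics come from a single row, so the result does not depend on batch size or sequence length.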

why LayerNorm (not BatchNorm)
Transformers process variable-length sequences. BatchNorm normalizes across the batch — meaningless when position 3 of a 5-token sentence is grouped with position 3 of a 100-token one. LayerNorm normalizes across features within each token, independent of batch size or sequence length.

7. The Feed-Forward Network (FFN)

Applied identically and independently to each token position after the attention residual block.

equation: $\operatorname{FFN}(x) = \operatorname{GELU}(x W_1 + b_1)\,W_2 + b_2$
shapes: $x\!:(d_{\text{model}}) \to W_1\!:(d_{\text{model}}{\times}d_{ff}) \to (d_{ff}) \to W_2\!:(d_{ff}{\times}d_{\text{model}}) \to (d_{\text{model}})$
typical: $d_{\text{model}}=512,\ d_{ff}=2048$ (4× expansion)

deeper: what GELU actually is

GELU (Gaussian Error Linear Unit) gates an input by the probability a standard normal is below it — a smooth alternative to ReLU's hard gate:

$$\operatorname{GELU}(x) = x\,\Phi(x) = x\cdot \tfrac12\Big[1 + \operatorname{erf}\!\big(x/\sqrt2\big)\Big] \approx 0.5\,x\Big(1 + \tanh\!\big[\sqrt{2/\pi}\,(x + 0.044715\,x^3)\big]\Big).$$

Unlike ReLU it is smooth (nonzero gradient for small negative $x$) and dips slightly below zero for negative inputs (minimum about $-0.17$ near $x \approx -0.75$), which empirically trains transformers better. The tanh form is a fast approximation used in many implementations.
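Both forms of GELU fit in a few lines of standard-library Python, which makes it easy to see how close the tanh approximation is and that negative inputs are shrunk rather than clipped. In this sketch the sample inputs are the eight pre-activations used in the numeric example of this section; the function names are illustrative.

```python
import math

def gelu_exact(x):
    """x * Phi(x), with Phi the standard normal CDF written via erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """The tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

pre_activations = [0.3, -1.2, 0.8, 2.1, -0.4, 1.5, 0.0, -0.7]
activated = [gelu_exact(x) for x in pre_activations]

# Largest gap between the exact and approximate forms on these inputs.
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in pre_activations)

print([round(a, 2) for a in activated])
print(max_gap)
```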

Numeric Example $(d_{\text{model}}=4,\ d_{ff}=8)$

input (after LN): $x = [0.79, 0.07, -1.65, 0.80]$
expand $xW_1+b_1$: $[0.3, -1.2, 0.8, 2.1, -0.4, 1.5, 0.0, -0.7]$ (8 dims, illustrative; $W_1, b_1$ are not listed)
GELU: $[0.19, -0.14, 0.63, 2.06, -0.14, 1.40, 0.00, -0.17]$ (negatives shrunk toward zero, not clipped)
project $\cdot W_2+b_2$: $[\mathbf{0.45, -0.22, 0.81, 0.33}]$ (back to 4 dims, illustrative)

why the FFN exists
Attention mixes information between tokens, but its output is only a weighted average of value vectors: linear in $V$, with no nonlinear transform applied within a token. The FFN is a 2-layer MLP per position: it is where per-token feature transformation happens, where factual associations are thought to be stored in the weights, and where nonlinear per-token computation occurs. Without it, each layer could only re-average linearly projected value vectors.

The Second Residual Connection

residual add: $X_{\text{out}} = X_{\text{mid}} + \operatorname{FFN}(\operatorname{LayerNorm}(X_{\text{mid}}))$
numeric: $[2.14, 1.42, -0.30, 2.15] + [0.45, -0.22, 0.81, 0.33] = [\mathbf{2.59, 1.20, 0.51, 2.48}]$
then LayerNorm: zero-mean, unit-var → ready for the next layer

why a second residual here
Same principle: the FFN only learns the correction. Each encoder layer has exactly 2 residual connections (around attention, around FFN) → 2 gradient shortcuts per layer.

8. Full Encoder Block — Complete Vector Trace

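Two sub-blocks, each a sub-layer followed by a residual add. The skip connections are the gradient highways (the $I$ in $\partial(x+F(x))/\partial x$). Shown Pre-Norm (LN inside the residual branch), as in modern models. For brevity, the numeric trace reuses the attention output from Section 4, which was computed on the un-normalized $X$.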

sub-block 1: $A = \operatorname{MHSA}(\operatorname{LN}(X)); \quad X_{\text{mid}} = X + A$ ← skip #1
sub-block 2: $F = \operatorname{FFN}(\operatorname{LN}(X_{\text{mid}})); \quad X_{\text{out}} = X_{\text{mid}} + F$ ← skip #2

Vector trace for "cat" (row 1) through one layer:
  Input X[1]:          [1.51, 0.87, -0.40, 1.89]
  After attention A[1]: [0.63, 0.55, 0.10, 0.26]
  After residual #1:    [2.14, 1.42, -0.30, 2.15]
  After LayerNorm:      [0.79, 0.07, -1.65, 0.80]
  After FFN:            [0.45, -0.22, 0.81, 0.33]
  After residual #2:    [2.59, 1.20, 0.51, 2.48]
  After LayerNorm:      [1.02, -0.57, -1.35, 0.90] ← ready for layer 2

9. Causal Masking (Decoder Self-Attention)

During training the entire target sequence is fed at once. To prevent cheating (seeing future tokens), a mask is added to the attention scores before softmax.

mask matrix: $M = \begin{bmatrix} 0 & -\infty & -\infty \\ 0 & 0 & -\infty \\ 0 & 0 & 0 \end{bmatrix}$ row $t$ sees only $\le t$
application: $\text{masked} = \dfrac{QK^\top}{\sqrt{d_k}} + M$
effect: $e^{-\infty} = 0$ exactly → future positions get zero attention weight
numeric (row "El"): scores after mask $= [0.75, 2.60, -\infty] \Rightarrow \operatorname{softmax} = [0.14, 0.86, \mathbf{0.00}]$

why this operation
At inference the decoder generates one token at a time — it genuinely has no future tokens. The mask simulates this during training while letting the GPU compute all positions in parallel. Without masking, the model would "cheat" by reading ahead, then fail at inference when future tokens don't exist.
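The additive mask is simple to build and apply. The sketch below reuses the scaled score matrix from Section 4 purely as stand-in numbers for decoder self-attention scores (an assumption for illustration); `causal_mask` and `masked_softmax_rows` are hypothetical helper names.

```python
import math

NEG_INF = float("-inf")

def causal_mask(T):
    """Additive mask: 0 on and below the diagonal, -inf above it."""
    return [[0.0 if j <= i else NEG_INF for j in range(T)] for i in range(T)]

def softmax(z):
    m = max(z)  # the diagonal is never masked, so m is always finite
    exps = [math.exp(v - m) for v in z]  # math.exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

def masked_softmax_rows(scaled_scores):
    mask = causal_mask(len(scaled_scores))
    return [
        softmax([s + m for s, m in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scaled_scores, mask)
    ]

scaled_scores = [
    [1.05, 1.90, 0.60],
    [0.75, 2.60, 1.40],
    [0.95, 1.20, 1.80],
]
weights = masked_softmax_rows(scaled_scores)

for row in weights:
    print([round(w, 2) for w in row])
```

Row 0 can only attend to itself, row 1 splits its weight over the first two positions, and only the last row sees everything.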

10. Cross-Attention: Bridging Encoder to Decoder

In the decoder, after masked self-attention, each decoder token accesses the encoder's processed source representation. This is Cross-Attention.

source of $Q$: decoder hidden states (after masked self-attn + residual)
source of $K, V$: encoder final output (the memory $M$, not to be confused with the mask matrix $M$ of Section 9), projected by the cross-attention layer's own $W^K, W^V$
equation: $\operatorname{CrossAttn} = \operatorname{softmax}\!\left(\dfrac{Q_{\text{dec}} K_{\text{enc}}^\top}{\sqrt{d_k}}\right) V_{\text{enc}}$
shapes: $Q_{\text{dec}}(T_d, d_k)\cdot K_{\text{enc}}^\top(d_k, T_e) \to (T_d, T_e) \cdot V_{\text{enc}}(T_e, d_k) \to (T_d, d_k)$

Numeric: Decoder token "El" querying encoder

$Q_{\text{dec}}[\text{El}]$: $[0.82, -0.15, 0.93, 0.44]$
$K_{\text{enc}}[\text{The}]$: $[0.31, 0.72, 0.18, 0.55]$
$K_{\text{enc}}[\text{cat}]$: $[0.88, 0.41, -0.20, 0.93]$
$K_{\text{enc}}[\text{sat}]$: $[-0.15, 0.66, 0.71, -0.28]$
dot products: $Q\cdot K_{\text{The}} = 0.56,\quad Q\cdot K_{\text{cat}} = 0.88,\quad Q\cdot K_{\text{sat}} = 0.32$
scale + softmax: divide by $\sqrt{d_k} = 2$: $[0.28, 0.44, 0.16] \Rightarrow \operatorname{softmax} = [\mathbf{0.33, 0.38, 0.29}]$
interpretation: "El" attends most to "cat" (0.38) in this toy example; trained weights learn source↔target alignment.

why cross-attention is the pivot
It is the ONLY mechanism bridging the two languages. The decoder's Spanish tokens can access the English input only through cross-attention; the weights implicitly learn word alignment with no explicit alignment labels — replacing the alignment model of earlier statistical MT.
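The single-query cross-attention example can be recomputed from the listed vectors. This is a minimal sketch assuming $d_k = 4$ and the toy query and key values given for "El"; the value vectors $V_{\text{enc}}$ are not listed in these notes, so the code stops at the attention weights.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q_el = [0.82, -0.15, 0.93, 0.44]  # decoder query for "El"
encoder_keys = {
    "The": [0.31, 0.72, 0.18, 0.55],
    "cat": [0.88, 0.41, -0.20, 0.93],
    "sat": [-0.15, 0.66, 0.71, -0.28],
}

d_k = len(q_el)
raw = {tok: dot(q_el, k) for tok, k in encoder_keys.items()}
scaled = [raw[tok] / math.sqrt(d_k) for tok in encoder_keys]
weights = dict(zip(encoder_keys, softmax(scaled)))

print({tok: round(v, 2) for tok, v in raw.items()})
print({tok: round(w, 2) for tok, w in weights.items()})
```

With scores this close together the distribution stays fairly flat; sharper alignments appear only when trained queries and keys produce larger score gaps.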

11. Full Decoder Block — Complete Vector Trace

INPUT: Y = (T_dec, d_model) + Encoder Memory M = (T_enc, d_model)

══ SUB-BLOCK 1: MASKED SELF-ATTENTION + RESIDUAL ══
  SA = MaskedSelfAttn(LayerNorm(Y))      → (T_dec, d_model)
  Y_1 = Y + SA                         → SKIP CONNECTION #1

══ SUB-BLOCK 2: CROSS-ATTENTION + RESIDUAL ══
  CA = CrossAttn(Q=LayerNorm(Y_1), K=M, V=M) → (T_dec, d_model)
  Y_2 = Y_1 + CA                       → SKIP CONNECTION #2

══ SUB-BLOCK 3: FFN + RESIDUAL ══
  F = FFN(LayerNorm(Y_2))               → (T_dec, d_model)
  Y_out = Y_2 + F                     → SKIP CONNECTION #3

■ Each decoder layer has 3 residual connections
■ 6 decoder layers × 3 = 18 gradient highways in the decoder alone

12. Output Projection: From Vectors Back to Words

final LayerNorm: $h = \operatorname{LayerNorm}(Y_{\text{out}}[-1])$ ← newest token row
linear projection: $\text{logits} = h\,W_{\text{vocab}} + b, \quad (d_{\text{model}})\cdot(d_{\text{model}}{\times}|V|) \to (|V|)$
softmax: $P(\text{token}_j) = \dfrac{e^{\text{logit}_j}}{\sum_k e^{\text{logit}_k}}$
argmax: $\text{predicted\_id} = \arg\max_j P(\text{token}_j) \to$ detokenize $\to$ "El"

Numeric

$h$ (d=4): $[0.52, -0.28, -0.71, 0.47]$
$W_{\text{vocab}}$ (4×5): vocab = [the, cat, El, gato, sat]
logits: $[0.3, -1.2, \mathbf{3.8}, 1.1, -0.5]$
softmax: $[0.03, 0.01, \mathbf{0.89}, 0.06, 0.01]$
argmax: index 2 → "El" ✓
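The last two steps, softmax over the logits and argmax, can be checked directly. The sketch takes the five example logits as given (the matrix $W_{\text{vocab}}$ itself is not listed in these notes) and uses the toy five-word vocabulary.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["the", "cat", "El", "gato", "sat"]
logits = [0.3, -1.2, 3.8, 1.1, -0.5]

probs = softmax(logits)
predicted_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
predicted_token = vocab[predicted_id]

print([round(p, 2) for p in probs])
print(predicted_id, predicted_token)
```

Since softmax is monotonic, the argmax of the probabilities equals the argmax of the raw logits; the softmax matters when sampling or computing a loss.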

13. End-to-End Operational Lifecycle

[English: "The cat sat"]
      │
      ▼
[Tokenize → Embed → Add Positional] ────────► X_enc: (3, 512)
      │
      ▼
┌────────────────────────────────────────────────────────────┐
│ ENCODER (N=6 layers, each with 2 residual connections) │
│ Per layer: X = X + Attn(LN(X)) then X = X + FFN(LN(X)) │
│ Total: 12 skip connections in encoder │
└────────────────────────────────────────────────────────────┘
      │
      ▼
[Encoder Memory M] ─────────────────────────► (3, 512) frozen during decode
      │
      ▼
┌────────────────────────────────────────────────────────────┐
│ DECODER AUTOREGRESSIVE LOOP │
├────────────────────────────────────────────────────────────┤
│ ITER 1: input [<SOS>] │
│  → Masked Self-Attn + residual │
│  → Cross-Attn(Q=dec, K=M, V=M) + residual │
│  → FFN + residual → Linear → Softmax → predict "El" │
│ ITER 2: input [<SOS>, "El"] → predict "gato" │
│ ITER 3: input [<SOS>, "El", "gato"] → predict "se" │
│ ... until <EOS> predicted │
└────────────────────────────────────────────────────────────┘
      │
      ▼
[Detokenize] ──────────────────────────────► "El gato se sentó"
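The autoregressive loop in the diagram reduces to a short control structure. In this sketch the whole decoder forward pass plus argmax is replaced by a hypothetical lookup table (`TOY_NEXT`), an assumption made only so the loop can run on its own; a real model would compute the next token from the prefix and the encoder memory.

```python
def greedy_decode(next_token, sos="<SOS>", eos="<EOS>", max_len=16):
    """Feed the growing prefix back in until <EOS> or the length cap is hit."""
    tokens = [sos]
    while len(tokens) < max_len:
        predicted = next_token(tokens)
        if predicted == eos:
            break
        tokens.append(predicted)
    return tokens[1:]  # drop the <SOS> marker

# Hypothetical stand-in for "decoder stack -> linear -> softmax -> argmax".
TOY_NEXT = {
    ("<SOS>",): "El",
    ("<SOS>", "El"): "gato",
    ("<SOS>", "El", "gato"): "se",
    ("<SOS>", "El", "gato", "se"): "sentó",
    ("<SOS>", "El", "gato", "se", "sentó"): "<EOS>",
}

def toy_next_token(prefix):
    return TOY_NEXT[tuple(prefix)]

translation = greedy_decode(toy_next_token)
print(" ".join(translation))
```

The `max_len` cap guards against a model that never emits `<EOS>`.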

Residual Connection Census

Component	Sub-layers per block	Residuals per block	×N layers	Total skips
Encoder	Self-Attn + FFN	2	×6	12
Decoder	Masked-Attn + Cross-Attn + FFN	3	×6	18
Total gradient highways in the model				30

14. Interview Depth Q&A

Question	Strong answer
Why residual connections?	Each sub-layer's gradient is $\partial(x+F(x))/\partial x = I + \partial F/\partial x$. The $I$ term gives the gradient a direct path that does not shrink even when $\partial F/\partial x$ is tiny. Without it, the gradient is a product of 24 sub-layer Jacobians and can vanish, as in deep pre-ResNet CNNs.
Why LayerNorm not BatchNorm?	Variable-length sequences make batch statistics meaningless across positions. LayerNorm normalizes per-token across features, batch- and length-independent.
What does $\sqrt{d_k}$ prevent?	$\operatorname{Var}(q\cdot k)=d_k$; large variance saturates softmax → near-zero gradient. Dividing by $\sqrt{d_k}$ restores unit variance and the gradient-rich regime.
Softmax gradient?	$\partial s_i/\partial z_j = s_i(\delta_{ij}-s_j)$; vanishes as $s$ becomes one-hot — another reason for the $\sqrt{d_k}$ scaling.
Why 4× FFN expansion?	Attention is linear in $V$ (a weighted average). The FFN adds nonlinear per-token computation in a higher-dimensional space before projecting back; the 4× ratio ($d_{ff}=2048$ for $d_{\text{model}}=512$) is the original paper's choice and an empirical convention rather than a derived constant.
How does the causal mask enable parallel training?	All positions compute at once via matrix ops; the mask makes position $t$ see only $\le t$, simulating autoregression. Training needs $O(1)$ sequential steps per layer, not $O(T)$.
Where is knowledge stored?	Primarily in FFN weights ($W_1, W_2$); attention learns routing. "Knowledge neurons" are found in FFN layers.
Pre-Norm vs Post-Norm?	Post-Norm $\operatorname{LN}(x+F(x))$ (original paper) sends the gradient through the LN Jacobian; Pre-Norm $x+F(\operatorname{LN}(x))$ gives a clean identity shortcut, easier to train at scale (GPT, Llama).