why this operation: Integer IDs carry no geometric relationships. Embedding maps them into a continuous space where semantically similar words cluster, enabling smooth gradient-based learning.
Each dimension is a sinusoid of a different wavelength (geometric from \(2\pi\) to \(2\pi\cdot10000\)). Early dims wiggle fast, later dims slowly — together a unique multi-scale fingerprint per position.
why addition (not concatenation): Addition is a geometric shift that preserves the embedding's semantic content while sliding each token into a position-dependent region. Concatenation would double \(d_{\text{model}}\), adding parameters and breaking the shape that the residual connections require.
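The encoding and the addition step can be sketched as follows. This is a minimal pure-Python illustration, assuming the usual convention of sine on even dimensions and cosine on odd dimensions; the embedding values and the helper name `positional_encoding` are made up for the example.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    pe = []
    for i in range(d_model // 2):
        omega = 10000 ** (-2 * i / d_model)  # frequency of dimension pair i
        pe.append(math.sin(omega * pos))
        pe.append(math.cos(omega * pos))
    return pe

# Hypothetical 3-token embedding matrix with d_model = 4 (values are made up).
d_model = 4
embeddings = [
    [0.10, -0.20, 0.30, 0.40],
    [0.50, 0.10, -0.30, 0.20],
    [-0.40, 0.60, 0.00, 0.10],
]

# Element-wise addition keeps the shape (3, d_model);
# concatenation would produce (3, 2 * d_model) instead.
x_enc = [
    [e + p for e, p in zip(row, positional_encoding(pos, d_model))]
    for pos, row in enumerate(embeddings)
]
```

Position 0 always encodes to alternating 0 and 1 (since sin 0 = 0 and cos 0 = 1), which is a quick sanity check on any implementation.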
deeper: why sinusoids — relative position is linear
Let \(\omega = 10000^{-2i/d}\) be a frequency. For that dimension pair, shifting the position by a fixed offset \(k\) is a fixed rotation, independent of \(pos\):
$$\begin{bmatrix}\sin\omega(pos{+}k)\\ \cos\omega(pos{+}k)\end{bmatrix} = \begin{bmatrix}\cos\omega k & \sin\omega k\\ -\sin\omega k & \cos\omega k\end{bmatrix}\begin{bmatrix}\sin\omega\,pos\\ \cos\omega\,pos\end{bmatrix}.$$
So \(PE_{pos+k} = R_k\, PE_{pos}\) for a position-independent, block-diagonal matrix \(R_k\) (one \(2\times 2\) rotation per frequency). A linear attention projection can therefore learn to attend "k positions back" with a single weight pattern that works at every absolute position, a property intended to help the model extrapolate to sequence lengths unseen in training.
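The rotation identity is easy to check numerically. The sketch below applies the \(2\times 2\) matrix from the equation above to one dimension pair and compares it with the directly computed encoding at \(pos+k\); the particular values of `d`, `i`, and `k` are arbitrary choices for illustration.

```python
import math

def pe_pair(pos, omega):
    """The (sin, cos) pair for one frequency at a given position."""
    return (math.sin(omega * pos), math.cos(omega * pos))

def rotate(pair, omega, k):
    """Apply the fixed rotation R_k for offset k to a (sin, cos) pair."""
    s, c = pair
    return (
        math.cos(omega * k) * s + math.sin(omega * k) * c,
        -math.sin(omega * k) * s + math.cos(omega * k) * c,
    )

d, i, k = 512, 3, 5                 # arbitrary illustrative choices
omega = 10000 ** (-2 * i / d)

# Largest discrepancy between R_k * PE(pos) and PE(pos + k) over several positions.
max_err = max(
    abs(a - b)
    for pos in (0, 7, 100, 2500)
    for a, b in zip(rotate(pe_pair(pos, omega), omega, k), pe_pair(pos + k, omega))
)
```

The same `rotate` call is used for every `pos`, which is the point: the matrix depends only on the offset `k`.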
This yields the Encoder Input Matrix \(X_{\text{enc}} \in \mathbb{R}^{3 \times d_{\text{model}}}\):
(\(K\) and \(V\) are computed identically with different weight matrices \(W^K, W^V\).)
| Matrix | What it encodes | Analogy |
|---|---|---|
| Query (Q) | What this token is looking for | A search query: "I need context about…" |
| Key (K) | What this token advertises to others | A file label: "I contain information about…" |
| Value (V) | The actual content to retrieve if matched | The file contents returned when the label matches |
why three separate projections: A single matrix cannot serve all three roles. The Query encodes what's needed (task-dependent), the Key encodes what's available (content-dependent), the Value encodes what to transmit. Three independently learned matrices decouple "matching" (\(QK^\top\)) from "content retrieval" (\(V\)).
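As a minimal sketch of the three projections, the snippet below multiplies one input matrix by three separate weight matrices. All numbers are hypothetical and chosen so the products are easy to verify by hand; `matmul` is a small helper defined here, not a library function.

```python
def matmul(a, b):
    """Multiply an (n, m) list-of-lists by an (m, p) list-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical input: 3 tokens, d_model = 4.
X = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]

# Hypothetical learned weights, each of shape (d_model, d_k) with d_k = 2.
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W_V = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [1.0, -1.0]]

Q = matmul(X, W_Q)  # what each token is looking for
K = matmul(X, W_K)  # what each token advertises
V = matmul(X, W_V)  # what each token transmits if matched
```

The same `X` yields three different views because the weight matrices differ, which is the decoupling described above.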
4. Scaled Dot-Product Attention (The Full Equation)
why this operation: The dot product measures geometric alignment. If \(Q_{\text{cat}}\) and \(K_{\text{sat}}\) point in similar directions in the projected space, their dot product is large, so "cat" attends strongly to "sat". The result is a differentiable, parallel similarity search across the whole sequence at once.
So the scores have standard deviation \(\sqrt{d_k}\). Dividing by \(\sqrt{d_k}\) restores unit variance, keeping softmax out of its saturated tails (where gradients vanish). At \(d_k{=}64\) the unscaled scores would have std \(8\) — softmax would collapse to a near-one-hot, killing the gradient.
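The variance claim can be checked empirically. The sketch below assumes query and key components are independent draws from a standard normal distribution (the usual assumption behind the \(\sqrt{d_k}\) argument); the sample count and seed are arbitrary.

```python
import random

def dot_product_variance(d_k, n_samples=4000, seed=0):
    """Sample variance of q . k for random q, k with unit-variance components."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        scores.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(scores) / n_samples
    return sum((s - mean) ** 2 for s in scores) / n_samples

d_k = 64
raw_var = dot_product_variance(d_k)  # should land near d_k
scaled_var = raw_var / d_k           # dividing scores by sqrt(d_k) divides variance by d_k
```

With these assumptions `raw_var` comes out close to 64 and `scaled_var` close to 1, up to sampling noise.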
Each row sums to 1.0 (a probability distribution). Darker = more attention. "cat" puts 0.69 on itself; "sat" spreads onto "sat" (0.51) and its context.
why this operation: Weighted information synthesis. "cat" becomes roughly 69% its own content + 21% "sat" + 11% "The" (the shares are rounded, so they total slightly over 100%), absorbing context to form "cat-in-the-context-of-sitting". This is how Transformers build contextual representations without recurrence.
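Putting the pieces together, a minimal pure-Python sketch of scaled dot-product attention might look like this. The \(Q\), \(K\), \(V\) values are hypothetical and do not reproduce the weights quoted above; `matmul`, `softmax`, and `attention` are helpers defined for this example.

```python
import math

def matmul(a, b):
    """Multiply an (n, m) matrix by an (m, p) matrix (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning (output, attention weights)."""
    d_k = len(K[0])
    scores = matmul(Q, list(zip(*K)))  # (T, T) raw alignment scores
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Hypothetical projected matrices for 3 tokens with d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

output, weights = attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the sequence, and each row of `output` is the corresponding weighted average of the rows of `V`.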
deeper: the softmax Jacobian (why this is trainable)
Backprop through attention needs \(\partial\,\text{softmax}/\partial z\). For \(s = \operatorname{softmax}(z)\):
$$\frac{\partial s_i}{\partial z_j} = s_i\,(\delta_{ij} - s_j),$$
where \(\delta_{ij}\) is the Kronecker delta. The Jacobian is largest when probabilities are spread (\(s_i \approx 1/n\)) and collapses to \(0\) when one entry saturates to \(1\) — which is precisely why the \(\sqrt{d_k}\) scaling above matters: it keeps \(s\) away from the one-hot corner where \(J \to 0\).
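The saturation effect can be seen directly by building the Jacobian \(s_i(\delta_{ij} - s_j)\) for a spread input and for a nearly one-hot input. This is a small illustrative sketch; the two logit vectors are arbitrary.

```python
import math

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """J[i][j] = s_i * (delta_ij - s_j) for s = softmax(z)."""
    s = softmax(z)
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)] for i in range(n)]

def max_abs(J):
    """Largest absolute entry of a matrix."""
    return max(abs(v) for row in J for v in row)

spread = softmax_jacobian([0.0, 0.0, 0.0])      # s = [1/3, 1/3, 1/3]
saturated = softmax_jacobian([20.0, 0.0, 0.0])  # s is nearly one-hot
```

For the uniform case the largest entry is \(\tfrac13\cdot\tfrac23 = \tfrac29\); for the saturated case every entry is vanishingly small, so almost no gradient flows back through the softmax.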
5. Multi-Head Attention
Instead of one attention computation, the model runs \(h\) parallel attention heads, each with its own \(Q,K,V\) projections that map \(d_{\text{model}}\) down to a \(d_k\)-dimensional subspace.
split: \(d_k = d_{\text{model}} / h\) (e.g., \(512/8 = 64\) per head)
per head: \(\text{head}_i = \operatorname{Attention}(X W^Q_i,\, X W^K_i,\, X W^V_i)\)
concat: \(\operatorname{MultiHead} = \operatorname{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O\)
shapes: each head \((T, 64)\) → concat \((T, 512)\) → after \(W^O\) \((T, 512)\)
The dimension is sliced into \(h\) heads that attend independently (different relationship types), then concatenated and mixed by \(W^O\) — output shape equals input shape, ready for the residual add.
why multiple heads: A single head produces one attention pattern per token, so it tends to capture one relationship type at a time (e.g., syntactic agreement). Multiple heads can simultaneously track subject–verb links, adjective–noun pairs, positional proximity, etc. The concat + \(W^O\) merges these diverse signals into one representation.
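The split, per-head attention, concat, and \(W^O\) steps can be sketched in a few lines. This is an illustrative pure-Python version with small hypothetical sizes (\(d_{\text{model}}=8\), \(h=2\)) and random weights; the helper names are assumptions for the example, not part of any library.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n, m) matrix by an (m, p) matrix (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    scores = matmul(Q, list(zip(*K)))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, each of shape (d_model, d_k)."""
    outputs = [attention(matmul(X, wq), matmul(X, wk), matmul(X, wv))
               for wq, wk, wv in heads]
    # Concatenate along the feature axis: h matrices of (T, d_k) -> (T, h * d_k).
    concat = [[v for head in outputs for v in head[t]] for t in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)

def rand_matrix(rows, cols):
    """Hypothetical stand-in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

T, d_model, h = 3, 8, 2
d_k = d_model // h
X = rand_matrix(T, d_model)
heads = [tuple(rand_matrix(d_model, d_k) for _ in range(3)) for _ in range(h)]
W_O = rand_matrix(d_model, d_model)

out = multi_head(X, heads, W_O)  # shape (T, d_model), same as X
```

Because `out` has the same shape as `X`, it can be added straight back to the input in the residual connection.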
Gradient highway: \(\partial Z/\partial X = I + \partial\text{Attn}/\partial X\). The identity \(I\) guarantees the gradient passes at full strength even if attention gradients vanish. A 12-layer Transformer has 24 sub-layers — without residuals, gradients decay through 24 nonlinearities.
Easier target: attention only learns the residual correction (what to add), not the full output.
Information preservation: if attention is garbage on some input, the original \(X\) passes through unharmed.
why LayerNorm after residual: The residual sum can take an unpredictable magnitude. LayerNorm re-centers and re-scales each token vector to a stable distribution, preventing activation blow-up through stacked layers. The \(\epsilon\) (e.g., \(10^{-5}\)) inside the square root guards against division by zero when a token's features are nearly constant (\(\sigma^2 \to 0\)). Learned \(\gamma,\beta\) let the network recover any scale it needs, so normalization is not a hard constraint.
why LayerNorm (not BatchNorm): Transformers process variable-length sequences. BatchNorm normalizes across the batch, which is meaningless when position 3 of a 5-token sentence is grouped with position 3 of a 100-token one. LayerNorm normalizes across features within each token, independent of batch size or sequence length.
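A per-token LayerNorm is short enough to write out in full. The sketch below assumes the population variance over features and defaults \(\gamma = 1\), \(\beta = 0\); the input vectors are hypothetical.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one token vector across its features, then scale and shift."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n  # population variance over features
    gamma = gamma or [1.0] * n
    beta = beta or [0.0] * n
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

y = layer_norm([4.0, 2.0, -1.0, 3.0])         # hypothetical token vector
constant = layer_norm([5.0, 5.0, 5.0, 5.0])   # sigma^2 = 0: eps avoids division by zero

y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
```

Nothing here refers to other tokens or other sequences in the batch, which is why the result is independent of batch size and sequence length.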
7. The Feed-Forward Network (FFN)
Applied identically and independently to each token position after the attention residual block.
Unlike ReLU, GELU is smooth (nonzero gradient for small negative \(x\)) and dips slightly below zero for small negative \(x\), which empirically trains transformers better. The tanh form is the fast approximation used in practice.
Numeric Example \((d_{\text{model}}=4,\ d_{ff}=8)\)
why the FFN exists: Attention mixes information between tokens, but its output is only a weighted average of linearly projected values; it applies no nonlinear transform within a token. The FFN is a 2-layer MLP applied per position: it is where per-token feature transformation happens, where factual associations are thought to be stored in the weights, and where nonlinear computation occurs. Without it, each layer would reduce to a weighted average of linear projections.
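A per-token FFN can be sketched as below, assuming the activation discussed above is GELU. Both the exact form (via `math.erf`) and the tanh approximation are shown; the weights are random hypothetical values with \(d_{\text{model}}=4\) and \(d_{ff}=8\), not the ones from the numeric example.

```python
import math
import random

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Common tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def ffn(x, W1, b1, W2, b2):
    """Two-layer MLP for one token: expand to d_ff, apply GELU, project back."""
    hidden = [gelu(sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

rng = random.Random(0)
d_model, d_ff = 4, 8
# Hypothetical weights: W1 is (d_model, d_ff), W2 is (d_ff, d_model).
W1 = [[rng.uniform(-1.0, 1.0) for _ in range(d_ff)] for _ in range(d_model)]
W2 = [[rng.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in range(d_ff)]
b1 = [0.0] * d_ff
b2 = [0.0] * d_model

token = [0.5, -1.0, 0.25, 2.0]         # one hypothetical token vector
out = ffn(token, W1, b1, W2, b2)       # back to d_model features
```

The same `ffn` call would be applied to every token independently; no information moves between positions in this step.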
The Second Residual Connection
residual add: \(X_{\text{out}} = X_{\text{mid}} + \operatorname{FFN}(\operatorname{LayerNorm}(X_{\text{mid}}))\)
numeric: \([2.14, 1.42, -0.30, 2.15] + [0.45, -0.22, 0.81, 0.33] = [\mathbf{2.59, 1.20, 0.51, 2.48}]\)
then LayerNorm: zero-mean, unit-var → ready for the next layer
why a second residual here: The same principle applies; the FFN only learns the correction. Each encoder layer has exactly 2 residual connections (around attention, around FFN), giving 2 gradient shortcuts per layer.
8. Full Encoder Block — Complete Vector Trace
Two sub-blocks, each sub-layer → residual add. The dashed green skips are the gradient highways (\(I\) in \(\partial(x+F(x))/\partial x\)). Shown Pre-Norm (LN inside the residual), as in modern models.
Vector trace for "cat" (row 1) through one layer:
Input X[1]: [1.51, 0.87, -0.40, 1.89]
After attention A[1]: [0.63, 0.55, 0.10, 0.26]
After residual #1: [2.14, 1.42, -0.30, 2.15]
After LayerNorm: [0.79, 0.07, -1.65, 0.80]
After FFN: [0.45, -0.22, 0.81, 0.33]
After residual #2: [2.59, 1.20, 0.51, 2.48]
After LayerNorm: [1.02, -0.57, -1.35, 0.90] ← ready for layer 2
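The two residual additions in the trace are plain element-wise sums and can be re-derived in a few lines. The vectors are copied from the trace above; rounding to two decimals is an assumption made here to match the trace's precision.

```python
def add(a, b):
    """Element-wise residual add, rounded to the trace's two decimals."""
    return [round(x + y, 2) for x, y in zip(a, b)]

x = [1.51, 0.87, -0.40, 1.89]         # Input X[1]
attn_out = [0.63, 0.55, 0.10, 0.26]   # After attention A[1]
ffn_out = [0.45, -0.22, 0.81, 0.33]   # After FFN

x_mid = add(x, attn_out)     # residual #1
x_out = add(x_mid, ffn_out)  # residual #2
```

Note that residual #2 adds the FFN output to `x_mid` itself, not to its normalized version, which is what makes this the Pre-Norm arrangement.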
9. Causal Masking (Decoder Self-Attention)
During training the entire target sequence is fed at once. To prevent cheating (seeing future tokens), a mask is added to the attention scores before softmax.
why this operation: At inference the decoder generates one token at a time, so it genuinely has no future tokens. The mask simulates this during training while letting the GPU compute all positions in parallel. Without masking, the model would "cheat" by reading ahead, then fail at inference when future tokens don't exist.
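A causal mask is an additive matrix with \(-\infty\) above the diagonal, so masked positions receive exactly zero weight after softmax. The sketch below uses hypothetical raw scores for a 3-token target sequence.

```python
import math

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(T):
    """0 where position j <= i (visible), -inf where j > i (future)."""
    return [[0.0 if j <= i else -math.inf for j in range(T)] for i in range(T)]

# Hypothetical pre-softmax attention scores for 3 target tokens.
scores = [
    [0.5, 1.2, -0.3],
    [0.1, 0.4, 2.0],
    [1.0, 0.0, 0.7],
]

mask = causal_mask(3)
weights = [
    softmax([s + m for s, m in zip(score_row, mask_row)])
    for score_row, mask_row in zip(scores, mask)
]
```

All three rows are computed from one masked score matrix, yet row \(t\) only distributes weight over positions \(\le t\), mirroring what the decoder sees at inference.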
10. Cross-Attention: Bridging Encoder to Decoder
In the decoder, after masked self-attention, each decoder token accesses the encoder's processed source representation. This is Cross-Attention.
source of \(Q\): decoder hidden states (after masked self-attn + residual)
source of \(K, V\): encoder final output memory \(M\)
equation: \(\operatorname{CrossAttn} = \operatorname{softmax}\!\left(\dfrac{Q_{\text{dec}} K_{\text{enc}}^\top}{\sqrt{d_k}}\right) V_{\text{enc}}\)
shapes: \(Q_{\text{dec}}(T_d, d_k)\cdot K_{\text{enc}}^\top(d_k, T_e) \to (T_d, T_e) \cdot V_{\text{enc}}(T_e, d_k) \to (T_d, d_k)\)
why cross-attention is the pivot: It is the only mechanism bridging the two languages. The decoder's Spanish tokens can access the English input only through cross-attention; the weights implicitly learn word alignment with no explicit alignment labels, replacing the alignment model of earlier statistical MT.
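The shape bookkeeping for cross-attention can be checked with a small sketch in which the decoder and encoder lengths differ. All values are hypothetical (\(T_d = 2\), \(T_e = 3\), \(d_k = 2\)), and the helpers are defined locally for the example.

```python
import math

def matmul(a, b):
    """Multiply an (n, m) matrix by an (m, p) matrix (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(Q_dec, K_enc, V_enc):
    """Queries come from the decoder; keys and values come from encoder memory."""
    d_k = len(K_enc[0])
    scores = matmul(Q_dec, list(zip(*K_enc)))  # (T_d, T_e)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V_enc), weights     # (T_d, d_k), (T_d, T_e)

Q_dec = [[1.0, 0.0], [0.0, 1.0]]               # T_d = 2 decoder tokens
K_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # T_e = 3 encoder tokens
V_enc = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, weights = cross_attention(Q_dec, K_enc, V_enc)
```

Each row of `weights` is a soft alignment of one decoder token over the source tokens, and the output keeps the decoder's length.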
11. Full Decoder Block — Complete Vector Trace
INPUT: Y = (T_dec, d_model) + Encoder Memory M = (T_enc, d_model)
══ SUB-BLOCK 1: MASKED SELF-ATTENTION + RESIDUAL ══
SA = MaskedSelfAttn(LayerNorm(Y)) → (T_dec, d_model)
Y_1 = Y + SA → SKIP CONNECTION #1
[English: "The cat sat"]
│
▼
[Tokenize → Embed → Add Positional] ────────► X_enc: (3, 512)
│
▼
┌────────────────────────────────────────────────────────────┐
│ ENCODER (N=6 layers, each with 2 residual connections) │
│ Per layer: X = X + Attn(LN(X)) then X = X + FFN(LN(X)) │
│ Total: 12 skip connections in encoder │
└────────────────────────────────────────────────────────────┘
│
▼
[Encoder Memory M] ─────────────────────────► (3, 512) frozen during decode
│
▼
┌────────────────────────────────────────────────────────────┐
│ DECODER AUTOREGRESSIVE LOOP │
├────────────────────────────────────────────────────────────┤
│ ITER 1: input [<SOS>] │
│ → Masked Self-Attn + residual │
│ → Cross-Attn(Q=dec, K=M, V=M) + residual │
│ → FFN + residual → Linear → Softmax → predict "El" │
│ ITER 2: input [<SOS>, "El"] → predict "gato" │
│ ITER 3: input [<SOS>, "El", "gato"] → predict "se" │
│ ... until <EOS> predicted │
└────────────────────────────────────────────────────────────┘
│
▼
[Detokenize] ──────────────────────────────► "El gato se sentó"
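The autoregressive loop in the diagram can be sketched as a greedy decoder. Here `decode_step` is a hypothetical stand-in for the whole decoder stack plus Linear, Softmax, and argmax; it is just a lookup table reproducing the example's outputs, not a model.

```python
# Hypothetical lookup standing in for decoder + Linear + Softmax + argmax.
NEXT_TOKEN = {
    ("<SOS>",): "El",
    ("<SOS>", "El"): "gato",
    ("<SOS>", "El", "gato"): "se",
    ("<SOS>", "El", "gato", "se"): "sentó",
    ("<SOS>", "El", "gato", "se", "sentó"): "<EOS>",
}

def decode_step(memory, prefix):
    """Toy predictor; a real model would attend over `prefix` and `memory`."""
    return NEXT_TOKEN[tuple(prefix)]

def greedy_decode(memory, max_len=10):
    """Feed the growing prefix back in until <EOS> or the length limit."""
    tokens = ["<SOS>"]
    while len(tokens) < max_len:
        next_token = decode_step(memory, tokens)
        if next_token == "<EOS>":
            break
        tokens.append(next_token)
    return " ".join(tokens[1:])

# Placeholder for the encoder output; computed once and reused at every step.
memory = "encoder memory for 'The cat sat'"
translation = greedy_decode(memory)
```

The encoder memory is passed unchanged into every iteration, while the decoder input grows by one token per step.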
Residual Connection Census
| Component | Sub-layers per block | Residuals per block | ×N layers | Total skips |
|---|---|---|---|---|
| Encoder | Self-Attn + FFN | 2 | ×6 | 12 |
| Decoder | Masked-Attn + Cross-Attn + FFN | 3 | ×6 | 18 |
| Total gradient highways in the model | | | | 30 |
14. Interview Depth Q&A
| Question | Strong answer |
|---|---|
| Why residual connections? | Each sub-layer's gradient is \(\partial(x+F(x))/\partial x = I + \partial F/\partial x\). The \(I\) term gives the gradient a direct path that bypasses \(F\), so it survives even when \(\partial F/\partial x\) is tiny. Without it, gradients through 24 stacked sub-layers can vanish, as in deep pre-ResNet CNNs. |
| Why LayerNorm not BatchNorm? | Variable-length sequences make batch statistics meaningless across positions. LayerNorm normalizes per-token across features, batch- and length-independent. |
| What does \(\sqrt{d_k}\) prevent? | For independent zero-mean, unit-variance components, \(\operatorname{Var}(q\cdot k)=d_k\); large variance saturates softmax → near-zero gradient. Dividing by \(\sqrt{d_k}\) restores unit variance and the gradient-rich regime. |
| Softmax gradient? | \(\partial s_i/\partial z_j = s_i(\delta_{ij}-s_j)\); vanishes as \(s\) becomes one-hot, which is another reason for the \(\sqrt{d_k}\) scaling. |
| Why 4× FFN expansion? | Attention is linear in \(V\) (a weighted average). FFN adds nonlinear per-token computation in a higher-dimensional space before projecting back. |
| How does the causal mask enable parallel training? | All positions compute at once via matrix ops; the mask makes position \(t\) see only \(\le t\), simulating autoregression. The number of sequential steps over the sequence during training is \(O(1)\), not \(O(T)\). |
| Where is knowledge stored? | Primarily in FFN weights (\(W_1, W_2\)); attention learns routing. "Knowledge neurons" are found in FFN layers. |
| Pre-Norm vs Post-Norm? | Post-Norm \(\operatorname{LN}(x+F(x))\) (original paper) sends the gradient through the LN Jacobian; Pre-Norm \(x+F(\operatorname{LN}(x))\) gives a clean identity shortcut, easier to train at scale (GPT, Llama). |
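The difference between the two orderings shows up clearly when the sub-layer contributes nothing. The sketch below is illustrative only: it uses a LayerNorm without learned \(\gamma,\beta\), and `zero_sublayer` is a hypothetical sub-layer that outputs zeros.

```python
import math

def layer_norm(x, eps=1e-5):
    """Per-token LayerNorm with gamma = 1 and beta = 0."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm(x, sublayer):
    """Original-paper ordering: LN(x + F(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm(x, sublayer):
    """Pre-Norm ordering: x + F(LN(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def zero_sublayer(x):
    """Hypothetical sub-layer that contributes nothing."""
    return [0.0] * len(x)

x = [2.0, 4.0, 6.0, 8.0]
pre_out = pre_norm(x, zero_sublayer)    # x passes through untouched
post_out = post_norm(x, zero_sublayer)  # x is still rescaled by LayerNorm
```

With Pre-Norm the input passes through unchanged, so the skip path is a true identity; with Post-Norm even an inactive sub-layer leaves the signal, and hence the gradient, passing through LayerNorm.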