The Transformer — Architecture & Mathematics

Notebook excerpts

01
2. Text Representation: From Words to Matrices
Each token ID indexes a row of the learned weight matrix W_emb ∈ ℝ^{|V| × d_model}.
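The lookup can be sketched in a few lines of NumPy. The sizes and the random initialization here are assumptions chosen only for illustration; in a trained model W_emb would hold learned values.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
vocab_size, d_model = 10, 4

rng = np.random.default_rng(0)
# Random stand-in for the learned embedding matrix W_emb (|V| x d_model).
W_emb = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([3, 7, 3])  # a 3-token sequence; ID 3 appears twice
X = W_emb[token_ids]             # row lookup: one d_model vector per token

print(X.shape)  # (3, 4)
```

Because the lookup is plain row indexing, both occurrences of ID 3 receive the same vector at this stage.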
02
3. Deep Dive: Q, K, V Projections
The input matrix X is projected into three distinct semantic spaces through learned weight matrices:
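As a minimal sketch, the three projections are three matrix multiplications applied to the same input. The names W_Q, W_K, W_V, the sizes, and the random weights are assumptions made for this example.

```python
import numpy as np

# Hypothetical sizes for illustration.
seq_len, d_model, d_k = 5, 8, 4
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))  # one row per token

# Random stand-ins for the learned projection matrices (names assumed).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# The same X is projected three different ways.
Q = X @ W_Q  # queries
K = X @ W_K  # keys
V = X @ W_V  # values
```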
03
4. Scaled Dot-Product Attention (The Full Equation)
Assume the components of q and k are independent, each with mean 0 and variance 1. Then the score q · k is a sum of d_k independent products:
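Under those assumptions, the variance argument can be written out as follows. The symbol s for the raw score is introduced here only for convenience.

```latex
\begin{align*}
s &= q \cdot k = \sum_{i=1}^{d_k} q_i k_i \\
\mathbb{E}[q_i k_i] &= \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0 \\
\operatorname{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - 0 = 1 \\
\operatorname{Var}(s) &= \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k \\
\operatorname{Var}\!\left(\frac{s}{\sqrt{d_k}}\right) &= \frac{d_k}{d_k} = 1
\end{align*}
```

Dividing by √d_k therefore keeps the scores at unit variance regardless of d_k, which stops the softmax from saturating as the head dimension grows.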
04
5. Multi-Head Attention
Instead of a single attention computation, the model runs h attention heads in parallel, each with its own Q, K, V projections on a slice of d_model.
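A minimal NumPy sketch of this idea follows. It assumes d_model is divisible by h and implements the per-head projections as column slices of full d_model × d_model matrices. The output projection W_O and all sizes are assumptions of this example, not details taken from the excerpt.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    seq_len, d_model = X.shape
    d_k = d_model // h  # assumes d_model is divisible by h

    def split(M):
        # (seq_len, d_model) -> (h, seq_len, d_k): one slice per head
        return M.reshape(seq_len, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                               # (h, seq_len, d_k)
    # Concatenate the heads back into (seq_len, d_model).
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O, weights

# Hypothetical sizes and random stand-ins for learned weights.
rng = np.random.default_rng(0)
seq_len, d_model, h = 6, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
```

Each head produces its own seq_len × seq_len attention pattern, so different heads are free to focus on different relationships between tokens.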
05
6. The Residual Connection + Layer Normalization
After every sub-layer (attention or FFN), the Transformer applies a skip connection (residual add), then Layer Normalization.
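The add-then-normalize order described here can be sketched as below. The scale and shift parameters gamma and beta, the eps value, and the toy sub-layer are assumptions for illustration; in a real model the sub-layer would be attention or the FFN.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token vector over its feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    # Residual add first, then Layer Normalization.
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
seq_len, d_model = 3, 8  # hypothetical sizes
x = rng.normal(size=(seq_len, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)

# Hypothetical stand-in for a real sub-layer.
toy_sublayer = lambda t: 0.5 * t

y = add_and_norm(x, toy_sublayer, gamma, beta)
```

With gamma set to 1 and beta to 0, every output row has approximately zero mean and unit variance across its features.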
06
7. The Feed-Forward Network (FFN)
The FFN is applied identically and independently to each token position, after the attention residual block.
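The sketch below illustrates position-wise application. It assumes the common two-layer form with a ReLU in between; the hidden width d_ff, the activation, and the random weights are assumptions that the excerpt does not specify.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Two linear maps with a ReLU in between (assumed form).
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32  # hypothetical sizes
X = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Applying the FFN to the whole sequence at once...
Y = ffn(X, W1, b1, W2, b2)
# ...gives the same result as applying it to each position separately.
Y_rowwise = np.stack([ffn(X[i], W1, b1, W2, b2) for i in range(seq_len)])
```

The two results match because the same weights are used at every position and no information moves between positions inside the FFN.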
07
9. Causal Masking (Decoder Self-Attention)
During training, the entire target sequence is fed in at once. To prevent cheating (attending to future tokens), a mask is added to the attention scores before the softmax.
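One common way to build such a mask, sketched here as an assumption about the details, is a matrix that is 0 on and below the diagonal and negative infinity above it. Adding it to the scores drives the softmax weight of every future position to zero.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(n):
    # 0 on and below the diagonal, -inf strictly above it.
    return np.triu(np.full((n, n), -np.inf), k=1)

rng = np.random.default_rng(0)
n = 4  # hypothetical sequence length
scores = rng.normal(size=(n, n))  # stand-in for Q K^T / sqrt(d_k)

mask = causal_mask(n)
weights = softmax(scores + mask, axis=-1)
```

Row i of weights is nonzero only in columns 0 through i, so the first token can attend only to itself.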
08
10. Cross-Attention: Bridging Encoder to Decoder
In the decoder, after masked self-attention, each decoder token attends to the encoder's processed source representation. This is Cross-Attention.
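A minimal single-head sketch of cross-attention follows, in which queries come from the decoder states and keys and values come from the encoder output. The function name, the weight names, the sizes, and the random inputs are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec, enc, W_Q, W_K, W_V):
    Q = dec @ W_Q  # queries from the decoder side
    K = enc @ W_K  # keys from the encoder output
    V = enc @ W_V  # values from the encoder output
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (tgt_len, src_len)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
tgt_len, src_len, d_model = 3, 5, 8  # hypothetical sizes
dec = rng.normal(size=(tgt_len, d_model))  # decoder states after masked self-attention
enc = rng.normal(size=(src_len, d_model))  # encoder output
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = cross_attention(dec, enc, W_Q, W_K, W_V)
```

The weight matrix is tgt_len × src_len: each decoder position gets a distribution over source positions, and no causal mask is needed because the full source sequence is already available.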