The Transformer — Architecture & Mathematics

Notebook excerpts


  1. 01

    2. Text Representation: From Words to Matrices

    Each token ID indexes a row from the learned weight matrix \(W_{\text{emb}} \in \mathbb{R}^{|V| \times d_{\text{model}}}\).
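The row lookup described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the note's own code: the vocabulary size, model width, and token IDs are hypothetical toy values, and a random matrix stands in for the learned \(W_{\text{emb}}\).

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model = 10, 4                       # hypothetical toy sizes
W_emb = rng.normal(size=(vocab_size, d_model))    # random stand-in for the learned matrix

token_ids = np.array([3, 7, 3])                   # hypothetical IDs for a 3-token input

# Row lookup: each ID selects one row of W_emb.
X = W_emb[token_ids]                              # shape (3, d_model)

# Equivalent view: one-hot rows multiplied by W_emb pick out the same rows.
one_hot = np.eye(vocab_size)[token_ids]           # shape (3, vocab_size)
X_via_matmul = one_hot @ W_emb
```

Indexing and the one-hot product give the same matrix; indexing simply avoids building the \(|V|\)-wide one-hot rows. Repeated IDs (here `3`) map to identical rows.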

  2. 02

    3. Deep Dive: Q, K, V Projections

    The input matrix \(X\) is projected into three distinct semantic spaces through learned weight matrices:
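In the standard formulation the three projections are plain matrix products, \(Q = XW_Q\), \(K = XW_K\), \(V = XW_V\). A minimal NumPy sketch under that assumption follows; the sizes are hypothetical and random matrices stand in for the learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 5, 8, 4, 4        # hypothetical toy sizes

X = rng.normal(size=(n, d_model))        # one row per token
W_Q = rng.normal(size=(d_model, d_k))    # random stand-ins for learned weights
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q = X @ W_Q    # queries, shape (n, d_k)
K = X @ W_K    # keys,    shape (n, d_k)
V = X @ W_V    # values,  shape (n, d_v)
```

Each token's row of \(X\) is projected on its own; tokens only interact later, when queries are compared against keys.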

  3. 03

    4. Scaled Dot-Product Attention (The Full Equation)

    Assume each component of \(q,k\) is independent with mean \(0\) and variance \(1\). Then the score is a sum of \(d_k\) independent products:
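The sum referred to above can be written out explicitly. A short derivation, assuming in addition that the components of \(q\) are independent of the components of \(k\):

\[
s = q \cdot k = \sum_{i=1}^{d_k} q_i k_i
\]

\[
\mathbb{E}[q_i k_i] = \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0,
\qquad
\operatorname{Var}(q_i k_i) = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - 0 = 1
\]

\[
\operatorname{Var}(s) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k,
\qquad
\operatorname{Var}\!\left(\frac{s}{\sqrt{d_k}}\right) = 1
\]

So an unscaled score has standard deviation \(\sqrt{d_k}\), which grows with the key dimension, while dividing by \(\sqrt{d_k}\) keeps the softmax input at unit variance.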

  4. 04

    5. Multi-Head Attention

    Instead of one attention computation, the model runs \(h\) parallel attention heads, each with its own \(Q,K,V\) projections on a slice of \(d_{\text{model}}\).
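A minimal NumPy sketch of the head split follows. It assumes the common layout in which full-width projection matrices are cut into \(h\) column slices of width \(d_{\text{model}}/h\) (equivalent to giving each head its own smaller projection), and it assumes the standard output projection `W_O` applied after concatenation, which the excerpt above does not mention. All sizes and weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    n, d_model = X.shape
    d_head = d_model // h                     # width of each head's slice

    def split(M):
        # (n, d_model) -> (h, n, d_head): one slice of the feature axis per head
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model) # heads side by side again
    return concat @ W_O, weights

rng = np.random.default_rng(0)
n, d_model, h = 6, 8, 2                       # hypothetical toy sizes
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
```

The output keeps the input shape `(n, d_model)`, while `weights` holds one `n × n` attention matrix per head.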

  5. 05

    6. The Residual Connection + Layer Normalization

    After every sub-layer (attention or FFN), the Transformer applies a skip connection (residual add) then Layer Normalization.
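The add-then-normalize order described above corresponds to \(\text{LayerNorm}(x + \text{Sublayer}(x))\). A minimal NumPy sketch, assuming the usual per-token normalization over the feature axis with a learned scale `gamma` and shift `beta`; the linear map used as the sub-layer is a hypothetical stand-in for attention or the FFN.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token (row) over its features, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    # Residual add first, Layer Normalization second.
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
n, d_model = 4, 8                               # hypothetical toy sizes
x = rng.normal(size=(n, d_model))
W = rng.normal(size=(d_model, d_model))         # stand-in sub-layer weights
gamma, beta = np.ones(d_model), np.zeros(d_model)

y = add_and_norm(x, lambda t: t @ W, gamma, beta)
```

With `gamma = 1` and `beta = 0`, every output row has mean close to 0 and variance close to 1, regardless of how large the sub-layer's output was.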

  6. 06

    7. The Feed-Forward Network (FFN)

    The FFN is applied identically and independently to each token position, after the attention residual block.
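A minimal NumPy sketch of a position-wise FFN, assuming the two-layer form with a ReLU in between, \(\text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2\), as in the original Transformer (the excerpt above does not name the activation). Sizes and weights are hypothetical.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)   # expand to d_ff, apply ReLU
    return hidden @ W2 + b2                 # project back to d_model

rng = np.random.default_rng(0)
n, d_model, d_ff = 3, 8, 32                 # hypothetical toy sizes
x = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# All positions at once ...
out = ffn(x, W1, b1, W2, b2)

# ... versus one position at a time with the same weights.
row_by_row = np.stack([ffn(x[i], W1, b1, W2, b2) for i in range(n)])
```

The two results match, which is what "identically and independently" means in practice: the same weights are used at every position, and no position's output depends on any other position.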

  7. 07

    9. Causal Masking (Decoder Self-Attention)

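    During training the entire target sequence is fed at once. To keep position \(i\) from attending to later positions (which would leak the tokens it is supposed to predict), a mask is added to the attention scores before softmax: \(0\) where attention is allowed and \(-\infty\) where it is not.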
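A minimal NumPy sketch of the masking step, assuming a single head and hypothetical random queries and keys. The mask is additive: entries strictly above the diagonal are \(-\infty\), so softmax turns them into exact zeros.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(n):
    # True strictly above the diagonal, i.e. where column j > row i (a future token).
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

rng = np.random.default_rng(0)
n, d_k = 4, 8                                 # hypothetical toy sizes
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))

scores = Q @ K.T / np.sqrt(d_k)
weights = softmax(scores + causal_mask(n))    # mask is added before softmax
```

Row \(i\) of `weights` is nonzero only in columns \(0..i\), and each row still sums to 1. The first token can attend only to itself, so its weight on itself is 1.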

  8. 08

    10. Cross-Attention: Bridging Encoder to Decoder

    In the decoder, after masked self-attention, each decoder token accesses the encoder's processed source representation. This is Cross-Attention.
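A minimal single-head NumPy sketch of cross-attention, assuming the standard arrangement in which queries come from the decoder states and keys and values come from the encoder output. The function name, sizes, and random weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec, enc, W_Q, W_K, W_V):
    Q = dec @ W_Q                             # queries from decoder tokens
    K = enc @ W_K                             # keys from encoder output
    V = enc @ W_V                             # values from encoder output
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (n_dec, n_enc)
    return weights @ V, weights

rng = np.random.default_rng(0)
n_dec, n_enc, d_model, d_k = 3, 5, 8, 4       # hypothetical toy sizes
dec = rng.normal(size=(n_dec, d_model))       # decoder states after masked self-attention
enc = rng.normal(size=(n_enc, d_model))       # encoder's processed source representation
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = cross_attention(dec, enc, W_Q, W_K, W_V)
```

The attention matrix is rectangular, one row per decoder token and one column per source token, and no causal mask is needed here because the whole source sequence is already available.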