Notebook excerpts
01
2. Text Representation: From Words to Matrices
Each token ID indexes a row from the learned weight matrix \(W_{\text{emb}} \in \mathbb{R}^{|V| \times d_{\text{model}}}\).
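A minimal sketch of the lookup, assuming NumPy and a randomly initialised matrix in place of the learned \(W_{\text{emb}}\). The sizes and token IDs are illustrative and do not come from the note.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 4                      # illustrative sizes (assumed)
W_emb = rng.normal(size=(vocab_size, d_model))   # random stand-in for the learned matrix

token_ids = np.array([3, 7, 3])                  # hypothetical IDs for a 3-token input
X = W_emb[token_ids]                             # row lookup, shape (seq_len, d_model)

# Equivalent view: one-hot rows multiplied by W_emb select the same rows.
one_hot = np.eye(vocab_size)[token_ids]
X_via_matmul = one_hot @ W_emb
```

Repeated token IDs map to identical rows of `X`, which is why positional information has to be supplied separately.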
02
3. Deep Dive: Q, K, V Projections
The input matrix \(X\) is projected into three distinct semantic spaces through learned weight matrices:
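A minimal sketch of the three projections, assuming NumPy. The names `W_Q`, `W_K`, `W_V` and all sizes are assumptions made for illustration, with random values standing in for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4          # illustrative sizes (assumed)
X = rng.normal(size=(seq_len, d_model))  # one row per token

# Random stand-ins for the learned projection matrices (names assumed).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# The same input X is projected three different ways.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # each has shape (seq_len, d_k)
```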
03
4. Scaled Dot-Product Attention (The Full Equation)
Assume each component of \(q,k\) is independent with mean \(0\) and variance \(1\). Then the score is a sum of \(d_k\) independent products:
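Under these assumptions each product \(q_i k_i\) has mean \(0\) and variance \(1\), so the sum has variance \(d_k\) and standard deviation \(\sqrt{d_k}\). Dividing by \(\sqrt{d_k}\) brings the variance back to \(1\). The sketch below is a quick empirical check of that claim, assuming NumPy and standard normal components. The sample count and \(d_k = 64\) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_samples = 64, 20_000            # illustrative values (assumed)

# Independent components with mean 0 and variance 1.
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

scores = np.sum(q * k, axis=1)         # raw dot products q . k, one per sample
scaled = scores / np.sqrt(d_k)         # the 1/sqrt(d_k) scaling

raw_var = scores.var()                 # should land near d_k
scaled_var = scaled.var()              # should land near 1
```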
04
5. Multi-Head Attention
Instead of one attention computation, the model runs \(h\) parallel attention heads, each with its own \(Q,K,V\) projections on a slice of \(d_{\text{model}}\).
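One way to sketch the head split, assuming NumPy. Each head gets a \(d_{\text{model}}/h\)-wide slice of the projected matrices. The output projection `W_O` and the helper names are assumptions not stated in the excerpt, and the weights are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    seq_len, d_model = X.shape
    d_k = d_model // h                         # width of each head's slice

    def split(M):
        # (seq_len, d_model) -> (h, seq_len, d_k)
        return M.reshape(seq_len, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                 # (h, seq_len, d_k)
    # Concatenate the heads back to (seq_len, d_model), then mix them with W_O.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O, weights

rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 8, 2                  # illustrative sizes (assumed)
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
```

Here `weights` holds one attention matrix per head, so different heads can attend to different positions for the same token.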
05
6. The Residual Connection + Layer Normalization
After every sub-layer (attention or FFN), the Transformer applies a skip connection (residual add) then Layer Normalization.
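A minimal sketch of the add-then-normalize step in the order described above, assuming NumPy. The sub-layer here is a hypothetical linear map standing in for attention or the FFN. `gamma`, `beta`, and `eps` are the usual layer-norm scale, shift, and stability constant, assumed rather than taken from the note.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    # Residual add first, layer normalization second.
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                        # illustrative sizes (assumed)
x = rng.normal(size=(seq_len, d_model))
W = rng.normal(size=(d_model, d_model))        # hypothetical sub-layer weights
gamma, beta = np.ones(d_model), np.zeros(d_model)

y = add_and_norm(x, lambda t: t @ W, gamma, beta)
```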
06
7. The Feed-Forward Network (FFN)
The FFN is applied identically and independently to each token position, after the attention residual block.
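A sketch of a position-wise feed-forward block, assuming NumPy, a two-layer form with a ReLU in between, and an inner width `d_ff` four times \(d_{\text{model}}\). All three are common choices assumed here, not details from the excerpt. The last line checks the "independently" claim by running a single row on its own.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)      # ReLU non-linearity (assumed)
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32              # illustrative sizes (assumed)
X = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

Y = ffn(X, W1, b1, W2, b2)                     # all positions at once
row_2_alone = ffn(X[2], W1, b1, W2, b2)        # one position in isolation
```

Because the same weights are used at every position and no position reads another, `Y[2]` matches `row_2_alone`.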
07
9. Causal Masking (Decoder Self-Attention)
During training the entire target sequence is fed at once. To prevent cheating (seeing future tokens), a mask is added to the attention scores before softmax.
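A minimal sketch of the additive mask, assuming NumPy and random scores in place of real \(QK^\top/\sqrt{d_k}\) values. Future positions receive \(-\infty\), so softmax assigns them zero weight.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4                                    # illustrative length (assumed)
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))   # stand-in for the attention scores

# True strictly above the diagonal, i.e. where the key position is in the future.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
mask = np.where(future, -np.inf, 0.0)

weights = softmax(scores + mask, axis=-1)      # mask is added before softmax
```

Row `i` of `weights` is non-zero only in columns `0..i`, so the first token can attend only to itself.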
08
10. Cross-Attention: Bridging Encoder to Decoder
In the decoder, after masked self-attention, each decoder token accesses the encoder's processed source representation. This is Cross-Attention.
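A sketch of single-head cross-attention, assuming NumPy. Queries come from the decoder states, while keys and values come from the encoder output. The function name, weight names, and sizes are hypothetical, and the weights are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec, enc, W_Q, W_K, W_V):
    Q = dec @ W_Q                              # queries from decoder tokens
    K = enc @ W_K                              # keys from encoder output
    V = enc @ W_V                              # values from encoder output
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (tgt_len, src_len)
    return weights @ V, weights

rng = np.random.default_rng(0)
src_len, tgt_len, d_model, d_k = 6, 3, 8, 4    # illustrative sizes (assumed)
enc = rng.normal(size=(src_len, d_model))      # stand-in for the encoder output
dec = rng.normal(size=(tgt_len, d_model))      # stand-in for decoder states
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = cross_attention(dec, enc, W_Q, W_K, W_V)
```

The attention matrix is rectangular: one row per target token and one column per source token. No causal mask is applied, since the whole source sequence is available to the decoder.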