Field Notes
Things I’m reading, deriving, building toward — kept here so I can find them again, and so you can read along.
Featured · May 18, 2026
Autoencoders & VAEs, Visualized
A working tour of every operation inside an autoencoder — encoder, bottleneck, decoder, distributions, the reparameterization trick, KL divergence, and the ELBO — built so each step can be watched and replayed.
- Deep Learning
- Generative
- VAE
- Representation Learning
02 · May 27, 2026 · 40 min
The Transformer — Architecture & Mathematics
A from-scratch derivation of the Transformer with worked numeric vectors at every step — token + positional embeddings, scaled dot-product and multi-head attention (with the softmax Jacobian and the √dₖ variance argument), residual + LayerNorm, the GELU feed-forward, causal masking, and cross-attention, traced through full encoder and decoder blocks.
- Deep Learning
- Transformers
- Attention
03 · May 27, 2026 · 30 min
CNN — Architecture & Mathematics
Convolutional networks end to end — the convolution arithmetic with explicit padding, its backward pass as a flipped-kernel convolution, ReLU and BatchNorm, pooling with the receptive-field recurrence, depthwise-separable factorization, and the global-average-pool + softmax classifier head.
- Deep Learning
- Computer Vision
- CNN
04 · May 27, 2026 · 18 min
RNN — Architecture & Mathematics
The vanilla recurrent cell in full — the tanh state recurrence, backpropagation through time, the single-step Jacobian diag(1−h²)·Wₕ, and a rigorous spectral-norm treatment of why gradients vanish or explode, set up as the motivation for gating.
- Deep Learning
- Sequence Models
- RNN
05 · May 27, 2026 · 20 min
LSTM — Architecture & Mathematics
Long Short-Term Memory derived from the vanishing-gradient problem — the dual state, the forget / input / output gates, the additive cell-state update, and the constant-error-carousel proof that ∂Cₜ/∂Cₜ₋₁ = diag(fₜ) keeps gradients flowing across long sequences.
- Deep Learning
- Sequence Models
- LSTM
06 · May 27, 2026 · 18 min
GRU — Architecture & Mathematics
The Gated Recurrent Unit as a leaner LSTM — update and reset gates, convex state interpolation with a boundedness proof, the full ∂hₜ/∂hₜ₋₁ expansion, and a side-by-side parameter and gradient-path comparison with the LSTM.
- Deep Learning
- Sequence Models
- GRU
07 · May 17, 2026 · 22 min
Bagging & Boosting, Visualized
Ensemble learning, end to end — bootstrap aggregation, random forests, AdaBoost, gradient boosting, and the bias-variance tradeoff that makes "wisdom of crowds" mathematically true.
- Classical ML
- Ensembles
- Bias-Variance
08 · May 16, 2026 · 18 min
Big O Complexity, Visualized
The hidden cost of every algorithm. What it really means when we say "this runs in O(n log n)" — explained with race tracks, search games, and real numbers you can compare side by side.
- Computer Science
- Algorithms
- Complexity
09 · May 14, 2026 · 30 min
LLM & RAG Evaluation, Visualized
A 39-section tour of how language models and retrieval systems are measured — perplexity, BLEU/ROUGE/METEOR, BERTScore, the LLM benchmark zoo (MMLU, HellaSwag, GSM8K, HumanEval, MT-Bench, Chatbot Arena), the IR stack (Hit@K, MRR, NDCG), and the RAG triad / RAGAS framing.
- LLM
- RAG
- Evaluation
10 · May 13, 2026 · 35 min
Activation & Loss Functions, Visualized
The two function families that drive deep learning, end to end — from sigmoid to SwiGLU on the activation side, MSE through cross-entropy to DPO/PPO/KTO on the loss side, plus task playbooks for binary, multi-class, multi-label, and regression heads.
- Deep Learning
- Activations
- Losses
11 · May 12, 2026 · 22 min
Classification & Regression Metrics, Visualized
A walk through the CS 229 tips-and-tricks cheat sheet: confusion matrix, precision/recall/F1, ROC and AUC for classification; sums of squares, R² / adjusted R², Mallows’s Cp, AIC, BIC for regression. What each one measures, and when it lies.
- CS 229
- Metrics
- Classification
12 · May 11, 2026 · 20 min
RNN, LSTM, GRU — Cells, Gates, and Data Flow
CS 229 companion 05. Interactive diagrams for the vanilla RNN cell, BPTT and the vanishing-gradient pathology, then the LSTM gates (forget / input / output, the cell-state highway) and the GRU’s reset / update simplification — with a side-by-side parameter comparison.
- CS 229
- Sequence Models
- RNN/LSTM/GRU
13 · May 8, 2026 · 25 min
From Derivatives to Machine Learning — A Reference
A field-notes walk from differentiation through gradient descent to backprop, set in serif paper typography. Built as a self-contained reference for ML foundations.
- Math
- ML Foundations
- Reference