Notebook excerpts
A plain-text scan of every section in this note — the interactive, fully-styled version is in the reader above. Use whichever helps.
01
§ 01 Why compress at all?
A 28×28 grayscale image of a hand-written digit lives in a 784-dimensional space — one number per pixel. But the set of valid digits is much, much smaller than 784 dimensions allow. The space of "looks-like-a-real-digit" is a thin, curved manifold inside a vast volume of pixel noise.
02
§ 02 Anatomy of an autoencoder
Three pieces, in order: an encoder h_φ that maps an input x down to a low-dimensional code z; a bottleneck where the entire input is forced to live in that compressed code; and a decoder f_θ that maps z back up to a reconstruction x̂ in the original space. The two networks are trained jointly to minimize the difference between x and x̂.
03
§ 03 The encoder, layer by layer
Each encoder layer is the standard machinery: a learned linear projection, a bias, and a non-linearity (often ReLU or tanh). The dimensionality shrinks from layer to layer; for MNIST a typical descent is 784 → 256 → 64 → z_dim. For images, the linear layers are often replaced by strided convolutions that combine spatial down-sampling with feature extraction in a single step.
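A minimal sketch of that descent in plain Python, assuming dense layers only. The helper names (`dense`, `init_layer`, `build_encoder`, `encode`), the random untrained weights, and the choice to leave the final code layer linear are all illustrative assumptions, not something the note specifies.

```python
import random

def relu(v):
    return max(0.0, v)

def dense(x, weights, biases, activation):
    """One layer: learned linear projection plus bias, then a non-linearity."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    # Random, untrained weights; the scale is an arbitrary illustrative choice.
    scale = (1.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_encoder(sizes, seed=0):
    rng = random.Random(seed)
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def encode(x, layers):
    for i, (weights, biases) in enumerate(layers):
        is_last = i == len(layers) - 1
        # Assumption: ReLU on hidden layers, no activation on the code layer.
        activation = (lambda v: v) if is_last else relu
        x = dense(x, weights, biases, activation)
    return x

sizes = [784, 256, 64, 2]          # 784 -> 256 -> 64 -> z_dim, with z_dim = 2
layers = build_encoder(sizes)
x = [0.5] * 784                    # stand-in for a flattened 28x28 image
z = encode(x, layers)
print(len(z))
```

The only structural point is the shrinking list of layer sizes; a convolutional encoder would swap `dense` for strided convolutions.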
04
§ 04 The bottleneck
The bottleneck is just a vector, typically a few dozen floats at most. For a typical image autoencoder, z ∈ ℝ^32 or z ∈ ℝ^64. For visualization-oriented examples (like the ones in this post) you can squeeze all the way to z ∈ ℝ^2 so you can see the latent space on a flat plot.
05
§ 05 The decoder
The decoder mirrors the encoder: a series of layers that expand the code z back up to the original input shape. It is a separately learned network with its own weights, not a mathematical inverse of the encoder. For image data, transposed convolutions are common; for tabular or flattened input, plain dense layers suffice.
06
§ 06 Reconstruction loss
An autoencoder is trained by minimizing how different the output is from the input. For continuous data, the canonical choice is mean-squared error; for binary or [0,1]-bounded pixel data, binary cross-entropy is more common.
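The two losses named above can be sketched in a few lines of plain Python. The function names and the clamping constant `eps` are assumptions made for illustration; deep-learning frameworks ship their own versions.

```python
import math

def mse(x, x_hat):
    """Mean-squared error, averaged over the elements of one example."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def bce(x, x_hat, eps=1e-7):
    """Binary cross-entropy for targets and predictions in [0, 1]."""
    total = 0.0
    for a, b in zip(x, x_hat):
        b = min(max(b, eps), 1.0 - eps)   # clamp to avoid log(0)
        total -= a * math.log(b) + (1.0 - a) * math.log(1.0 - b)
    return total / len(x)

print(mse([0.0, 1.0], [0.5, 0.5]))
print(bce([1.0], [0.5]))
```

Binary cross-entropy only makes sense when the decoder output is squashed into (0, 1), for example by a sigmoid.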
07
§ 07 Training loop, animated
Training is the standard SGD ritual: forward pass, compute loss, backward pass, update weights. The only special thing about an autoencoder is that the input and the target are the same tensor — you're not supervising against external labels, you're supervising the model against itself.
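To make the "input is the target" point concrete, here is a deliberately tiny sketch: a linear autoencoder with tied weights, a 2-D input and a 1-D code, trained by hand-written SGD. The toy data, the tied-weight design, the learning rate and every name below are assumptions chosen to keep the gradient derivable by hand; a real model would use a framework's autograd.

```python
def reconstruct(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x))   # encoder: 2-D input -> 1-D code
    return z, [wi * z for wi in w]             # decoder reuses w (tied weights)

def loss(w, x):
    _, x_hat = reconstruct(w, x)
    return sum((h - xi) ** 2 for h, xi in zip(x_hat, x))   # target is x itself

def sgd_step(w, x, lr):
    # Forward pass
    z, x_hat = reconstruct(w, x)
    r = [h - xi for h, xi in zip(x_hat, x)]    # residual x_hat - x
    # Backward pass: dL/dw_j = 2 * (r_j * z + (r . w) * x_j)
    rw = sum(ri * wi for ri, wi in zip(r, w))
    grad = [2.0 * (ri * z + rw * xi) for ri, xi in zip(r, x)]
    # Weight update
    return [wi - lr * gi for wi, gi in zip(w, grad)]

data = [[t, t] for t in (-1.0, -0.5, 0.5, 1.0)]   # points on the line x1 = x2
w = [0.5, 0.1]
initial = sum(loss(w, x) for x in data)
for epoch in range(200):
    for x in data:
        w = sgd_step(w, x, lr=0.05)
final = sum(loss(w, x) for x in data)
print(initial, final, w)
```

Because the data lies on a line, a one-number code is enough, and the learned `w` ends up pointing along that line.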
08
§ 08 The latent space of an autoencoder
Train an autoencoder with a 2D bottleneck on MNIST, then plot every training image's code on the plane. The picture you get is informative but ugly: clusters of digits in roughly the right places, but with holes between them, irregular shapes, and arbitrary scales.
09
§ 09 Useful variants — sparse, denoising, contractive
Before VAEs, the field invented several flavours of autoencoder that add structure to the latent code without going fully probabilistic.
10
§ 10 Why a standard autoencoder cannot generate
1. The latent space has no known distribution. You have no idea what shape it takes. So you don't know how to sample from it. Try a uniform distribution and you'll cover regions the encoder never visited; try a Gaussian and most samples will still miss the actual data manifold.
11
§ 11 From points to distributions
In a standard autoencoder, the encoder maps an input to one specific point z in latent space. A variational autoencoder makes the encoder output the parameters of a distribution instead, typically a Gaussian with mean μ and standard deviation σ. Then z is sampled from that distribution before being passed to the decoder.
12
§ 12 VAE as a generative story
Before diving into the inference machinery, here is the imagined data-generating process that the VAE assumes — the story the model believes about how the data was produced.
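In the usual formulation the story has two sampling steps. The standard normal prior written here is the conventional choice and is an assumption, since this excerpt does not spell the prior out:

```latex
\begin{aligned}
z &\sim p(z) = \mathcal{N}(0, I) && \text{draw a latent code from the prior} \\
x &\sim p_\theta(x \mid z)       && \text{draw an observation from the decoder's distribution}
\end{aligned}
```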
13
§ 13 The intractable posterior
Given a fixed decoder θ, two things would be nice to know for each observed x:
14
§ 14 The variational family
Pick a family of distributions Q we can sample from and evaluate easily. Inside that family, find the specific distribution that best approximates the true posterior. For VAEs, the standard choice is a diagonal-covariance Gaussian:
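Written out in the standard notation, with μ_φ and σ_φ denoting the two encoder outputs (symbols assumed here to match the encoder parameters φ used elsewhere in the note):

```latex
q_\phi(z \mid x) = \mathcal{N}\!\left(z;\ \mu_\phi(x),\ \operatorname{diag}\!\big(\sigma_\phi^2(x)\big)\right)
```

Diagonal covariance means each latent dimension gets its own variance and the dimensions are treated as independent given x.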
15
§ 15 KL divergence, visualized
The recipe for "best approximation" needs a notion of closeness between distributions. Variational inference uses Kullback-Leibler divergence:
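The textbook definition, for two densities q and p over z:

```latex
\mathrm{KL}\big(q \,\|\, p\big)
  = \mathbb{E}_{z \sim q}\!\left[\log \frac{q(z)}{p(z)}\right]
  = \int q(z)\, \log \frac{q(z)}{p(z)}\, dz
```

It is non-negative, equals zero only when q and p agree almost everywhere, and is not symmetric in its two arguments.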
16
§ 16 The ELBO — Evidence Lower Bound
Putting it all together: we want to maximise the log marginal likelihood log p_θ(x), which we cannot compute. The trick is a chain of algebra that introduces q_φ and decomposes the log-likelihood:
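The standard form of that decomposition, assuming a prior p(z) and the approximate posterior q_φ(z | x):

```latex
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
      - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\text{ELBO}}
  \;+\; \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)
```

The last term is a KL divergence and therefore non-negative, so the bracketed quantity is a lower bound on log p_θ(x). Maximising it over φ also tightens the gap to the true posterior.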
17
§ 17 Encoder outputs μ and log σ²
The encoder of a VAE has two output heads instead of one. Each head produces a vector of size z_dim: one for the mean μ, one for the log-variance log σ². (We predict log-variance, not variance, because log σ² can take any real value, so the head needs no positivity constraint; the variance recovered as exp(log σ²) is always positive, which is easier to learn and numerically safer.)
18
§ 18 The reparameterization trick
Here is the trick that makes the whole thing trainable. After the encoder produces μ and σ², we need to sample z ~ N(μ, σ²) and pass it to the decoder. But sampling is stochastic, and backpropagation cannot push gradients through a raw random draw back to μ and σ. We'd get stuck.
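The usual way out is to draw the noise separately and write z as a deterministic function of μ, σ and that noise: z = μ + σ · ε with ε ~ N(0, 1). A minimal plain-Python sketch, assuming the encoder hands back a mean and a log-variance per latent dimension (the function name and the `rng` argument are illustrative):

```python
import math
import random

def reparameterize(mu, logvar, rng):
    """Return z = mu + sigma * eps, with eps drawn from N(0, 1) per dimension."""
    z = []
    for m, lv in zip(mu, logvar):
        sigma = math.exp(0.5 * lv)     # log-variance -> standard deviation
        eps = rng.gauss(0.0, 1.0)      # all randomness lives in eps
        z.append(m + sigma * eps)      # deterministic in mu and sigma
    return z

rng = random.Random(0)
mu, logvar = [1.0, -2.0], [0.0, math.log(0.25)]
samples = [reparameterize(mu, logvar, rng) for _ in range(20000)]
mean0 = sum(s[0] for s in samples) / len(samples)
mean1 = sum(s[1] for s in samples) / len(samples)
print(mean0, mean1)
```

Since ε does not depend on the encoder, the derivatives of z are simply ∂z/∂μ = 1 and ∂z/∂σ = ε, which is what lets gradients reach the encoder.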
19
§ 19 Sampling, then decoding
Once z is drawn via reparameterization, the decoder runs exactly as in a standard autoencoder: a mirror network that expands z back into a reconstruction. The only thing the decoder "sees" is a single vector; it has no idea that vector came from a distribution rather than a fixed code.
20
§ 20 Full forward pass, animated
Now we have all the pieces. Stitched together, a complete VAE forward pass looks like this:
21
§ 21 The two-term loss in practice
The full training loss for a single example is the negative ELBO, which we already saw:
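A plain-Python sketch of the two terms for one example, assuming a Bernoulli (binary cross-entropy) reconstruction term, a diagonal Gaussian encoder, and a standard normal prior, for which the KL term has the closed form −½ Σ (1 + log σ² − μ² − σ²). Function names are illustrative.

```python
import math

def reconstruction_bce(x, x_hat, eps=1e-7):
    """Summed binary cross-entropy between input x and reconstruction x_hat."""
    total = 0.0
    for a, b in zip(x, x_hat):
        b = min(max(b, eps), 1.0 - eps)   # clamp to avoid log(0)
        total -= a * math.log(b) + (1.0 - a) * math.log(1.0 - b)
    return total

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def negative_elbo(x, x_hat, mu, logvar):
    return reconstruction_bce(x, x_hat) + kl_to_standard_normal(mu, logvar)

print(kl_to_standard_normal([1.0], [0.0]))
print(negative_elbo([1.0], [0.5], [0.0], [0.0]))
```

The KL term is zero exactly when the encoder outputs μ = 0 and log σ² = 0, that is, when the approximate posterior equals the prior.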
22
§ 22 β-VAE — the regularization knob
The standard VAE weights both loss terms equally. The β-VAE introduces a tunable knob that controls how strongly the KL regulariser is enforced:
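In the usual notation, assuming a prior p(z) and approximate posterior q_φ(z | x), the β-weighted loss for one example is:

```latex
\mathcal{L}_\beta(x)
  = -\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  \;+\; \beta \,\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```

Setting β = 1 recovers the standard VAE loss; β > 1 pushes the codes harder toward the prior at the cost of reconstruction quality, and β < 1 does the opposite.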
23
§ 23 Sweeping the latent space
Once a VAE is trained, the decoder can be run on any point in latent space — not just codes produced by the encoder. With a 2D latent space, you can visualise the entire learned manifold by walking a grid of z values and decoding each one.
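A sketch of the grid walk in plain Python. The range of ±3 (roughly three prior standard deviations), the row ordering, and `toy_decoder` are assumptions; `toy_decoder` is a hypothetical stand-in for a trained decoder, used only so the example runs.

```python
def latent_grid(n, lo=-3.0, hi=3.0):
    """n x n grid of 2-D latent points; rows run top to bottom, columns left to right."""
    step = (hi - lo) / (n - 1)
    axis = [lo + i * step for i in range(n)]
    return [[(z1, z2) for z1 in axis] for z2 in reversed(axis)]

def toy_decoder(z):
    # Hypothetical placeholder: a trained decoder would return an image here.
    z1, z2 = z
    return [z1 + z2, z1 - z2]

def decode_grid(grid, decoder):
    return [[decoder(z) for z in row] for row in grid]

grid = latent_grid(5)
decoded = decode_grid(grid, toy_decoder)
print(grid[0][0], grid[-1][-1])
```

With a real decoder, each cell of `decoded` would be an image, and tiling them in grid order gives the familiar picture of the learned manifold.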
24
§ 24 Interpolation between samples
Pick any two real inputs x₁ and x₂. Encode them to z₁ and z₂. Walk a straight line between them in latent space and decode every point along the way. With a VAE, you get a smooth morph from one input to the other.
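The straight-line walk is plain linear interpolation between the two codes. A minimal sketch, assuming the codes are already available as lists of floats (the function name and the example values are illustrative):

```python
def interpolate(z1, z2, steps):
    """Return `steps` evenly spaced points from z1 to z2, endpoints included."""
    path = []
    for i in range(steps):
        t = i / (steps - 1)            # t runs from 0.0 to 1.0
        path.append([(1.0 - t) * a + t * b for a, b in zip(z1, z2)])
    return path

path = interpolate([0.0, 2.0], [4.0, -2.0], steps=5)
for z in path:
    print(z)
```

Each point on `path` would then be passed through the decoder to produce one frame of the morph.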
25
§ 25 VAE vs standard AE — side by side
The same architecture, the same dataset, different objectives. The right column adds the KL term; that's the only difference.
26
§ 26 A complete VAE in PyTorch
Everything we've covered, in one self-contained module. Read alongside the relevant section above as a kind of glossary.
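The module itself is not included in this plain-text scan. What follows is a minimal sketch of what such a module might look like, not the author's code: the layer sizes, the single hidden layer, the class and function names, the Adam optimizer and the random stand-in batch are all assumptions.

```python
import torch
from torch import nn
from torch.nn import functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=2):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.mu_head = nn.Linear(h_dim, z_dim)       # head 1: mean
        self.logvar_head = nn.Linear(h_dim, z_dim)   # head 2: log-variance
        self.dec_hidden = nn.Linear(z_dim, h_dim)
        self.dec_out = nn.Linear(h_dim, x_dim)

    def encode(self, x):
        h = F.relu(self.enc(x))
        return self.mu_head(h), self.logvar_head(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)                  # noise independent of the encoder
        return mu + eps * std

    def decode(self, z):
        h = F.relu(self.dec_hidden(z))
        return torch.sigmoid(self.dec_out(h))        # outputs in (0, 1) for BCE

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar, beta=1.0):
    """Negative ELBO: summed BCE reconstruction plus beta-weighted closed-form KL."""
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(8, 784)            # stand-in batch; real use would load flattened MNIST
x_hat, mu, logvar = model(x)
loss = vae_loss(x_hat, x, mu, logvar)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```

After training, generation amounts to `model.decode(torch.randn(n, 2))`, which decodes draws from the prior instead of encoder outputs.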
27
§ 27 Applications & further reading
The VAE is one of the cleanest examples in deep learning of an algorithm whose structure emerges logically from a probabilistic objective.
28
§ 28 Algorithm flows, step by step
Every operation in an autoencoder/VAE, written out as numbered procedures. Each step is the smallest meaningful unit — read top-to-bottom and you have the whole algorithm.
29
§ 29 Mathematical derivations — Q&A
Worked-out questions that appear in textbooks, PhD qualifying exams, and "explain it on a whiteboard" interviews.
30
§ 30 Interview questions — quick answers
The conceptual questions interviewers ask. Short, complete answers — the kind you'd give in 60 seconds at a whiteboard.