A working tour of every operation inside an autoencoder — encoder, bottleneck, decoder, distributions, the reparameterization trick, KL divergence, and the ELBO — built so each step can be watched and replayed.
A 28×28 grayscale image of a hand-written digit lives in a 784-dimensional space — one number per pixel. But the set of valid digits is much, much smaller than 784 dimensions allow. The space of "looks-like-a-real-digit" is a thin, curved manifold inside a vast volume of pixel noise.
An autoencoder's job is to learn a compact code that captures the intrinsic structure of your data. If valid digits really only need ten or twenty meaningful axes of variation, an autoencoder with a bottleneck of about that size should be able to learn those axes from data alone.
If you can compress a digit down to, say, 2 numbers without losing what makes it a digit, three useful things follow. First, you have a visualization — paint each digit as a point in 2D. Second, you have a compression scheme — store 2 floats instead of 784. Third, and most interestingly, you have an implicit model of the data: pick any 2 numbers, decode them, and (with the variational version we will build in Part II) something digit-like comes out.
This is the central premise: meaningful data lives on a low-dimensional manifold inside a high-dimensional ambient space, and a neural network can learn that manifold by being forced to bottleneck.
Three pieces, in order: an encoder hφ that maps an input x down to a low-dimensional code z; a bottleneck where the entire input is forced to live in that compressed code; and a decoder fθ that maps z back up to a reconstruction x̂ in the original space. The two networks are trained jointly to minimize the difference between x and x̂.
This encoder, bottleneck, decoder skeleton is the mental model for every autoencoder variant in this post. Everything that follows (denoising, sparse, variational) only changes what goes into the bottleneck or how it is regularized. The skeleton is constant.
Each encoder layer is the standard machinery: a learned linear projection, a bias, and a non-linearity (often ReLU or tanh). The dimensionality shrinks from layer to layer — for MNIST a typical descent is 784 → 256 → 64 → z_dim. For images, the linear layers are often replaced by strided convolutions that combine spatial down-sampling with feature extraction in a single step.
The encoder is doing lossy compression by gradient descent. Whatever survives the bottleneck must reconstruct the input well enough that the decoder can do its job. So the encoder learns to keep what matters for reconstruction and throw the rest away.
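As a minimal sketch of the encoder stack described above, the snippet below pushes a batch through the 784 → 256 → 64 → z_dim descent using NumPy. The initialisation scale, the helper names `init_layer` and `encode`, and the random stand-in batch are assumptions made for illustration, not part of the original post.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    # Small random weights and zero biases (an assumed initialisation).
    return rng.normal(0.0, 0.05, size=(n_in, n_out)), np.zeros(n_out)

def encode(x, layers):
    # Each layer is a linear projection plus bias; ReLU follows every layer
    # except the last, so the code z itself is unconstrained.
    h = x
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)
    return h

z_dim = 2
sizes = [784, 256, 64, z_dim]
encoder_layers = [init_layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

x = rng.random((5, 784))        # stand-in batch of 5 flattened 28x28 images
z = encode(x, encoder_layers)
print(z.shape)                  # (5, 2)
```

The weights here are untrained, so the codes carry no meaning yet; the point is only the shape of the computation.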
The bottleneck is just a vector, typically a handful of floats. For a typical image autoencoder, z ∈ ℝ³² or z ∈ ℝ⁶⁴. For visualization-oriented examples (like the ones in this post) you can squeeze all the way to z ∈ ℝ² so you can see the latent space on a flat plot.
The bottleneck size is a deliberate design choice. Too wide and you risk the network simply copying the input (the identity function is trivial to learn when the bottleneck is as wide as the input). Too narrow and reconstruction suffers. The "interesting" range is typically 10–100× narrower than the input.
The decoder is the encoder run in reverse — a series of layers that expand the code z back up to the original input shape. For image data, transposed convolutions are common; for tabular or flattened input, plain dense layers suffice.
One trick worth noting: the encoder and decoder do not need to be perfect mirrors of each other. Many implementations use a deeper decoder than encoder, or share no weights at all between them. They only need to compose well into something that returns the input.
An autoencoder is trained by minimizing how different the output is from the input. For continuous data, the canonical choice is mean-squared error; for binary or [0,1]-bounded pixel data, binary cross-entropy is more common.
Both losses are pixel-wise. They treat each output position independently and ask: "how close is the predicted value to the target value?" This is also one of the limitations of vanilla autoencoders for image data — pixel-wise loss is blind to perceptual mismatch (a one-pixel shift is "wrong everywhere" by MSE but obviously fine to a human). Modern systems often supplement with perceptual or adversarial losses, but for our purposes pixel-wise loss is enough.
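To make the pixel-wise blindness concrete, here is a small sketch of both losses applied to a toy 8×8 image. The stripe image is an invented example; the function names `mse` and `bce` are illustrative.

```python
import numpy as np

def mse(x, x_hat):
    # Mean-squared error, averaged over every pixel.
    return np.mean((x - x_hat) ** 2)

def bce(x, x_hat, eps=1e-7):
    # Binary cross-entropy; predictions are clipped to avoid log(0).
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    return -np.mean(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

# Toy image: a single vertical stripe in column 3.
img = np.zeros((8, 8))
img[:, 3] = 1.0
shifted = np.roll(img, 1, axis=1)   # same stripe, moved one pixel to the right
blank = np.zeros_like(img)          # the stripe erased entirely

print(mse(img, shifted))            # 0.25
print(mse(img, blank))              # 0.125
```

By MSE, the one-pixel shift scores twice as badly as deleting the stripe altogether, even though a human would call the shifted image nearly identical.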
Training is the standard SGD ritual: forward pass, compute loss, backward pass, update weights. The only special thing about an autoencoder is that the input and the target are the same tensor — you're not supervising against external labels, you're supervising the model against itself.
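The loop can be sketched end to end with a deliberately tiny model. The example below assumes a purely linear encoder and decoder, synthetic rank-2 data, full-batch gradient descent, and hand-written gradients; all of these are simplifications chosen to keep the sketch short, not the setup used elsewhere in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 512 points in 20-D that lie on a 2-D subspace.
N, D, k = 512, 20, 2
X = rng.standard_normal((N, k)) @ (rng.standard_normal((k, D)) / np.sqrt(D))

# Linear encoder and decoder (no biases, no non-linearity).
W_enc = rng.normal(0.0, 0.1, (D, k))
W_dec = rng.normal(0.0, 0.1, (k, D))
lr = 0.05
losses = []

for step in range(500):
    # Forward pass: the input is also the target.
    Z = X @ W_enc
    X_hat = Z @ W_dec
    R = X_hat - X
    losses.append(np.mean(np.sum(R ** 2, axis=1)))

    # Backward pass for L = (1/N) * sum_i ||x_i - x_hat_i||^2.
    dX_hat = 2.0 * R / N
    dW_dec = Z.T @ dX_hat
    dW_enc = X.T @ (dX_hat @ W_dec.T)

    # Plain gradient-descent update.
    W_enc -= lr * dW_enc
    W_dec -= lr * dW_dec

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the data really is two-dimensional and the bottleneck has two units, the reconstruction loss should fall substantially; no labels appear anywhere in the loop.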
Train an autoencoder with a 2D bottleneck on MNIST, then plot every training image's code on the plane. The picture you get is informative but ugly: clusters of digits in roughly the right places, but with holes between them, irregular shapes, and arbitrary scales.
This irregular layout is a perfectly fine outcome if you only want to reconstruct or compress. But it's a problem the moment you want to generate new samples by picking a random latent point and decoding it — most random points will land in the "no man's land" between clusters and produce nonsense. Section §11 onwards solves exactly this problem.
Before VAEs, the field invented several flavours of autoencoder that add structure to the latent code without going fully probabilistic.
Add an L1 penalty on the bottleneck activations: L = L_recon + λ · Σ |z_i|. Most bottleneck dimensions are pushed to zero for any given input, leaving only a small active subset. This is useful for feature discovery — each non-zero dimension is forced to specialise in one type of input pattern.
Corrupt the input with noise before feeding it in, but ask the network to reconstruct the clean input. The network can't simply memorise — it has to learn what's "essential" enough to survive noise.
Penalise the Jacobian of the encoder with respect to its input. Concretely: L = L_recon + λ · ||∂h/∂x||²_F. This makes the encoder less sensitive to small input changes — nearby inputs map to nearby codes. A different way to encourage smooth latent structure.
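The three regularisers above each add one small piece to the standard training step. The sketch below writes them as separate NumPy helpers; the one-layer tanh encoder used for the contractive penalty and the additive Gaussian corruption are assumptions picked because they keep the formulas closed-form.

```python
import numpy as np

def sparse_penalty(z, lam):
    # lam * sum_i |z_i|, averaged over the batch.
    return lam * np.mean(np.sum(np.abs(z), axis=1))

def corrupt(x, noise_std, rng):
    # Denoising setup: feed corrupt(x) to the encoder, keep clean x as target.
    return x + rng.normal(0.0, noise_std, size=x.shape)

def contractive_penalty(x, W, b, lam):
    # Assumed encoder: h = tanh(x W + b), so dh_j/dx_i = (1 - h_j^2) * W_ij
    # and ||dh/dx||_F^2 = sum_j (1 - h_j^2)^2 * sum_i W_ij^2.
    h = np.tanh(x @ W + b)
    w_sq = np.sum(W ** 2, axis=0)
    frob_sq = np.sum((1.0 - h ** 2) ** 2 * w_sq, axis=1)
    return lam * np.mean(frob_sq)

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.5, (4, 3))
b = np.zeros(3)
x = rng.random((6, 4))

print(sparse_penalty(np.tanh(x @ W + b), lam=0.1))
print(corrupt(x, noise_std=0.3, rng=rng).shape)
print(contractive_penalty(x, W, b, lam=0.1))
```

In each case the returned value is added to the reconstruction loss (or, for denoising, only the encoder input changes).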
The same architecture as a standard autoencoder, plus a regularizer.
An interpretable latent code where individual dimensions or local neighbourhoods carry meaning.
Even with all these tricks, the latent space remains discrete clusters in a continuous space. There's no principled way to sample from it.
That gap — between learning a useful code and being able to generate from it — is exactly the gap variational autoencoders close.
Three reasons, stacked on top of each other.
1. The latent space has no known distribution. You have no idea what shape it takes. So you don't know how to sample from it. Try a uniform distribution and you'll cover regions the encoder never visited; try a Gaussian and most samples will still miss the actual data manifold.
2. The latent space has gaps. Real points cluster around regions the encoder used; the in-between is undefined behaviour. Decoding a random point lands in undefined territory.
3. The latent space is not continuous in any guaranteed sense. Two nearby codes might decode to very different outputs because nothing in the loss function encouraged smoothness.
Variational autoencoders address all three of these in one move: they shape the latent space to be a known, smooth, continuous distribution by construction.
In a standard autoencoder, the encoder maps an input to one specific point z in latent space. A variational autoencoder makes the encoder output the parameters of a distribution instead — typically a Gaussian with mean μ and standard deviation σ. Then z is sampled from that distribution before being passed to the decoder.
That single change cascades through everything. The loss function changes (we now need to ensure the distributions themselves are well-behaved). The forward pass changes (we need to sample, which is non-differentiable in its naive form). The interpretation changes (the encoder is now doing approximate Bayesian inference). The rest of this part walks through each of those consequences in order.
Before diving into the inference machinery, here is the imagined data-generating process that the VAE assumes — the story the model believes about how the data was produced.
The story has two steps. Step 1: draw a latent code z ~ p(z) = N(0, I). Step 2: draw the observation x ~ pθ(x | z). Step 2 is just our decoder: fθ(z) produces the parameters of pθ(x | z), which for image data are usually the mean of a Gaussian or the probabilities of independent Bernoullis. Step 1 is the crucial piece: the latent variable is assumed to come from a standard Gaussian. That's our generative prior.
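The assumed data-generating process can be sketched as two lines of sampling. The decoder below is a hypothetical single sigmoid layer with random weights, standing in for a trained fθ, and the Bernoulli likelihood is one of the two options mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
z_dim, x_dim = 2, 784

# Hypothetical, untrained decoder weights; a real decoder would be learned.
W = rng.normal(0.0, 0.1, (z_dim, x_dim))
b = np.zeros(x_dim)

def decoder(z):
    # f_theta(z): per-pixel Bernoulli probabilities.
    return 1.0 / (1.0 + np.exp(-(z @ W + b)))

z = rng.standard_normal((4, z_dim))   # prior draw: z ~ N(0, I)
probs = decoder(z)                    # parameters of p(x | z)
x = rng.binomial(1, probs)            # observation draw: x ~ p(x | z)

print(z.shape, probs.shape, x.shape)
```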
If we believed this story exactly, training would be straightforward: just do maximum likelihood. Find the θ that maximises pθ(x1, ..., xn). The problem, as the next section shows, is that computing that probability requires solving an intractable integral. The whole machinery of variational inference exists to dodge that integral.
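Given a fixed decoder θ, two things would be nice to know for each observed x: the marginal likelihood pθ(x), which says how probable x is under the model, and the posterior pθ(z | x), which says which latent codes could have produced it.
By Bayes' theorem, the posterior is:
pθ(z | x) = pθ(x | z) · p(z) / pθ(x), where the evidence is pθ(x) = ∫ pθ(x | z) · p(z) dz.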
For a latent space of even modest dimensionality (50 dims, say), this integral has no closed form and Monte Carlo estimation would need an astronomical number of samples. This is the wall. Bayesian inference on a neural decoder is intractable by direct attack.
Variational inference is the workaround. Instead of computing the posterior, we approximate it with a simpler distribution qφ(z | x) that we deliberately design to be tractable — typically a Gaussian whose parameters are output by a neural network.
Pick a family of distributions Q we can sample from and evaluate easily. Inside that family, find the specific distribution that best approximates the true posterior. For VAEs, the standard choice is a diagonal-covariance Gaussian:
qφ(z | x) = N(z; μφ(x), diag(σ²φ(x)))
"Diagonal" means we assume the latent dimensions are independent given the input — a simplifying assumption that lets us describe the covariance with a single vector rather than a full matrix. It's not strictly true (latent dimensions might correlate), but the trade-off between expressiveness and tractability is worth it.
One vital subtlety: amortized inference. We could have given each data point its own φi — a separate approximate posterior per example — but we don't. We share one neural network across all data points, mapping each xi to its own (μi, σi). This is what makes inference fast and scalable; it's the whole reason VAEs are practical.
The recipe for "best approximation" needs a notion of closeness between distributions. Variational inference uses Kullback-Leibler divergence:
KL(q ‖ p) = Eq[log q(z) − log p(z)] = ∫ q(z) · log(q(z) / p(z)) dz
Two facts to keep in mind. First, KL divergence is not symmetric: KL(q || p) ≠ KL(p || q). The order matters. VAEs use KL(qφ(z|x) || p(z)), which is sometimes called "mode-seeking" because it harshly penalises q for placing mass where p has none. Second, KL between two diagonal Gaussians has a clean closed-form expression — no integral required.
For a diagonal-Gaussian q and a standard-Gaussian prior, the KL term collapses to a sum over latent dimensions:
KL(qφ(z | x) ‖ p(z)) = −½ · Σj (1 + log σ²j − μ²j − σ²j)
This is the term that pulls every qφ(z|x) toward the standard normal. Without it, the encoder would happily place each input far from every other input (zero KL constraint, easy reconstruction). With it, the encoder has to share a common region of latent space — which is precisely what makes the latent space well-behaved for generation.
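One way to build trust in the closed-form KL is to compare it against a brute-force Monte Carlo estimate of Eq[log q(z) − log p(z)]. The sketch below does that for an arbitrary two-dimensional example; the specific μ and log σ² values are made up for illustration.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims.
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0, axis=-1)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])
log_var = np.array([-0.5, 0.3])
sigma = np.exp(0.5 * log_var)

# Monte Carlo estimate: sample z ~ q, average log q(z) - log p(z).
z = mu + sigma * rng.standard_normal((200_000, 2))
log_q = -0.5 * np.sum(np.log(2 * np.pi) + log_var + ((z - mu) / sigma) ** 2, axis=1)
log_p = -0.5 * np.sum(np.log(2 * np.pi) + z ** 2, axis=1)

kl_mc = np.mean(log_q - log_p)
kl_exact = kl_to_standard_normal(mu, log_var)
print(f"closed form: {kl_exact:.4f}, Monte Carlo: {kl_mc:.4f}")
```

The two numbers should agree to a couple of decimal places, and the closed form is exactly zero when μ = 0 and log σ² = 0.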
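Putting it all together. We want to maximise the log marginal likelihood log pθ(x), which we cannot compute. The trick is a chain of algebra that introduces qφ and decomposes the log-likelihood:
log pθ(x) ≥ Eqφ(z|x)[log pθ(x, z) − log qφ(z | x)]
= Eqφ(z|x)[log pθ(x | z)] − KL(qφ(z | x) ‖ p(z))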
Read the second line as two opposing forces. The first term rewards the model when the decoder, fed a sample from qφ(z | x), reproduces x. This is the reconstruction term. The second term penalises the model when qφ drifts away from the prior p(z). This is the regulariser.
This is the entire VAE training objective. Every term is differentiable (after one more trick, in §18). The architecture is just two neural networks. Everything else — generative interpretation, smooth latent space, ability to sample new data — is a consequence of optimising this single loss.
The encoder of a VAE has two output heads instead of one. Each head produces a vector of size z_dim — one for the mean μ, one for the log-variance log σ². (We predict log-variance, not variance, because log-space is unbounded above and below — easier to learn and numerically safer.)
Here is the trick that makes the whole thing trainable. After the encoder produces μ and σ², we need to sample z ~ N(μ, σ²) and pass it to the decoder. But sampling is stochastic, and gradients cannot flow through stochastic operations. We'd get stuck.
The reparameterization trick "factors out" the randomness. Instead of sampling from N(μ, σ²) directly, we sample ε from a standard normal and transform it deterministically with μ and σ.
This isn't a numerical hack. It's a structural change to the computation graph: stochasticity is pushed outside the differentiable region. μ and σ are deterministic functions of x, so backpropagation works perfectly — gradients flow from the loss, through z, through μ and σ, back into the encoder weights, all while a single sample of ε sits placidly off to the side as input data.
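A short numerical sketch can confirm both halves of the claim: the transformed samples have the right distribution, and z is an ordinary differentiable function of μ and σ once ε is fixed. The values of μ, σ, and the fixed ε below are arbitrary choices for illustration.

```python
import numpy as np

def reparameterize(mu, log_var, eps):
    # Deterministic transform of (mu, log_var) given externally supplied noise.
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu, log_var = 1.5, np.log(0.25)       # so sigma = 0.5
eps = rng.standard_normal(100_000)    # all randomness lives here
z = reparameterize(mu, log_var, eps)
print(z.mean(), z.std())              # close to mu and sigma

# Finite-difference derivatives with one fixed noise value.
e, h = 0.7, 1e-6
sigma = np.exp(0.5 * log_var)
dz_dmu = (reparameterize(mu + h, log_var, e) - reparameterize(mu - h, log_var, e)) / (2 * h)
dz_dsigma = ((mu + (sigma + h) * e) - (mu + (sigma - h) * e)) / (2 * h)
print(dz_dmu, dz_dsigma)              # approximately 1 and eps
```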
Once z is drawn via reparameterization, the decoder runs exactly as in a standard autoencoder: a mirror network that expands z back into a reconstruction. The only thing the decoder "sees" is a single vector; it has no idea that vector came from a distribution rather than a fixed code.
That's a feature, not a bug. The randomness during training is what forces the decoder to be robust: it doesn't get to memorise specific codes, because the encoder will deliver slightly different codes for the same input every time. So the decoder learns to handle the neighbourhood around each input's mean code, not just the mean itself.
Now we have all the pieces. Stitched together, a complete VAE forward pass looks like this:
This is the same diagram that lives in every VAE paper — but seeing the signal move through it is the part that usually has to live in your head.
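In place of a diagram, the forward pass can be sketched as a single function with the tensor shapes annotated. The layer sizes (784 → 256 → 20), the one-hidden-layer decoder, and the random untrained weights are assumptions for illustration; `dense`, `affine`, and `vae_forward` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
B, x_dim, h_dim, z_dim = 8, 784, 256, 20

def dense(n_in, n_out):
    # Randomly initialised (untrained) weights and zero biases.
    return rng.normal(0.0, 0.05, (n_in, n_out)), np.zeros(n_out)

params = {
    "enc": dense(x_dim, h_dim),
    "mu": dense(h_dim, z_dim),
    "log_var": dense(h_dim, z_dim),
    "dec_h": dense(z_dim, h_dim),
    "dec_out": dense(h_dim, x_dim),
}

def affine(h, layer):
    W, b = layer
    return h @ W + b

def vae_forward(x, params, rng):
    h = np.maximum(affine(x, params["enc"]), 0.0)          # [B, 256] shared body
    mu = affine(h, params["mu"])                           # [B, 20] mean head
    log_var = affine(h, params["log_var"])                 # [B, 20] log-variance head
    eps = rng.standard_normal(mu.shape)                    # [B, 20] external noise
    z = mu + np.exp(0.5 * log_var) * eps                   # [B, 20] reparameterized sample
    h_dec = np.maximum(affine(z, params["dec_h"]), 0.0)    # [B, 256]
    x_hat = 1.0 / (1.0 + np.exp(-affine(h_dec, params["dec_out"])))  # [B, 784]
    return x_hat, mu, log_var, z

x = rng.random((B, x_dim))
x_hat, mu, log_var, z = vae_forward(x, params, rng)
print(x_hat.shape, mu.shape, log_var.shape, z.shape)
```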
The full training loss for a single example is the negative ELBO, which we already saw:
L(θ, φ; x) = −Eqφ(z|x)[log pθ(x | z)] + KL(qφ(z | x) ‖ p(z))
The two terms pull in opposite directions. The reconstruction loss wants each qφ(z | x) to be a tight, distinctive point — easy for the decoder to invert. The KL term wants every qφ(z | x) to look like the standard normal — overlapping, generic, indistinguishable. The model has to find the compromise.
The standard VAE weights both loss terms equally. The β-VAE introduces a tunable knob that controls how strongly the KL regulariser is enforced:
L = Lrecon + β · LKL
With β > 1 (stronger regularisation):
Latent dimensions are pushed harder toward independence. In some setups this encourages disentangled representations.
Generated samples look more on-distribution but lose fine detail.
With β < 1 (weaker regularisation):
Reconstructions sharpen, fine-grained detail is preserved.
But the latent space starts looking like a standard AE: clustered, gappy, hard to sample from.
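The β-weighted loss is short enough to write out in full. This sketch assumes a Bernoulli decoder (so the reconstruction term is binary cross-entropy summed over pixels) and the closed-form KL against N(0, I); the function name `vae_loss` and the toy inputs are illustrative.

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var, beta=1.0, eps=1e-7):
    # Reconstruction: BCE summed across pixels, one value per example.
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    recon = -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat), axis=1)
    # Regulariser: closed-form KL to N(0, I), summed across latent dims.
    kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0, axis=1)
    # Mean over the batch; beta = 1 gives the standard negative ELBO.
    return np.mean(recon + beta * kl), np.mean(recon), np.mean(kl)

x = np.array([[1.0, 0.0]])
x_hat = np.array([[0.9, 0.2]])
mu = np.array([[1.0, 0.0]])
log_var = np.zeros((1, 2))

for beta in (0.5, 1.0, 4.0):
    total, recon, kl = vae_loss(x, x_hat, mu, log_var, beta=beta)
    print(f"beta={beta}: total={total:.4f} recon={recon:.4f} kl={kl:.4f}")
```

Returning the two terms separately makes it easy to log them individually during training.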
Once a VAE is trained, the decoder can be run on any point in latent space — not just codes produced by the encoder. With a 2D latent space, you can visualise the entire learned manifold by walking a grid of z values and decoding each one.
This grid-walk is the canonical VAE visualisation. Each axis of the latent space corresponds to some aspect of variation in the training data.
Pick any two real inputs x1 and x2. Encode them to z1 and z2. Walk a straight line between them in latent space and decode every point along the way. With a VAE, you get a smooth morph from one input to the other.
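Both the grid walk and the straight-line interpolation reduce to building a list of z values and decoding each one. In the sketch below, `decode` is a hypothetical stand-in (a single random sigmoid layer) for a trained 2D-latent decoder, and the endpoints z1 and z2 are invented rather than produced by a real encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0.0, 1.0, (2, 784))    # stand-in for trained decoder weights

def decode(z):
    # Hypothetical decoder: maps [n, 2] codes to [n, 784] images.
    return 1.0 / (1.0 + np.exp(-(z @ W)))

def latent_grid(n, lim):
    # n x n grid of 2D codes covering [-lim, lim] on both axes.
    axis = np.linspace(-lim, lim, n)
    zx, zy = np.meshgrid(axis, axis)
    return np.stack([zx.ravel(), zy.ravel()], axis=1)

def tile(images, n):
    # Arrange [n*n, 784] decoded images into one [n*28, n*28] canvas.
    return images.reshape(n, n, 28, 28).transpose(0, 2, 1, 3).reshape(n * 28, n * 28)

def interpolate(z1, z2, steps):
    # Evenly spaced points on the straight line from z1 to z2.
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z1 + t * z2

canvas = tile(decode(latent_grid(n=5, lim=2.0)), n=5)
path = interpolate(np.array([-1.0, 0.5]), np.array([2.0, -1.5]), steps=7)
morph = decode(path)
print(canvas.shape, path.shape, morph.shape)
```

With a trained model, z1 and z2 would be the encoder means of two real inputs, and `canvas` could be passed to any image-plotting routine.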
The same architecture, the same dataset, different objectives. The right column adds the KL term; that's the only difference.
| Property | Standard AE | VAE |
|---|---|---|
| Encoder output | A single point z = hφ(x) | Distribution params (μ, σ²) over z |
| Bottleneck behaviour | Deterministic | Stochastic (z is sampled) |
| Loss function | Reconstruction only | Reconstruction + KL(q || p) |
| Latent space shape | Whatever shape gradient descent finds | Pulled toward N(0, I) |
| Can generate? | No — no defined sampling distribution | Yes — sample z ~ N(0, I), decode |
| Interpolation | Often produces blur or holes | Smooth, on-manifold |
| Use case | Compression, anomaly detection | Generation, smooth latent manipulation |
Everything we've covered, in one self-contained module. Read alongside the relevant section above as a kind of glossary.
| Domain | What the VAE provides |
|---|---|
| Image generation | Smooth manifold of faces, digits, characters; precursor to diffusion models. |
| Latent diffusion models | Stable Diffusion runs diffusion in a compressed latent space; that space is produced by a VAE encoder. |
| Anomaly detection | Reconstruction error flags samples that don't fit the learned distribution. |
| Drug & molecule design | A VAE over SMILES strings lets you search continuously for molecules. |
| Single-cell genomics | scVI models RNA-seq counts via VAE with zero-inflated negative binomial. |
| Speech & music | Hierarchical VAEs underpin modern generative audio. |
| Variant | Idea |
|---|---|
| β-VAE | Multiply the KL term by β; β > 1 encourages disentangled representations. |
| Conditional VAE | Encoder and decoder receive a conditioning signal y for controllable generation. |
| VQ-VAE | Replace the continuous latent with a discrete codebook. |
| NVAE | Hierarchical VAE that competes with GANs on sample quality. |
| IWAE | Tighter ELBO via multiple importance-weighted samples. |
| InfoVAE | Replace KL with MMD to dodge posterior collapse. |
The VAE is one of the cleanest examples in deep learning of an algorithm whose structure emerges logically from a probabilistic objective.
Every operation in an autoencoder/VAE, written out as numbered procedures. Each step is the smallest meaningful unit — read top-to-bottom and you have the whole algorithm.
Encoder hφ (input → bottleneck) and decoder fθ (bottleneck → output). Both with random weights. Bottleneck dimension is a fixed hyperparameter (e.g. 32).
Draw a batch {x1, ..., xB} from the training set. Same data both as input and as target.
Compute zi = hφ(xi) for every example. zi is a single vector in latent space.
Compute x̂i = fθ(zi) — the reconstruction.
MSE for continuous, BCE for [0, 1] outputs:
L = (1/B) · Σi ‖xi − x̂i‖²
Gradients flow from L backward through the decoder, then through the bottleneck, then through the encoder.
φ ← φ − η · ∇φ L, θ ← θ − η · ∇θ L. Adam, AdamW, or SGD — pick your favourite optimiser.
Loop steps 2-7 over the dataset for the desired number of epochs. Stop when validation loss plateaus.
Encoder body shared, plus two output heads producing μφ and log σ²φ. Decoder fθ. Pick latent dimension z_dim.
Draw {x1, ..., xB}. Both input and reconstruction target.
For every xi the encoder outputs two vectors:
μi, log σ²i = hφ(xi)
Sample εi ~ N(0, I) independently per example, then compute:
zi = μi + exp(½ · log σ²i) · εi
This is the only stochastic step. ε is an input, not a parameter, so gradients pass through μ and σ cleanly.
Compute the reconstruction x̂i = fθ(zi).
Closed-form between N(μ, σ²I) and N(0, I):
LKL = −½ · Σj (1 + log σ²j − μ²j − σ²j)
Total loss: L = Lrecon + β · LKL. β = 1 recovers the standard VAE.
Gradients flow: L → decoder → z → (μ, log σ²) → encoder body. ε sits to the side as a constant input.
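Loop steps 2-9. Watch both Lrecon and LKL: reconstruction loss should fall, while the KL term typically rises from near zero as the encoder starts using the latent code, after which the two trade off.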
Both are differentiable functions of x and φ. The goal: sample z ~ N(μ, σ²) in a way that lets gradients flow.
Predicting log-variance is numerically safer.
Same shape as z. No parameters depend on ε.
By construction z ~ N(μ, σ²). The whole expression is differentiable in μ and σ.
During backprop, ∂z/∂μ = 1 and ∂z/∂σ = ε. Both well-defined.
Intractable as-is because of the integral over z.
Multiply and divide by qφ(z | x) inside the integral.
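log pθ(x) = log ∫ qφ(z|x) · [pθ(x, z) / qφ(z|x)] · dz
log is concave, so by Jensen's inequality log E[Y] ≥ E[log Y]:
log pθ(x) ≥ Eq[log pθ(x, z) − log qφ(z | x)]
Split log pθ(x, z) = log pθ(x | z) + log p(z):
ELBO = Eq[log pθ(x | z)] − KL(qφ(z | x) ‖ p(z))
First term = reconstruction. Second term = regulariser. Maximising the ELBO ⇔ minimising the VAE loss.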
For generation you only need fθ.
Each fresh z draws a fresh sample.
A vector of means μ ∈ ℝ^J and a vector of log-variances log σ² ∈ ℝ^J.
Mean over examples in the minibatch. Add to the reconstruction loss.
Worked-out questions that appear in textbooks, PhD qualifying exams, and "explain it on a whiteboard" interviews.
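Using the identity log pθ(x) = log pθ(x) · 1 with 1 = ∫ qφ(z | x) dz:
log pθ(x) = Eq[log(pθ(x, z) / pθ(z | x))] = Eq[log(pθ(x, z) / qφ(z | x))] + Eq[log(qφ(z | x) / pθ(z | x))]
The first term is the ELBO. The second is a non-negative KL. So:
log pθ(x) = ELBO + KL(qφ(z | x) ‖ pθ(z | x))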
Consequences: (i) ELBO ≤ log pθ(x). (ii) Maximising ELBO over φ tightens the bound. (iii) Joint maximisation over (φ, θ) is reasonable approximate maximum likelihood.
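Consequences (i) and (ii) can be checked exactly on a toy model where the evidence is tractable. The sketch assumes p(z) = N(0, 1) and pθ(x | z) = N(z, 1), which gives p(x) = N(0, 2) and a true posterior N(x/2, ½); this linear-Gaussian model is an illustration chosen for its closed forms, not a model used elsewhere in this post.

```python
import math

def log_evidence(x):
    # Toy model: p(z) = N(0, 1), p(x|z) = N(z, 1)  =>  p(x) = N(0, 2).
    return -0.5 * math.log(4 * math.pi) - x * x / 4

def elbo(x, m, s2):
    # q(z|x) = N(m, s2); both ELBO terms are available in closed form here.
    recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    kl = 0.5 * (m * m + s2 - math.log(s2) - 1)
    return recon - kl

x = 1.0
gap_prior = log_evidence(x) - elbo(x, m=0.0, s2=1.0)       # q set to the prior
gap_exact = log_evidence(x) - elbo(x, m=x / 2, s2=0.5)     # q set to the true posterior

print(f"gap with q = prior:     {gap_prior:.4f}")
print(f"gap with q = posterior: {gap_exact:.2e}")
```

The gap is strictly positive for a mismatched q and vanishes (up to floating-point error) when q equals the true posterior, which is the KL term in the decomposition above.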
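For univariate q = N(μ, σ²) and p = N(0, 1):
KL(q ‖ p) = Eq[log q(z)] − Eq[log p(z)]
For a Gaussian, Eq[log q(z)] = −½ · (log(2πσ²) + 1). And Eq[log p(z)] = −½·log(2π) − ½·(μ² + σ²) using E[z²] = μ² + σ². Subtract:
KL(q ‖ p) = ½ · (μ² + σ² − log σ² − 1)
For J-dimensional diagonal q, sum over dimensions:
KL(q ‖ p) = ½ · Σj (μ²j + σ²j − log σ²j − 1)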
Direct sampling: z comes from a random oracle, not a function. There's no analytical map between (μ, σ) and the observed z to differentiate.
Reparameterized: z = μ + σ·ε with ε ~ N(0, 1) is a deterministic function of (μ, σ, ε). The random oracle was called before we touched μ and σ.
The partial derivatives are ∂z/∂μ = 1 and ∂z/∂σ = ε. Both are well-defined, so gradients flow from L through z back into μ and σ, then into encoder weights φ.
From Q1: log pθ(x) = ELBO + KL(qφ(z|x) ‖ pθ(z|x)).
For fixed θ, the LHS is constant. So increasing ELBO by Δ must decrease the KL gap by exactly Δ. Therefore:
arg maxφ ELBO = arg minφ KL(qφ(z|x) ‖ pθ(z|x))
This is what makes the ELBO a tractable surrogate.
(i) Positivity for free. Variance must be positive. log-space removes the constraint.
(ii) Numerical stability. log σ² ∈ (−∞, ∞); tiny variances become moderate negative numbers instead of vanishing floats.
(iii) Cleaner KL term. The closed-form KL contains log σ² directly.
Implementation: σ = exp(½ · log σ²) when σ is needed.
Reparameterized: ∇φ L = Eε[ ∇φ ℓ(x, x̂(ε, μφ, σφ)) ]. Differentiates the loss function.
Score-function: ∇φ L = Ez ~ qφ[ ℓ(x, z) · ∇φ log qφ(z|x) ]. Multiplies the gradient of log q by the raw loss value — which is noisy.
In practice the reparameterized version has 1–3 orders of magnitude lower variance, which is why VAE training works with just one Monte Carlo sample per data point.
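The variance difference shows up even in a one-dimensional toy problem. The sketch below estimates d/dμ of E[z²] for z ~ N(μ, σ²), whose exact value is 2μ, with both estimators. The objective z² and the values μ = σ = 1 are assumptions chosen for simplicity, and the gap in this toy is much smaller than what the text reports for real VAE training.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 1.0, 200_000

# Toy objective: L(mu) = E_{z ~ N(mu, sigma^2)}[z^2], so dL/dmu = 2 * mu.
eps = rng.standard_normal(n)
z = mu + sigma * eps

# Reparameterized: differentiate the loss through z = mu + sigma * eps.
g_reparam = 2.0 * z                          # d(z^2)/dz times dz/dmu = 1

# Score-function: raw loss value times d/dmu log q(z).
g_score = z ** 2 * (z - mu) / sigma ** 2

print("true gradient:", 2 * mu)
print("reparam: mean %.3f, var %.2f" % (g_reparam.mean(), g_reparam.var()))
print("score:   mean %.3f, var %.2f" % (g_score.mean(), g_score.var()))
```

Both estimators are unbiased, so their means agree; only the per-sample spread differs.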
The identity E[z²] = Var[z] + (E[z])² = σ² + μ² is the workhorse of the KL derivation in Q2.
KL is asymmetric because q · log(q/p) weights by q's density, not p's.
Reverse KL falls out naturally from the ELBO. Using forward KL would be intractable — it would require sampling from the true posterior, which is what we don't know how to do.
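For pθ(x | z) = N(fθ(z), σ²dec · I), with D the number of output dimensions:
log pθ(x | z) = −‖x − fθ(z)‖² / (2σ²dec) − (D/2) · log(2πσ²dec)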
So arg max log pθ(x | z) = arg min ‖x − fθ(z)‖² — mean-squared error.
σ²dec becomes an implicit β-like weight between reconstruction and KL.
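For pθ(x | z) = ∏i Bernoulli(xi; πi) with π = fθ(z):
log pθ(x | z) = Σi [xi · log πi + (1 − xi) · log(1 − πi)]
Negating gives binary cross-entropy:
BCE(x, π) = −Σi [xi · log πi + (1 − xi) · log(1 − πi)]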
β = 0: KL disappears. Encoder collapses to a Dirac delta. Model becomes a standard autoencoder. Generation from N(0, I) produces garbage.
β → ∞: Only KL matters. Optimal q is the prior, regardless of x — posterior collapse. Reconstruction drops to the level of decoding pure noise.
Sweet spot: β ∈ [0.1, 4] depending on dataset and decoder capacity.
Zero when q matches the prior exactly — the attractor that the regulariser pulls every q(z|x) toward.
The conceptual questions interviewers ask. Short, complete answers — the kind you'd give in 60 seconds at a whiteboard.
An autoencoder maps each input to a point in latent space; a VAE maps each input to a distribution over latent space and adds a KL regulariser that pulls those distributions toward a fixed prior — which is what makes the latent space sampleable for generation.
Because the VAE's KL term forces every q(z|x) to look like a known distribution — typically N(0, I). So you can sample fresh z ~ N(0, I) at test time and feed it to the decoder, and that z will land in the same region of latent space the decoder was trained on. With a standard AE you have no idea what distribution to sample from.
Instead of learning a separate posterior qi(z) for each data point, a single shared neural network qφ(z | x) is trained to output posterior parameters for any input. You "amortize" the cost of inference over the whole dataset by paying once to train one network.
What: the encoder collapses to q(z|x) ≈ p(z) for every x — KL ≈ 0, reconstruction stays bad.
Detect: log per-dimension KL during training. If near-zero with poor reconstruction, that's collapse.
Fix: KL annealing; free bits; weaker decoder; InfoVAE/MMD; richer posterior (normalizing flows).
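The detection step can be sketched as a small monitoring helper that reports which latent dimensions sit at the prior. The threshold of 0.01 nats and the function names are assumptions for illustration; a suitable cut-off depends on the model and dataset.

```python
import numpy as np

def per_dim_kl(mu, log_var):
    # Closed-form KL to N(0, 1) for each latent dim, averaged over the batch.
    return np.mean(0.5 * (mu ** 2 + np.exp(log_var) - log_var - 1.0), axis=0)

def collapsed_dims(mu, log_var, threshold=0.01):
    # Indices of dimensions whose average KL is below the (assumed) threshold.
    return np.flatnonzero(per_dim_kl(mu, log_var) < threshold)

# Fake encoder outputs: dimension 0 carries information, dims 1 and 2 do not.
rng = np.random.default_rng(0)
mu = np.zeros((64, 3))
mu[:, 0] = rng.normal(0.0, 2.0, 64)
log_var = np.zeros((64, 3))
log_var[:, 0] = -2.0

print(per_dim_kl(mu, log_var))
print(collapsed_dims(mu, log_var))    # [1 2]
```

If every dimension shows up in the collapsed list while reconstruction is still poor, that matches the collapse pattern described above.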
Two reasons. (i) Pixel-wise Gaussian/Bernoulli decoders optimise mode-averaging losses — the optimal MSE answer when multiple valid reconstructions exist is the pixel-wise mean (blurry). (ii) Reparameterized sampling adds noise; the decoder smooths to be robust. Fixes: richer output distributions (PixelCNN), perceptual/adversarial losses, VQ-VAE with autoregressive priors.
VAE: meaningful latent space (interpolation, anomaly detection), probabilistic interpretation, easy training, posterior access. GAN: maximum visual quality, no explicit likelihood needed. Hybrids such as VQ-GAN pair a VQ-VAE-style discrete latent with an adversarial loss to get some of both.
A diffusion model can be viewed as a hierarchical VAE with a fixed encoder (forward noising) and a learned decoder (denoising network) at every timestep. Both optimise a variational lower bound. Stable Diffusion uses a VAE to compress to a small latent space, then runs diffusion inside it.
It assumes latent dimensions are conditionally independent given x. Loss in expressiveness is small for typical data, gain in tractability is huge: covariance is a vector (J floats) instead of a matrix (J²). For richer posteriors, switch to normalizing flows or full covariance.
Both encoder and decoder receive an extra conditioning input y (e.g. a class label). Loss: L = Lrecon(x, x̂ | y) + KL(q(z | x, y) ‖ p(z | y)). Sample z ~ p(z | y) at generation time and decode with the desired y — controllable generation.
Reconstruction improves (more channels through bottleneck). Total KL tends to grow with z_dim because it is a sum over dimensions. Visualisation gets harder. Per-dim KL becomes unevenly used: some dimensions encode structure, others stay at the prior.
Not directly with reparameterization. Workarounds: Gumbel-Softmax (continuous relaxation); VQ-VAE (vector-quantize z, straight-through gradients); REINFORCE (high variance). VQ-VAE became dominant.
Posterior collapse with strong KL. Checklist: (1) Inspect per-dim KL — lower β or use annealing if all zero. (2) Check decoder's z-weights — if near zero, decoder ignores z. (3) Reduce decoder capacity. (4) Use free bits. (5) Try a richer posterior.
Mixture-of-Gaussians prior: better for multi-modal data. VampPrior: learned mixture, often outperforms standard. Normalizing flow prior: arbitrarily flexible. Standard normal is the default because closed-form KL is free.
Batch x: [B, 784]. (1) Encoder body: [B, 784] → [B, 256]. (2) Two heads: [B, 256] → [B, 20] each. (3) Sample ε: [B, 20]. (4) z = μ + exp(½·log σ²)·ε. (5) Decoder: [B, 20] → [B, 784], sigmoid. (6) BCE summed across pixels, KL summed across dims, both mean over batch. (7) Backward, step.
The evidence is the marginal likelihood of the data: pθ(x) = ∫ pθ(x, z) dz. ELBO is a tractable function that is a lower bound on the log of that evidence.
Train on normal data only. At test time: (i) high reconstruction loss; (ii) low likelihood under prior (encoded μ far from origin); (iii) high KL. Combine or use just (i) for a clean threshold. Works for fraud, defect inspection, network intrusion.