Autoencoders & variational autoencoders, animated.

A working tour of every operation inside an autoencoder — encoder, bottleneck, decoder, distributions, the reparameterization trick, KL divergence, and the ELBO — built so each step can be watched and replayed.


Contents

Part I · Autoencoders

Intuition

Why compress at all?
Anatomy of an autoencoder

Mechanics

The encoder, layer by layer
The bottleneck
The decoder
Reconstruction loss
Training loop, animated

Latent space & variants

The latent space of an AE
Variants: sparse, denoising, contractive
Why standard AEs cannot generate

Part II · Variational autoencoders

The probabilistic leap

From points to distributions
VAE as a generative story
The intractable posterior

Variational inference

The variational family
KL divergence, visualized
The ELBO

Building the network

Encoder outputs μ and log σ²
The reparameterization trick
Sampling, decoding
Full forward pass, animated
The two-term loss
β-VAE — the regularization knob

Generation & latent space

Sweeping the latent space
Interpolation between samples
VAE vs AE — side by side

In code & in practice

A complete PyTorch VAE
Applications & further reading

Part III · Reference & practice

Algorithms, step by step

Algorithm flows

Practice Q&A

Mathematical derivations
Interview questions

Part I

Autoencoders

A neural network that squeezes itself: copy your input back out, but pass it through a deliberately narrow middle. What survives the squeeze is the model's idea of "important".

§ 01 Why compress at all?

A 28×28 grayscale image of a hand-written digit lives in a 784-dimensional space — one number per pixel. But the set of valid digits is much, much smaller than 784 dimensions allow. The space of "looks-like-a-real-digit" is a thin, curved manifold inside a vast volume of pixel noise.

An autoencoder's job is to find a compact set of coordinates for your data, ideally close to its intrinsic dimensionality. If valid digits really only need ten or twenty meaningful axes of variation, an autoencoder should be able to learn those axes from data alone.

If you can compress a digit down to, say, 2 numbers without losing what makes it a digit, three useful things follow. First, you have a visualization — paint each digit as a point in 2D. Second, you have a compression scheme — store 2 floats instead of 784. Third, and most interestingly, you have an implicit model of the data: pick any 2 numbers, decode them, and (with the variational version we will build in Part II) something digit-like comes out.

[Interactive demo: "Step into noise" slider from 0% to 100%. At 0% you are on the digit manifold; at 100% you are at a random point in pixel space.]

The "valid-digit" manifold is a vanishingly thin sliver of pixel space. An autoencoder learns coordinates along this sliver and discards everything else.

This is the central premise: meaningful data lives on a low-dimensional manifold inside a high-dimensional ambient space, and a neural network can learn that manifold by being forced to bottleneck.

§ 02 Anatomy of an autoencoder

Three pieces, in order: an encoder h_φ that maps an input x down to a low-dimensional code z; a bottleneck where the entire input is forced to live in that compressed code; and a decoder f_θ that maps z back up to a reconstruction x̂ in the original space. The two networks are trained jointly to minimize the difference between x and x̂.

x → h_φ(x) → z → f_θ(z) → x̂

Two neural nets stacked back-to-back, with a forced narrow waist in the middle.

The widget above is the mental model for every autoencoder variant in this post. Everything that follows — denoising, sparse, variational — only changes what goes into the bottleneck or how it is regularized. The skeleton is constant.

§ 03 The encoder, layer by layer

Each encoder layer is the standard machinery: a learned linear projection, a bias, and a non-linearity (often ReLU or tanh). The dimensionality shrinks from layer to layer — for MNIST a typical descent is 784 → 256 → 64 → z_dim. For images, the linear layers are often replaced by strided convolutions that combine spatial down-sampling with feature extraction in a single step.

h_ℓ = σ(W_ℓ · h_{ℓ−1} + b_ℓ)

σ is a non-linearity. W_ℓ shrinks the dimensionality at each step.

The encoder is doing lossy compression by gradient descent. Whatever survives the bottleneck must reconstruct the input well enough that the decoder can do its job. So the encoder learns to keep what matters for reconstruction and throw the rest away.
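As a small worked instance of the layer equation, here is a sketch of a single dense layer in plain Python (standard library only, so it runs without PyTorch). The function names, the 4 → 2 shape, and the weights are hypothetical and hand-picked purely for illustration.

```python
def relu(v):
    """ReLU non-linearity for a single number."""
    return max(0.0, v)

def dense_layer(W, b, h_prev, activation):
    """Compute activation(W · h_prev + b) for one encoder layer."""
    out = []
    for row, bias in zip(W, b):
        pre = sum(w * h for w, h in zip(row, h_prev)) + bias
        out.append(activation(pre))
    return out

# Hypothetical 4 -> 2 layer: the weight matrix has fewer rows than columns,
# which is what shrinks the dimensionality.
W1 = [[0.5, -0.5, 0.0, 1.0],
      [-1.0, 0.0, 0.5, 0.0]]
b1 = [0.0, 0.5]
x = [1.0, 2.0, 3.0, 4.0]

h1 = dense_layer(W1, b1, x, relu)
print(h1)
```

Stacking several such calls with progressively fewer rows gives the 784 → 256 → 64 → z_dim descent described above.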

§ 04 The bottleneck

The bottleneck is just a vector — typically a handful of floats. For a typical image autoencoder, z ∈ ℝ³² or z ∈ ℝ⁶⁴. For visualization-oriented examples (like the ones in this post) you can squeeze all the way to z ∈ ℝ² so you can see the latent space on a flat plot.

[Interactive demo: a bottleneck-size slider (dimensions allowed through the narrow waist) with a reconstruction-quality readout, approximate and modelled after MNIST behaviour. Panels: the original image and its reconstruction at the chosen bottleneck.]

With a generous bottleneck the network can store the image almost verbatim; with a tight bottleneck it has to keep only the most informative axes and reconstruct the rest from learned priors.

The bottleneck size is a deliberate design choice. Too wide and you risk the network just memorising the input (the identity function is trivial when input and output dimensions match). Too narrow and reconstruction suffers. The "interesting" range is typically 10–100× narrower than the input.

§ 05 The decoder

The decoder is the encoder run in reverse — a series of layers that expand the code z back up to the original input shape. For image data, transposed convolutions are common; for tabular or flattened input, plain dense layers suffice.

x̂ = f_θ(z) = σ(W'_L · σ(W'_{L−1} · ... σ(W'₁ · z + b'₁)...) + b'_L)

A mirror of the encoder. Final activation is usually sigmoid for [0,1] pixel data, or linear for unbounded outputs.

One trick worth noting: the encoder and decoder do not need to be perfect mirrors of each other. Many implementations use a deeper decoder than encoder, or share no weights at all between them. They only need to compose well into something that returns the input.

§ 06 Reconstruction loss

An autoencoder is trained by minimizing how different the output is from the input. For continuous data, the canonical choice is mean-squared error; for binary or [0,1]-bounded pixel data, binary cross-entropy is more common.

L_MSE = (1/N) · Σ_i ||x_i − f_θ(h_φ(x_i))||²
L_BCE = −Σ_{i,j} [ x_ij log x̂_ij + (1 − x_ij) log (1 − x̂_ij) ]

MSE assumes Gaussian observation noise; BCE assumes Bernoulli (each pixel is "on or off" with probability x̂).

Both losses are pixel-wise. They treat each output position independently and ask: "how close is the predicted value to the target value?" This is also one of the limitations of vanilla autoencoders for image data — pixel-wise loss is blind to perceptual mismatch (a one-pixel shift is "wrong everywhere" by MSE but obviously fine to a human). Modern systems often supplement with perceptual or adversarial losses, but for our purposes pixel-wise loss is enough.
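The two losses and the one-pixel-shift problem can be checked by hand on a four-pixel "image". This is a sketch in plain Python; the pixel values are made up, and the `eps` clamp is an assumption added only to keep `log(0)` out of the arithmetic. Here MSE is averaged over the pixels of one example, whereas the formula above averages over examples.

```python
import math

def mse_loss(x, x_hat):
    """Mean squared error, averaged over the pixels of one example."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def bce_loss(x, x_hat, eps=1e-12):
    """Binary cross-entropy summed over pixels; eps guards against log(0)."""
    total = 0.0
    for a, p in zip(x, x_hat):
        p = min(max(p, eps), 1.0 - eps)
        total -= a * math.log(p) + (1.0 - a) * math.log(1.0 - p)
    return total

x = [0.0, 1.0, 1.0, 0.0]            # a two-pixel "stroke" in the middle
good = [0.1, 0.9, 0.8, 0.2]         # slightly soft but correctly placed
shifted = [1.0, 1.0, 0.0, 0.0]      # the same stroke moved left by one pixel

print(mse_loss(x, good), mse_loss(x, shifted))
print(bce_loss(x, good), bce_loss(x, shifted))
```

The shifted reconstruction is perceptually close to the target, yet both pixel-wise losses score it far worse than the soft one.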

§ 07 Training loop, animated

Training is the standard SGD ritual: forward pass, compute loss, backward pass, update weights. The only special thing about an autoencoder is that the input and the target are the same tensor — you're not supervising against external labels, you're supervising the model against itself.

[Interactive demo: training over 30 simulated epochs, showing the epoch counter, the reconstruction loss (average MSE per pixel, simulated), the target image, and the current reconstruction.]

# Vanilla autoencoder training loop in PyTorch
# Assumes model, dataloader, optimizer, and num_epochs are defined elsewhere.
import torch.nn.functional as F

for epoch in range(num_epochs):
    for batch in dataloader:
        x = batch[0]                    # input doubles as target
        x_hat = model(x)                # forward through encoder→bottleneck→decoder
        loss = F.mse_loss(x_hat, x)     # pixel-wise reconstruction
        optimizer.zero_grad()
        loss.backward()                 # gradients flow back through both networks
        optimizer.step()                # update φ and θ together

§ 08 The latent space of an autoencoder

Train an autoencoder with a 2D bottleneck on MNIST, then plot every training image's code on the plane. The picture you get is informative but ugly: clusters of digits in roughly the right places, but with holes between them, irregular shapes, and arbitrary scales.

This irregular layout is a perfectly fine outcome if you only want to reconstruct or compress. But it's a problem the moment you want to generate new samples by picking a random latent point and decoding it — most random points will land in the "no man's land" between clusters and produce nonsense. Section §11 onwards solves exactly this problem.

§ 09 Useful variants: sparse, denoising, contractive

Before VAEs, the field invented several flavours of autoencoder that add structure to the latent code without going fully probabilistic.

Sparse autoencoders

Add an L1 penalty on the bottleneck activations: L = L_recon + λ · Σ |z_i|. Most bottleneck dimensions are pushed to zero for any given input, leaving only a small active subset. This is useful for feature discovery — each non-zero dimension is forced to specialise in one type of input pattern.

Denoising autoencoders

Corrupt the input with noise before feeding it in, but ask the network to reconstruct the clean input. The network can't simply memorise — it has to learn what's "essential" enough to survive noise.

x̃ = x + ε
L = ||x − f_θ(h_φ(x̃))||²

Network sees x̃ but is graded against x. Forces robustness to corruption.

Contractive autoencoders

Penalise the Jacobian of the encoder with respect to its input. Concretely: L = L_recon + λ · ||∂h/∂x||²_F. This makes the encoder less sensitive to small input changes — nearby inputs map to nearby codes. A different way to encourage smooth latent structure.
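The three regularizers can each be written in a few lines. The sketch below uses plain Python and a toy linear encoder h(x) = W x, which is an assumption made so the Jacobian is known exactly (it equals W); the finite-difference routine and all names are hypothetical illustrations rather than how a framework would compute these terms.

```python
import random

def encode(W, x):
    """Toy linear encoder h(x) = W x (hypothetical, for illustration)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def l1_penalty(z):
    """Sparse AE: sum of absolute bottleneck activations."""
    return sum(abs(v) for v in z)

def corrupt(x, noise_std, rng):
    """Denoising AE: add Gaussian noise to the input only, never the target."""
    return [v + rng.gauss(0.0, noise_std) for v in x]

def jacobian_frobenius_sq(W, x, step=1e-6):
    """Contractive AE: finite-difference estimate of ||dh/dx||_F squared."""
    base = encode(W, x)
    total = 0.0
    for j in range(len(x)):
        bumped = list(x)
        bumped[j] += step
        col = [(hb - h0) / step for hb, h0 in zip(encode(W, bumped), base)]
        total += sum(c * c for c in col)
    return total

W = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]
x = [0.2, 0.4, 0.6]
rng = random.Random(0)

x_noisy = corrupt(x, 0.1, rng)                     # what the network would see
penalty_sparse = l1_penalty(encode(W, x))          # added as lambda * penalty
penalty_contractive = jacobian_frobenius_sq(W, x)  # added as lambda * penalty
print(penalty_sparse, penalty_contractive)
```

For this linear encoder the contractive penalty is just the sum of squared weights, which makes the estimate easy to check against W directly.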

All three share

The same architecture as a standard autoencoder, plus a regularizer.

An interpretable latent code where individual dimensions or local neighbourhoods carry meaning.

None of them solve generation

Even with all these tricks, the latent space is still a set of separate clusters scattered through a continuous space. There's no principled way to sample from it.

That gap — between learning a useful code and being able to generate from it — is exactly the gap variational autoencoders close.

§ 10 Why a standard autoencoder cannot generate

Three reasons, stacked on top of each other.

1. The latent space has no known distribution. You have no idea what shape it takes. So you don't know how to sample from it. Try a uniform distribution and you'll cover regions the encoder never visited; try a Gaussian and most samples will still miss the actual data manifold.

2. The latent space has gaps. Real points cluster around regions the encoder used; the in-between is undefined behaviour. Decoding a random point lands in undefined territory.

3. The latent space is not continuous in any guaranteed sense. Two nearby codes might decode to very different outputs because nothing in the loss function encouraged smoothness.

Variational autoencoders address all three of these in one move: they shape the latent space to be a known, smooth, continuous distribution by construction.

Part II

Variational autoencoders

Same skeleton — encoder, bottleneck, decoder — but the bottleneck now produces a distribution, not a point. The whole architecture is forced to play nicely with probability.

§ 11 From points to distributions

In a standard autoencoder, the encoder maps an input to one specific point z in latent space. A variational autoencoder makes the encoder output the parameters of a distribution instead — typically a Gaussian with mean μ and standard deviation σ. Then z is sampled from that distribution before being passed to the decoder.

Standard AE: x → z = h_φ(x)
VAE: x → μ, σ = h_φ(x) → z ~ N(μ, σ²)

A single point is replaced by a recipe for sampling points.

[Interactive demo: slider for σ, the spread of the distribution. At σ = 0 the VAE collapses to a standard AE; at large σ the encoder is "uncertain" and one input maps to a region.]

Forcing the encoder to commit to a region rather than a point is what makes the latent space smooth. Neighbours in the input map to overlapping Gaussians in latent space — and that overlap is what the decoder learns to interpolate across.

That single change cascades through everything. The loss function changes (we now need to ensure the distributions themselves are well-behaved). The forward pass changes (we need to sample, which is non-differentiable in its naive form). The interpretation changes (the encoder is now doing approximate Bayesian inference). The rest of this part walks through each of those consequences in order.

§ 12 VAE as a generative story

Before diving into the inference machinery, here is the imagined data-generating process that the VAE assumes — the story the model believes about how the data was produced.

1. z ~ p(z) = N(0, I) // pick a low-dim point from a standard normal
2. x ~ p_θ(x | z) // pass it through a neural net f_θ, then sample

Two steps. Sample a latent. Decode it through a parametric distribution.

Step 2 is just our decoder: f_θ(z) produces the parameters of p_θ(x | z) — for image data, usually the mean of a Gaussian or the probabilities of independent Bernoullis. Step 1 is the crucial piece: the latent variable is assumed to come from a standard Gaussian. That's our generative prior.

Why this is powerful

Even though p(z) is simple (a standard Gaussian) and p_θ(x | z) may also be simple (an independent Gaussian per pixel), the non-linear mapping f_θ can warp them into an arbitrarily complex p_θ(x). That's how a VAE fits curvy, multi-modal distributions while only ever working with simple distributions internally.

If we believed this story exactly, training would be straightforward: just do maximum likelihood. Find the θ that maximises p_θ(x₁, ..., x_n). The problem, as the next section shows, is that computing that probability requires solving an intractable integral. The whole machinery of variational inference exists to dodge that integral.

§ 13 The intractable posterior

Given a fixed decoder θ, two things would be nice to know for each observed x:

The posterior p_θ(z | x) — given this observation, what latent z probably produced it?
The marginal likelihood p_θ(x) — how probable is this observation under the model? (Maximising this is maximum likelihood.)

By Bayes' theorem, the posterior is:

p_θ(z | x) = p_θ(x | z) · p(z) / p_θ(x)
p_θ(x) = ∫ p_θ(x | z) · p(z) · dz

The denominator marginalizes z over the full latent space: a multi-dimensional integral with no closed form for any non-trivial f_θ.

For a latent space of even modest dimensionality (50 dims, say), this integral has no closed form and Monte Carlo estimation would need an astronomical number of samples. This is the wall. Bayesian inference on a neural decoder is intractable by direct attack.
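To make the integral concrete, here is a sketch in plain Python for a one-dimensional toy model in which the decoder mean is simply f(z) = z with a fixed noise variance. The toy model and the names `normal_pdf` and `mc_marginal` are assumptions for illustration. In this special case the marginal has a closed form, so the naive Monte Carlo estimate can be compared against it.

```python
import math
import random

def normal_pdf(v, mean, var):
    """Density of N(mean, var) evaluated at v."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mc_marginal(x, noise_var, num_samples, rng):
    """Estimate p(x) = E_{z ~ N(0,1)}[ p(x | z) ] by sampling the prior."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)
        total += normal_pdf(x, z, noise_var)   # toy decoder: mean f(z) = z
    return total / num_samples

x_obs, noise_var = 1.0, 0.25
exact = normal_pdf(x_obs, 0.0, 1.0 + noise_var)   # closed form for this toy only
estimate = mc_marginal(x_obs, noise_var, 20000, random.Random(0))
print(exact, estimate)
```

In one dimension, sampling the prior is good enough. The difficulty described above appears when z has many dimensions and the decoder is a neural network: almost every prior sample then lands where p_θ(x | z) is negligible, so the estimate is dominated by a handful of lucky draws.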

Variational inference is the workaround. Instead of computing the posterior, we approximate it with a simpler distribution q_φ(z | x) that we deliberately design to be tractable — typically a Gaussian whose parameters are output by a neural network.

§ 14 The variational family

Pick a family of distributions Q we can sample from and evaluate easily. Inside that family, find the specific distribution that best approximates the true posterior. For VAEs, the standard choice is a diagonal-covariance Gaussian:

q_φ(z | x) = N(μ_φ(x), diag(σ²_φ(x)))

μ and σ² are themselves outputs of a neural network: the encoder.

"Diagonal" means we assume the latent dimensions are independent given the input — a simplifying assumption that lets us describe the covariance with a single vector rather than a full matrix. It's not strictly true (latent dimensions might correlate), but the trade-off between expressiveness and tractability is worth it.

One vital subtlety: amortized inference. We could have given each data point its own φ_i — a separate approximate posterior per example — but we don't. We share one neural network across all data points, mapping each x_i to its own (μ_i, σ_i). This is what makes inference fast and scalable; it's the whole reason VAEs are practical.

[Interactive demo: pick one of 10 example digits as the input x and set the encoder uncertainty, from small σ (sharp) to large σ (fuzzy).]

Each input maps to its own little Gaussian in latent space. Inputs that look similar end up with overlapping Gaussians — and that overlap is what eventually gives the VAE smooth interpolation.

§ 15 KL divergence, visualized

The recipe for "best approximation" needs a notion of closeness between distributions. Variational inference uses Kullback-Leibler divergence:

KL(q || p) = ∫ q(z) · log [ q(z) / p(z) ] · dz = E_{z ~ q}[ log q(z) − log p(z) ]

Asymmetric. Zero when q = p. Always non-negative. Penalizes q for putting mass where p has none.

Two facts to keep in mind. First, KL divergence is not symmetric: KL(q || p) ≠ KL(p || q). The order matters. VAEs use KL(q_φ(z|x) || p(z)), which is sometimes called "mode-seeking" because it harshly penalises q for placing mass where p has none. Second, KL between two diagonal Gaussians has a clean closed-form expression — no integral required.
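Both facts can be checked numerically with the standard closed form for the KL divergence between two univariate Gaussians. The helper name `kl_gauss` and the example parameters are illustrative choices.

```python
import math

def kl_gauss(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) for univariate Gaussians."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2.0 * s2 ** 2) - 0.5

kl_qp = kl_gauss(2.0, 1.5, 0.0, 1.0)   # q = N(2, 1.5^2) against p = N(0, 1)
kl_pq = kl_gauss(0.0, 1.0, 2.0, 1.5)   # same pair, arguments swapped
kl_same = kl_gauss(0.0, 1.0, 0.0, 1.0) # identical distributions

print(kl_qp, kl_pq, kl_same)
```

Swapping the arguments changes the value, and the divergence is zero only when the two distributions coincide.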

[Interactive demo: sliders for the mean μ and standard deviation σ of the variational distribution q, with a readout of KL(q || N(0, 1)) computed in closed form and split into a mean penalty μ² / 2 and a variance penalty (σ² − 1 − log σ²) / 2.]

For a diagonal-Gaussian q and a standard-Gaussian prior, the KL term collapses to a sum over latent dimensions:

KL(q_φ(z | x) || N(0, I)) = −½ · Σ_j ( 1 + log σ²_j − μ²_j − σ²_j )

No integral. No Monte Carlo. A simple algebraic expression in terms of the encoder's outputs.

This is the term that pulls every q_φ(z|x) toward the standard normal. Without it, the encoder would happily place each input far from every other input (nothing constrains the codes, so reconstruction is easy). With it, the encoder has to share a common region of latent space, which is precisely what makes the latent space well-behaved for generation.
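As a worked example of the closed form, the sketch below evaluates it from a mean vector and a log-variance vector, then confirms that it equals a mean penalty Σ μ²/2 plus a variance penalty Σ (σ² − 1 − log σ²)/2. The two-dimensional inputs are hypothetical; the second dimension is set to match the prior so that it contributes nothing.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """-0.5 * sum_j (1 + logvar_j - mu_j^2 - exp(logvar_j))."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

mu = [2.0, 0.0]
logvar = [math.log(1.5 ** 2), 0.0]   # sigma = 1.5 in dim 0, sigma = 1 in dim 1

mean_penalty = sum(m * m / 2.0 for m in mu)
var_penalty = sum((math.exp(lv) - 1.0 - lv) / 2.0 for lv in logvar)
kl = kl_to_standard_normal(mu, logvar)

print(kl, mean_penalty, var_penalty)
```

Only the first dimension is penalised here: moving its mean away from zero costs μ²/2 = 2.0, and widening it to σ = 1.5 adds a smaller variance term.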

§ 16 The ELBO (Evidence Lower Bound)

Putting it all together. We want to maximise the log marginal likelihood log p_θ(x), which we cannot compute. The trick is a chain of algebra that introduces q_φ and decomposes the log-likelihood:

log p_θ(x) = ELBO(φ, θ) + KL( q_φ(z | x) || p_θ(z | x) )
ELBO(φ, θ) := E_{z ~ q_φ}[ log p_θ(x | z) ] − KL( q_φ(z | x) || p(z) )

Because KL is non-negative, the ELBO is a lower bound on log p_θ(x). Maximize the bound, you push up the likelihood.

Read the second line as two opposing forces. The first term rewards the model when the decoder, fed a sample from q_φ(z | x), reproduces x. This is the reconstruction term. The second term penalises the model when q_φ drifts away from the prior p(z). This is the regulariser.

L_VAE(x) = −E_{z ~ q_φ}[ log p_θ(x | z) ] + KL( q_φ(z | x) || p(z) )

The negative ELBO. Minimizing this is equivalent to maximizing the ELBO.

This is the entire VAE training objective. Every term is differentiable (after one more trick, in §18). The architecture is just two neural networks. Everything else — generative interpretation, smooth latent space, ability to sample new data — is a consequence of optimising this single loss.
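The decomposition log p_θ(x) = ELBO + KL(q || posterior) can be verified exactly in a toy setting. The sketch below assumes a one-dimensional linear-Gaussian model (z ~ N(0, 1), x | z ~ N(z, s2)), chosen only because every quantity then has a closed form; the values of x, s2, m and v are arbitrary.

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ), parameterised by variances."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

# Toy model (an assumption for illustration): z ~ N(0, 1), x | z ~ N(z, s2).
x, s2 = 1.0, 0.25
m, v = 0.5, 0.4          # an arbitrary approximate posterior q(z | x) = N(m, v)

# Closed forms that exist only because the toy decoder is linear-Gaussian.
log_px = -0.5 * math.log(2 * math.pi * (1 + s2)) - x ** 2 / (2 * (1 + s2))
post_mean, post_var = x / (1 + s2), s2 / (1 + s2)

# E_q[log p(x | z)] for a Gaussian q, then the two ELBO terms.
recon = -0.5 * math.log(2 * math.pi * s2) - ((x - m) ** 2 + v) / (2 * s2)
elbo = recon - kl_gauss(m, v, 0.0, 1.0)
gap = kl_gauss(m, v, post_mean, post_var)   # KL(q || true posterior)

print(log_px, elbo, gap)
```

The ELBO sits below log p(x) by exactly the KL gap, and the gap closes when q equals the true posterior N(post_mean, post_var).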

§ 17 Encoder outputs μ and log σ²

The encoder of a VAE has two output heads instead of one. Each head produces a vector of size z_dim — one for the mean μ, one for the log-variance log σ². (We predict log-variance, not variance, because log-space is unbounded above and below — easier to learn and numerically safer.)

# Two output heads in PyTorch
import torch.nn as nn
import torch.nn.functional as F

class VAEEncoder(nn.Module):
    def __init__(self, x_dim, hidden_dim, z_dim):
        super().__init__()
        self.body = nn.Linear(x_dim, hidden_dim)
        self.mu_head = nn.Linear(hidden_dim, z_dim)   # predicts μ
        self.logvar = nn.Linear(hidden_dim, z_dim)    # predicts log σ²

    def forward(self, x):
        h = F.relu(self.body(x))
        return self.mu_head(h), self.logvar(h)

§ 18 The reparameterization trick

Here is the trick that makes the whole thing trainable. After the encoder produces μ and σ², we need to sample z ~ N(μ, σ²) and pass it to the decoder. But a raw sampling step gives backpropagation no path from z back to μ and σ, so the encoder would receive no gradient from the reconstruction term. We'd get stuck.

The reparameterization trick "factors out" the randomness. Instead of sampling from N(μ, σ²) directly, we sample ε from a standard normal and transform it deterministically with μ and σ.

Bad (not differentiable): z ~ N(μ, σ²)
Good (differentiable): ε ~ N(0, I) ; z = μ + σ · ε

The randomness is now in ε, which has no parameters. μ and σ are deterministic functions of x, so gradients flow back through them.

[Interactive demo: sliders for μ (where to centre z) and σ (how spread out z is), plus a button that draws a fresh ε and shows it being transformed.]

Each click draws a new ε from N(0, I) — the noise drops in from above, gets shifted right by μ, stretched by σ, and lands on the green axis as z.

This isn't a numerical hack. It's a structural change to the computation graph: stochasticity is pushed outside the differentiable region. μ and σ are deterministic functions of x, so backpropagation works perfectly — gradients flow from the loss, through z, through μ and σ, back into the encoder weights, all while a single sample of ε sits placidly off to the side as input data.

# The reparameterization trick in PyTorch: four lines
def reparameterize(self, mu, logvar):
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    z = mu + std * eps
    return z

§ 19 Sampling, then decoding

Once z is drawn via reparameterization, the decoder runs exactly as in a standard autoencoder: a mirror network that expands z back into a reconstruction. The only thing the decoder "sees" is a single vector; it has no idea that vector came from a distribution rather than a fixed code.

That's a feature, not a bug. The randomness during training is what forces the decoder to be robust: it doesn't get to memorise specific codes, because the encoder will deliver slightly different codes for the same input every time. So the decoder learns to handle the neighbourhood around each input's mean code, not just the mean itself.

§ 20 Full forward pass, animated

Now we have all the pieces. Stitched together, a complete VAE forward pass looks like this:

x → h_φ(x) → (μ, log σ²) → z = μ + σ · ε, with ε ~ N(0, I) → f_θ(z) → x̂

This is the same diagram that lives in every VAE paper — but seeing the signal move through it is the part that usually has to live in your head.

§ 21 The two-term loss in practice

The full training loss for a single example is the negative ELBO, which we already saw:

L_recon(x) = ||x − x̂||² (or BCE, depending on output type)
L_KL(x) = −½ · Σ_j ( 1 + log σ²_j − μ²_j − σ²_j )
L_VAE(x) = L_recon(x) + L_KL(x)

Two terms, both differentiable. The KL term has a closed form; the reconstruction term is evaluated on the sampled z.

The two terms pull in opposite directions. The reconstruction loss wants each q_φ(z | x) to be a tight, distinctive point — easy for the decoder to invert. The KL term wants every q_φ(z | x) to look like the standard normal — overlapping, generic, indistinguishable. The model has to find the compromise.

[Interactive demo: toggles for the two losses. Reconstruction loss pushes codes apart; KL loss pulls every code toward the standard normal. With both active, the outcome is a working VAE in which the encoder finds a middle ground.]

§ 22 β-VAE: the regularization knob

The standard VAE weights both loss terms equally. The β-VAE introduces a tunable knob that controls how strongly the KL regulariser is enforced:

L_β-VAE(x) = L_recon(x) + β · L_KL(x)

β = 1 recovers the standard VAE.

β > 1 — stronger regularization

Latent dimensions are pushed harder toward independence. In some setups this encourages disentangled representations.

Generated samples look more on-distribution but lose fine detail.

β < 1 — weaker regularization

Reconstructions sharpen, fine-grained detail is preserved.

But the latent space starts looking like a standard AE — clustered, gappy, hard to sample from.
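The weighting itself is a single line of arithmetic. The sketch below uses made-up loss values, chosen only to show how β shifts the balance between the two terms; it says nothing about what values a real model would produce.

```python
def beta_vae_loss(recon_loss, kl_loss, beta=1.0):
    """Total loss L_recon + beta * L_KL for one example or one batch."""
    return recon_loss + beta * kl_loss

# Hypothetical loss values, used only to illustrate the weighting.
recon_loss, kl_loss = 80.0, 12.0
for beta in (0.5, 1.0, 4.0):
    total = beta_vae_loss(recon_loss, kl_loss, beta)
    share = beta * kl_loss / total   # fraction of the total contributed by KL
    print(f"beta={beta}: total={total:.1f}, KL share={share:.2f}")
```

Raising β makes the KL term a larger share of the gradient signal, which is the mechanism behind the stronger pull toward the prior.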

§ 23 Sweeping the latent space

Once a VAE is trained, the decoder can be run on any point in latent space — not just codes produced by the encoder. With a 2D latent space, you can visualise the entire learned manifold by walking a grid of z values and decoding each one.

[Interactive demo: a latent plane (z₀, z₁) with sliders for z₀ and z₁, shown next to the decoded image f_θ(z).]

Every point in the latent plane decodes to something. Because the encoder was forced to keep all encoded inputs close to N(0, I), neighbouring points decode to similar images — the manifold is smooth.

This grid-walk is the canonical VAE visualisation. Moving along an axis of the latent space changes some aspect of variation in the training data, although a plain VAE does not guarantee that each axis is individually interpretable.
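Building the grid of latent points is the only part of the sweep that does not need a trained model, so it can be sketched on its own. The range of −3 to 3 is an assumption (it covers most of the mass of a standard normal), and `latent_grid` is a hypothetical helper name.

```python
def latent_grid(low, high, steps):
    """Return a steps x steps grid of (z0, z1) points covering [low, high]^2.

    Rows run from high z1 (top) to low z1 (bottom), matching how a plot
    would be laid out on screen.
    """
    axis = [low + (high - low) * i / (steps - 1) for i in range(steps)]
    return [[(z0, z1) for z0 in axis] for z1 in reversed(axis)]

grid = latent_grid(-3.0, 3.0, 7)

# Each cell would be handed to a trained decoder, for example
# decode([z0, z1]); the decoder itself is not defined in this sketch.
for row in grid[:2]:
    print(row)
```

Tiling the decoded images in the same row and column order gives the familiar picture of digits morphing across the plane.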

§ 24 Interpolation between samples

Pick any two real inputs x₁ and x₂. Encode them to z₁ and z₂. Walk a straight line between them in latent space and decode every point along the way. With a VAE, you get a smooth morph from one input to the other.

[Interactive demo: choose a start digit (z₁) and an end digit (z₂), then slide α to decode z(α) = (1−α)·z₁ + α·z₂. Panels: x₁, f_θ(z(α)), x₂.]

With a VAE, every intermediate α decodes to a coherent digit-like image.
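The straight-line walk z(α) = (1−α)·z₁ + α·z₂ is easy to sketch without a model. The two codes below are hypothetical stand-ins for encoder outputs; in practice they would typically be the means μ produced for the two inputs.

```python
def interpolate(z1, z2, alpha):
    """z(alpha) = (1 - alpha) * z1 + alpha * z2, element-wise."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z1, z2)]

z_start = [-1.0, 2.0]    # hypothetical code for the start digit
z_end = [3.0, 0.0]       # hypothetical code for the end digit

# Five evenly spaced points, alpha = 0, 0.25, 0.5, 0.75, 1.
path = [interpolate(z_start, z_end, k / 4) for k in range(5)]
for z in path:
    print(z)
```

Decoding each point on `path` in order produces the frames of the morph.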

§ 25 VAE vs standard AE, side by side

The same architecture, the same dataset, different objectives. The right column adds the KL term; that's the only difference.

Property	Standard AE	VAE
Encoder output	A single point z = h_φ(x)	Distribution params (μ, σ²) over z
Bottleneck behaviour	Deterministic	Stochastic (z is sampled)
Loss function	Reconstruction only	Reconstruction + KL(q || p)
Latent space shape	Whatever shape gradient descent finds	Pulled toward N(0, I)
Can generate?	No — no defined sampling distribution	Yes — sample z ~ N(0, I), decode
Interpolation	Often produces blur or holes	Smooth, on-manifold
Use case	Compression, anomaly detection	Generation, smooth latent manipulation

§ 26 A complete VAE in PyTorch

Everything we've covered, in one self-contained module. Read alongside the relevant section above as a kind of glossary.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, hidden=400, z_dim=20):
        super().__init__()
        self.fc1 = nn.Linear(x_dim, hidden)
        self.fc_mu = nn.Linear(hidden, z_dim)
        self.fc_logvar = nn.Linear(hidden, z_dim)
        self.fc3 = nn.Linear(z_dim, hidden)
        self.fc4 = nn.Linear(hidden, x_dim)

    def encode(self, x):
        h = F.relu(self.fc1(x))
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def decode(self, z):
        h = F.relu(self.fc3(z))
        return torch.sigmoid(self.fc4(h))

    def forward(self, x):
        mu, logvar = self.encode(x.view(-1, 784))
        z = self.reparameterize(mu, logvar)
        x_hat = self.decode(z)
        return x_hat, mu, logvar

def vae_loss(x_hat, x, mu, logvar, beta=1.0):
    recon = F.binary_cross_entropy(x_hat, x.view(-1, 784), reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

§ 27 Applications & further reading

Where VAEs are used in practice

Domain	What the VAE provides
Image generation	Smooth manifold of faces, digits, characters; precursor to diffusion models.
Latent diffusion models	Stable Diffusion runs diffusion in a compressed latent space; that space comes from a VAE-style autoencoder.
Anomaly detection	Reconstruction error flags samples that don't fit the learned distribution.
Drug & molecule design	A VAE over SMILES strings lets you search continuously for molecules.
Single-cell genomics	scVI models RNA-seq counts via VAE with zero-inflated negative binomial.
Speech & music	Hierarchical VAEs underpin modern generative audio.

VAE variants worth knowing

Variant	Idea
β-VAE	Multiply the KL term by β; β > 1 encourages disentangled representations.
Conditional VAE	Encoder and decoder receive a conditioning signal y for controllable generation.
VQ-VAE	Replace the continuous latent with a discrete codebook.
NVAE	Hierarchical VAE that competes with GANs on sample quality.
IWAE	Tighter ELBO via multiple importance-weighted samples.
InfoVAE	Replace KL with MMD to dodge posterior collapse.

Primary references used in this post

Kingma & Welling — Auto-Encoding Variational Bayes (2013) — the original paper.
Kingma & Welling — An Introduction to Variational Autoencoders (2019) — book-length.
Matthew N. Bernstein — Variational autoencoders — mathematically careful walkthrough.
Jeremy Jordan — Variational autoencoders — the canonical intuitive blog post.
DataCamp — Variational autoencoders tutorial.
GeeksforGeeks — Variational autoencoders.

The VAE is one of the cleanest examples in deep learning of an algorithm whose structure emerges logically from a probabilistic objective.

Part III

Reference & practice

Algorithm flows broken into numbered steps. Mathematical derivations worked out in full. Interview-style questions with concise answers.

§ 28 Algorithm flows, step by step

Every operation in an autoencoder/VAE, written out as numbered procedures. Each step is the smallest meaningful unit — read top-to-bottom and you have the whole algorithm.

ALGORITHM 1

Training a standard autoencoder

Initialise the two networks

Encoder h_φ (input → bottleneck) and decoder f_θ (bottleneck → output). Both with random weights. Bottleneck dimension is a fixed hyperparameter (e.g. 32).

Sample a minibatch

Draw a batch {x₁, ..., x_B} from the training set. Same data both as input and as target.

Encode

Compute z_i = h_φ(x_i) for every example. z_i is a single vector in latent space.

Decode

Compute x̂_i = f_θ(z_i) — the reconstruction.

Compute reconstruction loss

MSE for continuous, BCE for [0, 1] outputs:

L = (1/B) · Σ_i ‖x_i − x̂_i‖²

Backpropagate

Gradients flow from L backward through the decoder, then through the bottleneck, then through the encoder.

Update weights

φ ← φ − η · ∇_φ L, θ ← θ − η · ∇_θ L. Adam, AdamW, or SGD — pick your favourite optimiser.

Repeat until convergence

Loop steps 2-7 over the dataset for the desired number of epochs. Stop when validation loss plateaus.
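The eight steps can be followed end to end on a problem small enough to differentiate by hand. This is a sketch in plain Python, assuming a linear encoder and decoder with a one-number bottleneck, full-batch gradient descent instead of minibatches, and a fixed step count instead of a validation-based stop; the data, initial weights, and learning rate are all hypothetical.

```python
def train_linear_autoencoder(data, w_enc, w_dec, lr=0.1, steps=500):
    """Gradient descent on mean ||x - w_dec * (w_enc . x)||^2 with a 1-D code."""
    w_enc, w_dec = list(w_enc), list(w_dec)
    n = len(data)
    history = []
    for _ in range(steps):
        g_enc = [0.0, 0.0]
        g_dec = [0.0, 0.0]
        loss = 0.0
        for x in data:
            z = w_enc[0] * x[0] + w_enc[1] * x[1]         # encode
            x_hat = [w_dec[0] * z, w_dec[1] * z]           # decode
            err = [x_hat[0] - x[0], x_hat[1] - x[1]]
            loss += (err[0] ** 2 + err[1] ** 2) / n        # reconstruction loss
            # Backpropagate by hand: decoder first, then through z to the encoder.
            dz = 2.0 * (err[0] * w_dec[0] + err[1] * w_dec[1]) / n
            for k in range(2):
                g_dec[k] += 2.0 * err[k] * z / n
                g_enc[k] += dz * x[k]
        history.append(loss)
        # Update both networks together.
        w_enc = [w - lr * g for w, g in zip(w_enc, g_enc)]
        w_dec = [w - lr * g for w, g in zip(w_dec, g_dec)]
    return w_enc, w_dec, history

# Toy data lying on a line in 2-D, so a single latent number is enough.
ts = [-1.0 + 2.0 * i / 19 for i in range(20)]
data = [(0.6 * t, 0.8 * t) for t in ts]

w_enc, w_dec, history = train_linear_autoencoder(data, [0.5, 0.1], [0.1, 0.5])
print(history[0], history[-1])
```

Because the data has one intrinsic dimension, the one-number bottleneck loses nothing and the reconstruction loss can be driven close to zero.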

ALGORITHM 2

Training a variational autoencoder

Initialise the networks

Encoder body shared, plus two output heads producing μ_φ and log σ²_φ. Decoder f_θ. Pick latent dimension z_dim.

Sample a minibatch

Draw {x₁, ..., x_B}. Both input and reconstruction target.

Encode to distribution parameters

For every x_i the encoder outputs two vectors:

μ_i, log σ²_i = h_φ(x_i)

Reparameterize

Sample ε_i ~ N(0, I) independently per example, then compute:

z_i = μ_i + exp(½ · log σ²_i) · ε_i

This is the only stochastic step. ε is an input, not a parameter — gradients pass through μ and σ cleanly.

Decode

Compute the reconstruction x̂_i = f_θ(z_i).

Compute reconstruction loss

L_recon = ‖x_i − x̂_i‖² (or BCE)

Compute the KL term

Closed form between N(μ, diag(σ²)) and N(0, I):

L_KL = −½ · Σ_j (1 + log σ²_j − μ²_j − σ²_j)

Combine the two losses

L = L_recon + β · L_KL

β = 1 recovers the standard VAE.

Backpropagate & update

Gradients flow: L → decoder → z → (μ, log σ²) → encoder body. ε sits to the side as a constant input.

Repeat

Loop steps 2-9. Watch both L_recon and L_KL: L_recon should fall, while L_KL often rises from near zero early on before the two settle into a trade-off.

ALGORITHM 3

The reparameterization trick (computational view)

Given μ, log σ² from the encoder

Both are differentiable functions of x and φ. The goal: sample z ~ N(μ, σ²) in a way that lets gradients flow.

Convert log σ² to σ

σ = exp(½ · log σ²)

Predicting log-variance is numerically safer.

Draw fresh noise

ε ~ N(0, I)

Same shape as z. No parameters depend on ε.

Affine combine

z = μ + σ · ε

By construction z ~ N(μ, σ²). The whole expression is differentiable in μ and σ.

Gradient flow

During backprop, ∂z/∂μ = 1 and ∂z/∂σ = ε. Both well-defined.
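Both claims in this algorithm can be checked numerically: the derivatives by central finite differences with ε held fixed, and the distribution of z by drawing many samples. This is a plain-Python sketch; the values of μ and σ, the seed, and the sample count are arbitrary.

```python
import random

def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps, with eps supplied from outside."""
    return mu + sigma * eps

rng = random.Random(0)
mu, sigma = 1.5, 0.8
eps = rng.gauss(0.0, 1.0)          # drawn once, then treated as a constant

# Central finite differences with eps held fixed.
h = 1e-6
dz_dmu = (reparameterize(mu + h, sigma, eps)
          - reparameterize(mu - h, sigma, eps)) / (2 * h)
dz_dsigma = (reparameterize(mu, sigma + h, eps)
             - reparameterize(mu, sigma - h, eps)) / (2 * h)

# Empirical check that z is distributed as N(mu, sigma^2).
samples = [reparameterize(mu, sigma, rng.gauss(0.0, 1.0)) for _ in range(50000)]
mean = sum(samples) / len(samples)
std = (sum((s - mean) ** 2 for s in samples) / len(samples)) ** 0.5

print(dz_dmu, dz_dsigma, eps)
print(mean, std)
```

The gradient with respect to σ is whatever ε happened to be drawn, which is why a fresh ε per example gives a noisy but unbiased training signal.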

ALGORITHM 4

Deriving the ELBO from log p_θ(x)

Start with the log marginal likelihood

log p_θ(x) = log ∫ p_θ(x, z) dz

Intractable as-is because of the integral over z.

Introduce a variational distribution q_φ(z | x)

Multiply and divide by q_φ(z | x) inside the integral.

log p_θ(x) = log ∫ q_φ(z|x) · [p_θ(x, z) / q_φ(z|x)] · dz

Apply Jensen's inequality

log is concave, so log E[Y] ≥ E[log Y]:

log p_θ(x) ≥ E_q[log p_θ(x, z) − log q_φ(z | x)]

Factorize the joint

p_θ(x, z) = p_θ(x | z) · p(z)

Regroup terms

ELBO = E_q[log p_θ(x | z)] + E_q[log p(z) − log q_φ(z | x)]

Recognize the KL divergence

ELBO = E_q[log p_θ(x | z)] − KL(q_φ(z | x) ‖ p(z))

Interpret the two terms

First term = reconstruction. Second term = regulariser. The VAE loss is the negative ELBO, so maximising the ELBO ⇔ minimising the VAE loss.
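The bound can be checked numerically on a toy model where the evidence is known. The sketch below assumes a one-dimensional conjugate model chosen for this illustration only: p(z) = N(0, 1) and p(x | z) = N(z, 1), so that p(x) = N(0, 2) and the true posterior is N(x/2, ½). It estimates E_q[log p(x, z) − log q(z | x)] by Monte Carlo for two choices of q.

```python
import math
import random

def log_normal(v, mean, var):
    """Log density of N(mean, var) evaluated at v."""
    return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)

def elbo_mc(x, m, s2, n=20_000, seed=0):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z)] for q = N(m, s2)."""
    rng = random.Random(seed)
    s = math.sqrt(s2)
    total = 0.0
    for _ in range(n):
        z = m + s * rng.gauss(0.0, 1.0)
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(z) + log p(x|z)
        total += log_joint - log_normal(z, m, s2)
    return total / n

x = 1.5
log_evidence = log_normal(x, 0.0, 2.0)        # exact log p(x) for this toy model
elbo_loose = elbo_mc(x, m=0.0, s2=1.0)        # q set to the prior
elbo_tight = elbo_mc(x, m=x / 2, s2=0.5)      # q set to the true posterior

print(f"log p(x)          = {log_evidence:.4f}")
print(f"ELBO, q = prior   = {elbo_loose:.4f}")
print(f"ELBO, q = posterior = {elbo_tight:.4f}")
```

When q equals the true posterior the integrand is the same constant, log p(x), for every sampled z, so the estimate matches the evidence up to floating-point error. Any other q leaves a gap.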

ALGORITHM 5

Generating new samples from a trained VAE

Discard the encoder

For generation you only need f_θ.

Sample from the prior

z ~ N(0, I)

Decode

x̂ = f_θ(z)

Repeat for as many samples as you need

Each fresh z draws a fresh sample.
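As a sketch, generation needs only a decoder and a source of standard-normal noise. The decoder below is a hypothetical stand-in (a single linear layer with made-up weights `W` and `B`, followed by a sigmoid); with a trained model, `decode` would be the real f_θ.

```python
import math
import random

Z_DIM = 2
# Hypothetical decoder weights, standing in for a trained f_theta.
W = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8], [0.2, 0.2]]
B = [0.0, 0.1, -0.1, 0.0]

def decode(z):
    """Linear layer plus sigmoid, so every output lies in (0, 1)."""
    outputs = []
    for row, b in zip(W, B):
        pre_activation = sum(w * zi for w, zi in zip(row, z)) + b
        outputs.append(1.0 / (1.0 + math.exp(-pre_activation)))
    return outputs

def generate(n_samples, rng):
    """Sample z ~ N(0, I) and decode; the encoder is never called."""
    samples = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(Z_DIM)]
        samples.append(decode(z))
    return samples

samples = generate(3, random.Random(0))
for s in samples:
    print([round(v, 3) for v in s])
```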

ALGORITHM 6

Computing KL(q_φ(z|x) ‖ N(0, I)) in closed form

Inputs from the encoder

A vector of means μ ∈ ℝ^J and a vector of log-variances log σ² ∈ ℝ^J.

Per-dimension contribution

kl_j = −½ · (1 + log σ²_j − μ²_j − σ²_j)

Sum across dimensions

KL = Σ_{j=1}^{J} kl_j

Average across the batch

Mean over examples in the minibatch. Add to the reconstruction loss.
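The four steps translate almost line for line into code. This is a minimal plain-Python sketch; the function names are chosen for this example, and a framework implementation would do the same arithmetic on tensors.

```python
import math

def kl_per_dim(mu, logvar):
    """Per-dimension contribution: -0.5 * (1 + log s2 - mu^2 - s2)."""
    return -0.5 * (1 + logvar - mu ** 2 - math.exp(logvar))

def kl_example(mus, logvars):
    """Sum across the J latent dimensions of one example."""
    return sum(kl_per_dim(m, lv) for m, lv in zip(mus, logvars))

def kl_batch_mean(batch_mu, batch_logvar):
    """Average the per-example KL across the minibatch."""
    per_example = [kl_example(m, lv) for m, lv in zip(batch_mu, batch_logvar)]
    return sum(per_example) / len(per_example)

batch_mu = [[0.0, 0.0], [1.0, -1.0]]
batch_logvar = [[0.0, 0.0], [0.0, 0.0]]
print(kl_batch_mean(batch_mu, batch_logvar))
```

The first example matches the prior exactly and contributes zero; the second has unit variance but shifted means, so each dimension contributes ½ · μ².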

§ 29Mathematical derivations — Q&A

Worked-out questions that appear in textbooks, PhD qualifying exams, and "explain it on a whiteboard" interviews.

Q1

Derive the ELBO starting from log p_θ(x).

Derivation

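Using the identity log p_θ(x) = log p_θ(x) · 1 with 1 = ∫ q_φ(z | x) dz, and then p_θ(x) = p_θ(x, z) / p_θ(z | x):

log p_θ(x) = E_q[log p_θ(x)]
= E_q[log (p_θ(x, z) / p_θ(z | x))]
= E_q[log (p_θ(x, z) / q_φ(z | x))] + E_q[log (q_φ(z | x) / p_θ(z | x))]

The first term is the ELBO. The second is a non-negative KL. So: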

log p_θ(x) = ELBO(φ, θ; x) + KL(q_φ(z | x) ‖ p_θ(z | x))

Consequences: (i) ELBO ≤ log p_θ(x). (ii) Maximising ELBO over φ tightens the bound. (iii) Joint maximisation over (φ, θ) is reasonable approximate maximum likelihood.

Q2

Derive the closed-form KL between q = N(μ, σ²·I) and p = N(0, I).

Derivation

For univariate q = N(μ, σ²) and p = N(0, 1):

KL(q ‖ p) = E_q[log q(z)] − E_q[log p(z)]

For a Gaussian, E_q[log q(z)] = −½ · (log(2πσ²) + 1). And E_q[log p(z)] = −½·log(2π) − ½·(μ² + σ²) using E[z²] = μ² + σ². Subtract:

KL = −½ · (1 + log σ² − μ² − σ²)

For J-dimensional diagonal q, sum over dimensions:

KL = −½ · Σ_{j=1}^{J} (1 + log σ²_j − μ²_j − σ²_j)
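One way to sanity-check the univariate result is to compare it with a direct Monte Carlo estimate of E_q[log q(z) − log p(z)]. The sketch below does this for one arbitrary choice of μ and σ²; the sample size and seed are assumptions of the example.

```python
import math
import random

def kl_closed_form(mu, sigma2):
    """KL(N(mu, sigma2) || N(0, 1)) from the derivation above."""
    return -0.5 * (1 + math.log(sigma2) - mu ** 2 - sigma2)

def kl_monte_carlo(mu, sigma2, n=100_000, seed=0):
    """Estimate E_q[log q(z) - log p(z)] by sampling z ~ q."""
    rng = random.Random(seed)
    sigma = math.sqrt(sigma2)
    total = 0.0
    for _ in range(n):
        z = mu + sigma * rng.gauss(0.0, 1.0)
        log_q = -0.5 * (math.log(2 * math.pi * sigma2) + (z - mu) ** 2 / sigma2)
        log_p = -0.5 * (math.log(2 * math.pi) + z ** 2)
        total += log_q - log_p
    return total / n

mu, sigma2 = 1.0, 0.25
exact = kl_closed_form(mu, sigma2)
estimate = kl_monte_carlo(mu, sigma2)
print(f"closed form = {exact:.4f}, Monte Carlo = {estimate:.4f}")
```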

Q3

Why is sampling z ~ N(μ, σ²) directly not differentiable, but z = μ + σ · ε is?

Answer

Direct sampling: z comes from a random oracle, not a function. There's no analytical map between (μ, σ) and the observed z to differentiate.

Reparameterized: z = μ + σ·ε with ε ~ N(0, 1) is a deterministic function of (μ, σ, ε). The random oracle was called before we touched μ and σ.

∂z / ∂μ = 1, ∂z / ∂σ = ε

Both well-defined, so gradients flow from L through z back into μ and σ, then into encoder weights φ.

Q4

Show that maximising the ELBO is equivalent to minimising KL(q_φ(z|x) ‖ p_θ(z|x)) when θ is fixed.

Derivation

From Q1: log p_θ(x) = ELBO + KL(q_φ(z|x) ‖ p_θ(z|x)).

For fixed θ, the LHS is constant. So increasing ELBO by Δ must decrease the KL gap by exactly Δ. Therefore:

arg max_φ ELBO = arg min_φ KL(q_φ(z|x) ‖ p_θ(z|x))

This is what makes the ELBO a tractable surrogate.
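The equivalence can be seen on a model where the true posterior is available in closed form. The sketch below assumes a toy conjugate model picked for this illustration (p(z) = N(0, 1), p(x | z) = N(z, 1), hence p(z | x) = N(x/2, ½)) and scans a small grid of Gaussian q = N(m, s²).

```python
import math

def log_evidence(x):
    """log p(x) with p(x) = N(0, 2) for the toy model."""
    return -0.5 * (math.log(4 * math.pi) + x * x / 2)

def elbo(x, m, s2):
    """E_q[log p(x|z)] - KL(q || p(z)), both in closed form."""
    expected_loglik = -0.5 * (math.log(2 * math.pi) + (x - m) ** 2 + s2)
    kl_to_prior = -0.5 * (1 + math.log(s2) - m * m - s2)
    return expected_loglik - kl_to_prior

def kl_to_posterior(x, m, s2):
    """KL(N(m, s2) || N(x/2, 1/2)), the gap between ELBO and log p(x)."""
    post_mean, post_var = x / 2, 0.5
    return 0.5 * (math.log(post_var / s2) + (s2 + (m - post_mean) ** 2) / post_var - 1)

x = 1.0
grid = [(i * 0.25, s2) for i in range(-8, 9) for s2 in (0.1, 0.25, 0.5, 1.0, 2.0)]

# ELBO + KL gap should equal log p(x) for every q on the grid.
max_identity_error = max(
    abs(elbo(x, m, s2) + kl_to_posterior(x, m, s2) - log_evidence(x)) for m, s2 in grid
)
best_by_elbo = max(grid, key=lambda q: elbo(x, *q))
best_by_kl = min(grid, key=lambda q: kl_to_posterior(x, *q))
print(max_identity_error, best_by_elbo, best_by_kl)
```

Both searches land on the same q, the true posterior, because the two quantities always sum to the same constant.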

Q5

Why predict log σ² rather than σ?

Answer

(i) Positivity for free. Variance must be positive. log-space removes the constraint.

(ii) Numerical stability. log σ² ∈ (−∞, ∞); tiny variances become moderate negative numbers instead of vanishing floats.

(iii) Cleaner KL term. The closed-form KL contains log σ² directly.

Implementation: σ = exp(½ · log σ²) when σ is needed.
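Points (i) and (ii) can be illustrated in a few lines. This sketch uses Python's double-precision floats; the specific numbers are arbitrary choices for the example.

```python
import math

def sigma_from_logvar(logvar):
    """sigma = exp(0.5 * log sigma^2); positive for any real input."""
    return math.exp(0.5 * logvar)

# (i) Positivity for free: any real-valued head output gives sigma > 0.
sigmas = [sigma_from_logvar(lv) for lv in (-20.0, -2.0, 0.0, 3.0)]

# (ii) Stability: a tiny sigma squared underflows to exactly 0.0,
# while the same quantity in log-space is an ordinary float.
tiny_sigma = 1e-200
variance = tiny_sigma * tiny_sigma      # underflows in double precision
logvar = 2 * math.log(tiny_sigma)       # roughly -921

print(sigmas)
print(variance, logvar, sigma_from_logvar(logvar))
```

Taking the log of the underflowed `variance` would fail, whereas `logvar` round-trips back to the original σ.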

Q6

Why does the reparameterized gradient have lower variance than the score-function (REINFORCE) gradient?

Answer

Reparameterized: ∇_φ L = E_ε[ ∇_φ ℓ(x, x̂(ε, μ_φ, σ_φ)) ]. Differentiates the loss function.

Score-function: ∇_φ L = E_{z ~ q_φ}[ ℓ(x, z) · ∇_φ log q_φ(z|x) ]. Multiplies the gradient of log q by the raw loss value — which is noisy.

In practice the reparameterized estimator usually has much lower variance, often by orders of magnitude, which is why VAE training works with just one Monte Carlo sample per data point. This is an empirical observation rather than a guarantee for every loss.
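The difference shows up even on a toy problem. The sketch below is not a VAE: it assumes the stand-in objective E_{z ~ N(μ, σ²)}[z²], whose exact gradient with respect to μ is 2μ, and compares single-sample gradient estimates from both estimators.

```python
import random
import statistics

def gradient_samples(mu, sigma, n, seed=0):
    """Single-sample estimates of d/dmu E[z^2] from both estimators."""
    rng = random.Random(seed)
    reparam, score = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        # Reparameterized: differentiate (mu + sigma * eps)^2 w.r.t. mu.
        reparam.append(2.0 * z)
        # Score function: f(z) * d/dmu log N(z; mu, sigma^2).
        score.append(z * z * (z - mu) / sigma ** 2)
    return reparam, score

mu, sigma = 1.0, 1.0
reparam, score = gradient_samples(mu, sigma, n=50_000)

mean_reparam, mean_score = statistics.fmean(reparam), statistics.fmean(score)
var_reparam, var_score = statistics.variance(reparam), statistics.variance(score)

print(f"true gradient        = {2 * mu}")
print(f"reparameterized: mean={mean_reparam:.3f} var={var_reparam:.2f}")
print(f"score function:  mean={mean_score:.3f} var={var_score:.2f}")
```

Both estimators are unbiased, so their means agree with 2μ, but the score-function estimates are spread far more widely because each one is scaled by the raw value z².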

Q7

Compute E[z²] for z ~ N(μ, σ²).

Derivation

Var(z) = E[z²] − (E[z])²
⇒ E[z²] = Var(z) + (E[z])² = σ² + μ²

This identity is the workhorse of the KL derivation in Q2.

Q8

Why is KL asymmetric? What happens if we use KL(p ‖ q) instead?

Answer

KL is asymmetric because q · log(q/p) weights by q's density, not p's.

Reverse KL, KL(q ‖ p), is "mode-seeking": q concentrates on a single mode of p. This is what VAEs use, with p the true posterior p_θ(z | x).
Forward KL, KL(p ‖ q), is "mean-seeking" (also called mass-covering): q spreads out to cover everything p has mass on.

Reverse KL falls out naturally from the ELBO. Using forward KL would be intractable — it would require sampling from the true posterior, which is what we don't know how to do.
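The two behaviours can be reproduced with a small grid search. The sketch below assumes a made-up bimodal target p (an equal mixture of N(−3, 0.5²) and N(3, 0.5²)) as a stand-in for a multi-modal posterior, approximates both divergences with a Riemann sum, and finds the best single Gaussian q under each.

```python
import math

def normal_pdf(z, mean, std):
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def target_pdf(z):
    """Bimodal stand-in for p: two narrow modes at -3 and +3."""
    return 0.5 * normal_pdf(z, -3.0, 0.5) + 0.5 * normal_pdf(z, 3.0, 0.5)

STEP = 0.02
GRID = [-10.0 + STEP * i for i in range(1001)]
P = [target_pdf(z) for z in GRID]

def kl_pair(mean, std):
    """Riemann-sum estimates of (KL(q || p), KL(p || q)) for q = N(mean, std^2)."""
    reverse = forward = 0.0
    for z, p in zip(GRID, P):
        q = normal_pdf(z, mean, std)
        reverse += q * math.log(q / p) * STEP   # weighted by q's density
        forward += p * math.log(p / q) * STEP   # weighted by p's density
    return reverse, forward

# Candidate q: means from -3 to 3, standard deviations from 0.5 to 4.
candidates = [(-3.0 + 0.5 * i, 0.5 * j) for i in range(13) for j in range(1, 9)]
scores = {c: kl_pair(*c) for c in candidates}

best_reverse = min(scores, key=lambda c: scores[c][0])
best_forward = min(scores, key=lambda c: scores[c][1])
print("reverse KL picks (mean, std):", best_reverse)
print("forward KL picks (mean, std):", best_forward)
```

Reverse KL settles on one narrow mode, while forward KL prefers a wide Gaussian centred between the modes, where p itself has almost no mass.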

Q9

If the decoder distribution is Gaussian with fixed variance, what does the reconstruction loss reduce to?

Derivation

For p_θ(x | z) = N(f_θ(z), σ²_dec · I):

log p_θ(x | z) = const − (1 / (2σ²_dec)) · ‖x − f_θ(z)‖²

So arg max log p_θ(x | z) = arg min ‖x − f_θ(z)‖² — mean-squared error.

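σ²_dec becomes an implicit β-like weight between reconstruction and KL: multiplying the loss by 2σ²_dec gives ‖x − f_θ(z)‖² + 2σ²_dec · KL, so plain squared error with weight β corresponds to σ²_dec = β / 2.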
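A short check of the reduction, as a sketch: the negative log-likelihood under a fixed-variance Gaussian decoder differs from the scaled squared error only by a constant, so ranking reconstructions by either gives the same order. The vectors and the value of `var_dec` are arbitrary.

```python
import math

def squared_error(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def gaussian_nll(x, x_hat, var_dec):
    """-log N(x; x_hat, var_dec * I) = const + ||x - x_hat||^2 / (2 * var_dec)."""
    const = 0.5 * len(x) * math.log(2 * math.pi * var_dec)
    return const + squared_error(x, x_hat) / (2 * var_dec)

x = [1.0, 0.0, 0.5]
recon_good = [0.9, 0.1, 0.5]
recon_poor = [0.5, 0.5, 0.5]
var_dec = 0.5

nll_gap = gaussian_nll(x, recon_poor, var_dec) - gaussian_nll(x, recon_good, var_dec)
mse_gap = (squared_error(x, recon_poor) - squared_error(x, recon_good)) / (2 * var_dec)
print(nll_gap, mse_gap)   # the constant cancels, so the two gaps agree
```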

Q10

For binary pixel data, derive the BCE form of the reconstruction loss.

Derivation

For p_θ(x | z) = ∏_i Bernoulli(x_i; π_i) with π = f_θ(z):

log p_θ(x | z) = Σ_i [ x_i · log π_i + (1 − x_i) · log(1 − π_i) ]

Negating gives binary cross-entropy:

L_BCE = −Σ_i [ x_i · log π_i + (1 − x_i) · log(1 − π_i) ]
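In code the formula is a direct sum over pixels. The sketch below adds one thing that is not part of the derivation: probabilities are clamped to [`eps`, 1 − `eps`] (a common safeguard, assumed here) so that log(0) is never evaluated.

```python
import math

def bce(x, pi, eps=1e-7):
    """Binary cross-entropy summed over pixels.

    x  : target pixel values in [0, 1]
    pi : decoder outputs, interpreted as Bernoulli probabilities
    eps: clamp to keep log() finite (an implementation assumption)
    """
    total = 0.0
    for x_i, p_i in zip(x, pi):
        p_i = min(max(p_i, eps), 1.0 - eps)
        total -= x_i * math.log(p_i) + (1.0 - x_i) * math.log(1.0 - p_i)
    return total

confident_right = bce([1.0, 0.0], [0.9, 0.1])
confident_wrong = bce([1.0, 0.0], [0.1, 0.9])
print(confident_right, confident_wrong)
```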

Q11

What happens to the ELBO at β = 0 and at β → ∞?

Answer

β = 0: KL disappears. Encoder collapses to a Dirac delta. Model becomes a standard autoencoder. Generation from N(0, I) produces garbage.

β → ∞: Only KL matters. Optimal q is the prior, regardless of x — posterior collapse. Reconstruction drops to the level of decoding pure noise.

Sweet spot: β ∈ [0.1, 4] depending on dataset and decoder capacity.

Q12

Evaluate the per-dimension KL contribution at μ = 0, σ = 1.

Derivation

kl = −½ · (1 + log 1 − 0 − 1) = 0

Zero when q matches the prior exactly — the attractor that the regulariser pulls every q(z|x) toward.

§ 30Interview questions — quick answers

The conceptual questions interviewers ask. Short, complete answers — the kind you'd give in 60 seconds at a whiteboard.

i.01

In one sentence, what's the difference between an autoencoder and a variational autoencoder?

An autoencoder maps each input to a point in latent space; a VAE maps each input to a distribution over latent space and adds a KL regulariser that pulls those distributions toward a fixed prior — which is what makes the latent space sampleable for generation.

i.02

Why can a VAE generate but a standard AE cannot?

Because the VAE's KL term forces every q(z|x) to look like a known distribution — typically N(0, I). So you can sample fresh z ~ N(0, I) at test time and feed it to the decoder, and that z will land in the same region of latent space the decoder was trained on. With a standard AE you have no idea what distribution to sample from.

i.03

What is amortized inference, and why does the VAE use it?

Instead of learning a separate posterior q_i(z) for each data point, a single shared neural network q_φ(z | x) is trained to output posterior parameters for any input. You "amortize" the cost of inference over the whole dataset by paying once to train one network, so inference for a new x is a single forward pass rather than a fresh optimisation.

i.04

What is posterior collapse? How do you detect it and fix it?

What: the encoder collapses to q(z|x) ≈ p(z) for every x — KL ≈ 0, reconstruction stays bad.

Detect: log per-dimension KL during training. If near-zero with poor reconstruction, that's collapse.

Fix: KL annealing; free bits; weaker decoder; InfoVAE/MMD; richer posterior (normalizing flows).
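The detection step and the first fix can be sketched in plain Python. The `threshold` and `warmup_steps` values below are placeholder assumptions, not recommended settings, and the function names are chosen for this example.

```python
import math

def per_dim_kl(batch_mu, batch_logvar):
    """Batch-mean KL for each latent dimension separately."""
    n, dims = len(batch_mu), len(batch_mu[0])
    totals = [0.0] * dims
    for mus, logvars in zip(batch_mu, batch_logvar):
        for j, (m, lv) in enumerate(zip(mus, logvars)):
            totals[j] += -0.5 * (1 + lv - m * m - math.exp(lv))
    return [t / n for t in totals]

def collapsed_dims(kl_values, threshold=0.01):
    """Indices of dimensions whose KL is close to zero (q matches the prior)."""
    return [j for j, kl in enumerate(kl_values) if kl < threshold]

def beta_schedule(step, warmup_steps=1000, beta_max=1.0):
    """Linear KL annealing: beta ramps from 0 up to beta_max."""
    return beta_max * min(1.0, step / warmup_steps)

# Dimension 0 carries information; dimension 1 sits at the prior.
batch_mu = [[1.0, 0.0], [-1.0, 0.0]]
batch_logvar = [[0.0, 0.0], [0.0, 0.0]]
kl_values = per_dim_kl(batch_mu, batch_logvar)
print(kl_values, collapsed_dims(kl_values))
print([beta_schedule(s) for s in (0, 500, 1000, 5000)])
```

If every dimension ends up in the collapsed list, the encoder is contributing nothing; a single inactive dimension, as here, is normal.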

i.05

Why are VAE samples often blurry?

Two reasons. (i) Pixel-wise Gaussian/Bernoulli decoders optimise mode-averaging losses — the optimal MSE answer when multiple valid reconstructions exist is the pixel-wise mean (blurry). (ii) Reparameterized sampling adds noise; the decoder smooths to be robust. Fixes: richer output distributions (PixelCNN), perceptual/adversarial losses, VQ-VAE with autoregressive priors.

i.06

VAE vs GAN — when do you prefer each?

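VAE: meaningful latent space (interpolation, anomaly detection), probabilistic interpretation, easy training, posterior access. GAN: maximum visual quality, no explicit likelihood needed. Hybrids such as VQ-GAN add an adversarial loss to a VQ-VAE-style autoencoder to get the best of both; VQ-VAE-2 instead closes the quality gap with an autoregressive prior and no discriminator.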

i.07

VAE vs Diffusion models — how are they related?

A diffusion model can be viewed as a hierarchical VAE with a fixed encoder (forward noising) and a learned decoder (denoising network) at every timestep. Both optimise a variational lower bound. Stable Diffusion uses a VAE to compress to a small latent space, then runs diffusion inside it.

i.08

Why is the diagonal-covariance Gaussian assumption okay for q(z | x)?

It assumes latent dimensions are conditionally independent given x. Loss in expressiveness is small for typical data, gain in tractability is huge: covariance is a vector (J floats) instead of a matrix (J²). For richer posteriors, switch to normalizing flows or full covariance.

i.09

What does a Conditional VAE add?

Both encoder and decoder receive an extra conditioning input y (e.g. a class label). Loss: L = L_recon(x, x̂ | y) + KL(q(z | x, y) ‖ p(z | y)). Sample z ~ p(z | y) at generation time and decode with the desired y — controllable generation.

i.10

If I increase the latent dimension z_dim from 2 to 100, what changes?

Reconstruction improves (more channels through bottleneck). Total KL tends to grow because it is a sum over dimensions, though usually less than linearly in z_dim. Visualisation gets harder. Per-dim KL becomes unevenly used: some dimensions encode structure, others stay at the prior.

i.11

Can VAEs work with discrete latent variables?

Not directly with reparameterization. Workarounds: Gumbel-Softmax (continuous relaxation); VQ-VAE (vector-quantize z, straight-through gradients); REINFORCE (high variance). VQ-VAE became dominant.

i.12

How would you debug a VAE that produces only the average of all training images?

Posterior collapse with strong KL. Checklist: (1) Inspect per-dim KL — lower β or use annealing if all zero. (2) Check decoder's z-weights — if near zero, decoder ignores z. (3) Reduce decoder capacity. (4) Use free bits. (5) Try a richer posterior.

i.13

What if we chose a different prior than N(0, I)?

Mixture-of-Gaussians prior: better for multi-modal data. VampPrior: learned mixture, often outperforms standard. Normalizing flow prior: arbitrarily flexible. Standard normal is the default because closed-form KL is free.

i.14

Walk through a single training step at the tensor level.

Batch x: [B, 784]. (1) Encoder body: [B, 784] → [B, 256]. (2) Two heads: [B, 256] → [B, 20] each. (3) Sample ε: [B, 20]. (4) z = μ + exp(½·log σ²)·ε. (5) Decoder: [B, 20] → [B, 784], sigmoid. (6) BCE summed across pixels, KL summed across dims, both mean over batch. (7) Backward, step.

i.15

What is the "evidence" in "evidence lower bound"?

The evidence is the marginal likelihood of the data: p_θ(x) = ∫ p_θ(x, z) dz. ELBO is a tractable function that is a lower bound on the log of that evidence.

i.16

How would you use a VAE for anomaly detection?

Train on normal data only. At test time, flag an input as anomalous when it shows (i) high reconstruction loss; (ii) low likelihood under the prior (encoded μ far from the origin); or (iii) high KL. Combine the three scores or use just (i) for a clean threshold. Works for fraud, defect inspection, network intrusion.
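A minimal sketch of option (i), thresholding on reconstruction error. The scores below are made-up numbers standing in for errors measured on held-out normal data, and the 0.9 quantile is an arbitrary choice for the example; in practice both come from a trained model and a validation set.

```python
def recon_error(x, x_hat):
    """Squared reconstruction error for one input."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def fit_threshold(normal_scores, quantile=0.9):
    """Pick a threshold from scores observed on normal validation data."""
    ranked = sorted(normal_scores)
    index = min(len(ranked) - 1, int(quantile * len(ranked)))
    return ranked[index]

def is_anomaly(score, threshold):
    return score > threshold

# Hypothetical reconstruction errors on normal validation inputs.
normal_scores = [0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2, 0.95, 1.05, 1.15]
threshold = fit_threshold(normal_scores)

print(threshold)
print(is_anomaly(recon_error([1.0, 0.0], [0.9, 0.1]), threshold))   # small error
print(is_anomaly(recon_error([1.0, 0.0], [-1.0, 2.0]), threshold))  # large error
```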