Deep learning · Generative models

Autoencoders & variational autoencoders, animated.

A working tour of every operation inside an autoencoder — encoder, bottleneck, decoder, distributions, the reparameterization trick, KL divergence, and the ELBO — built so each step can be watched and replayed.

14 interactive demos, every step animated · Approx. 50 min read · Click any widget to play / drag any slider
Part I
Autoencoders
A neural network that squeezes itself: copy your input back out, but pass it through a deliberately narrow middle. What survives the squeeze is the model's idea of "important".

§ 01 Why compress at all?

A 28×28 grayscale image of a hand-written digit lives in a 784-dimensional space — one number per pixel. But the set of valid digits is much, much smaller than 784 dimensions allow. The space of "looks-like-a-real-digit" is a thin, curved manifold inside a vast volume of pixel noise.

An autoencoder's job is to find a compact coordinate system for your data, one whose size is close to the data's intrinsic dimensionality. If valid digits really only need ten or twenty meaningful axes of variation, an autoencoder should be able to learn those axes from data alone.

If you can compress a digit down to, say, 2 numbers without losing what makes it a digit, three useful things follow. First, you have a visualization — paint each digit as a point in 2D. Second, you have a compression scheme — store 2 floats instead of 784. Third, and most interestingly, you have an implicit model of the data: pick any 2 numbers, decode them, and (with the variational version we will build in Part II) something digit-like comes out.

Demo 01 · The pixel space vs the digit manifold
Drag the slider to morph from a real digit toward random pixel noise — both points live in the same 784-D space.
Slider: step into noise, 0% to 100%. At 0% you are on the digit manifold; at 100% you are at a random point in pixel space.
The "valid-digit" manifold is a vanishingly thin sliver of pixel space. An autoencoder learns coordinates along this sliver and discards everything else.

This is the central premise: meaningful data lives on a low-dimensional manifold inside a high-dimensional ambient space, and a neural network can learn that manifold by being forced to bottleneck.

§ 02 Anatomy of an autoencoder

Three pieces, in order: an encoder hφ that maps an input x down to a low-dimensional code z; a bottleneck where the entire input is forced to live in that compressed code; and a decoder fθ that maps z back up to a reconstruction x̂ in the original space. The two networks are trained jointly to minimize the difference between x and x̂.

$$x \;\to\; h_\phi(x) \;\to\; z \;\to\; f_\theta(z) \;\to\; \hat{x}$$
Two neural nets stacked back-to-back, with a forced narrow waist in the middle.
Demo 02 · The full architecture, animated
A single input flows through every layer. Press play to send a pulse through the network: the signal travels left → right through the encoder, contracts into the bottleneck, then expands back through the decoder.

The widget above is the mental model for every autoencoder variant in this post. Everything that follows — denoising, sparse, variational — only changes what goes into the bottleneck or how it is regularized. The skeleton is constant.

§ 03 The encoder, layer by layer

Each encoder layer is the standard machinery: a learned linear projection, a bias, and a non-linearity (often ReLU or tanh). The dimensionality shrinks from layer to layer — for MNIST a typical descent is 784 → 256 → 64 → z_dim. For images, the linear layers are often replaced by strided convolutions that combine spatial down-sampling with feature extraction in a single step.

$$h_\ell = \sigma(W_\ell \, h_{\ell-1} + b_\ell)$$
σ is a non-linearity. Each W_ℓ shrinks the dimensionality at its layer.
Demo 03 · Inside the encoder: watch activations flow
Press play to see a single batch pass through. Each layer fires after the previous one finishes.
Each rectangle is one fully-connected layer; the small circle inside it is the activation magnitude. As the signal passes, layers light up in sequence and the bottleneck collects what survived.

The encoder is doing lossy compression by gradient descent. Whatever survives the bottleneck must reconstruct the input well enough that the decoder can do its job. So the encoder learns to keep what matters for reconstruction and throw the rest away.
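The layer equation above can be sketched in plain Python. This is a minimal illustration, not a trained model: the weights `W`, the bias `b`, and the helper names `relu` and `dense_layer` are assumptions made up for the example, and the sizes (4 → 2) are far smaller than the 784 → 256 → 64 → z_dim descent mentioned above.

```python
def relu(v):
    """Element-wise ReLU non-linearity."""
    return [max(0.0, a) for a in v]


def dense_layer(W, b, h_prev):
    """One encoder layer: h = relu(W · h_prev + b).

    W is a list of rows, so len(W) is the output size and
    len(W[0]) is the input size.
    """
    pre_activation = [
        sum(w_ij * h_j for w_ij, h_j in zip(row, h_prev)) + b_i
        for row, b_i in zip(W, b)
    ]
    return relu(pre_activation)


# Hypothetical weights: a 4-dimensional input is squeezed to 2 dimensions.
W = [[0.5, -0.5, 0.0, 1.0],
     [-1.0, 0.0, 0.25, 0.0]]
b = [0.0, 0.1]
x = [1.0, 2.0, 4.0, 0.5]

h = dense_layer(W, b, x)
print(h)  # a 2-element vector: the output is narrower than the input
```

Stacking several such calls, each with fewer rows in `W` than the one before, gives the shrinking descent the section describes.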

§ 04 The bottleneck

The bottleneck is just a vector, typically a handful of floats. For a typical image autoencoder, z ∈ ℝ³² or z ∈ ℝ⁶⁴. For visualization-oriented examples (like the ones in this post) you can squeeze all the way to z ∈ ℝ² so you can see the latent space on a flat plot.

Demo 04 · Pick a bottleneck size
Slide to change how much the encoder is allowed to keep. Watch the reconstruction quality (approximate, illustrative, modelled after MNIST behaviour) change in step.
Slider: bottleneck size, the number of dimensions allowed through the narrow waist. Panels: original; reconstruction at this bottleneck.
With a generous bottleneck the network can store the image almost verbatim; with a tight bottleneck it has to keep only the most informative axes and reconstruct the rest from learned priors.

The bottleneck size is a deliberate design choice. Too wide and you risk the network just memorising the input (the identity function is trivial to learn when the bottleneck is as wide as the input). Too narrow and reconstruction suffers. The "interesting" range is typically 10–100× narrower than the input.

§ 05 The decoder

The decoder is the encoder run in reverse — a series of layers that expand the code z back up to the original input shape. For image data, transposed convolutions are common; for tabular or flattened input, plain dense layers suffice.

$$\hat{x} = f_\theta(z) = \sigma\big(W'_L \, \sigma(W'_{L-1} \cdots \sigma(W'_1 z + b'_1) \cdots) + b'_L\big)$$
A mirror of the encoder. Final activation is usually sigmoid for [0,1] pixel data, or linear for unbounded outputs.

One trick worth noting: the encoder and decoder do not need to be perfect mirrors of each other. Many implementations use a deeper decoder than encoder, or share no weights at all between them. They only need to compose well into something that returns the input.

§ 06 Reconstruction loss

An autoencoder is trained by minimizing how different the output is from the input. For continuous data, the canonical choice is mean-squared error; for binary or [0,1]-bounded pixel data, binary cross-entropy is more common.

$$L_{\text{MSE}} = \frac{1}{N} \sum_i \lVert x_i - f_\theta(h_\phi(x_i)) \rVert^2$$
$$L_{\text{BCE}} = -\sum_{i,j} \big[ x_{ij} \log \hat{x}_{ij} + (1 - x_{ij}) \log (1 - \hat{x}_{ij}) \big]$$
MSE assumes Gaussian observation noise; BCE assumes Bernoulli (each pixel is "on or off" with probability x̂).

Both losses are pixel-wise. They treat each output position independently and ask: "how close is the predicted value to the target value?" This is also one of the limitations of vanilla autoencoders for image data — pixel-wise loss is blind to perceptual mismatch (a one-pixel shift is "wrong everywhere" by MSE but obviously fine to a human). Modern systems often supplement with perceptual or adversarial losses, but for our purposes pixel-wise loss is enough.
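To make the blind spot concrete, here is a small sketch in plain Python. The 6-pixel "images" and the `blurry` prediction are made-up values for illustration; the point is only that both pixel-wise losses score a sharp but one-pixel-shifted stroke worse than a washed-out guess.

```python
import math


def mse(x, x_hat):
    """Mean squared error over pixels."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def bce(x, x_hat, eps=1e-7):
    """Binary cross-entropy summed over pixels; x_hat is clipped away from 0 and 1."""
    total = 0.0
    for a, b in zip(x, x_hat):
        b = min(max(b, eps), 1.0 - eps)
        total -= a * math.log(b) + (1.0 - a) * math.log(1.0 - b)
    return total


# A toy 1-D "image": a single bright pixel.
target = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
shifted = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # same stroke, moved one pixel
blurry = [0.1, 0.4, 0.1, 0.1, 0.1, 0.1]   # a hedge-everything prediction (made up)

print("MSE shifted:", mse(target, shifted))
print("MSE blurry: ", mse(target, blurry))
print("BCE shifted:", bce(target, shifted))
print("BCE blurry: ", bce(target, blurry))
```

Under both losses the blurry prediction wins, which is one way to see why autoencoders trained on pixel-wise losses tend toward soft reconstructions.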

§ 07 Training loop, animated

Training is the standard SGD ritual: forward pass, compute loss, backward pass, update weights. The only special thing about an autoencoder is that the input and the target are the same tensor — you're not supervising against external labels, you're supervising the model against itself.

Demo 05 · Training over time
A toy model trains live — watch the loss fall and the reconstruction sharpen.
Readouts: epoch (progress through 30 simulated epochs); reconstruction loss (average MSE per pixel, simulated). Panels: target; current reconstruction.
# Vanilla autoencoder training loop in PyTorch
# (model, dataloader, optimizer and num_epochs are assumed to be defined elsewhere)
import torch.nn.functional as F

for epoch in range(num_epochs):
    for batch in dataloader:
        x = batch[0]                   # input doubles as target
        x_hat = model(x)               # forward through encoder→bottleneck→decoder
        loss = F.mse_loss(x_hat, x)    # pixel-wise reconstruction
        optimizer.zero_grad()
        loss.backward()                # gradients flow back through both networks
        optimizer.step()               # update φ and θ together

§ 08 The latent space of an autoencoder

Train an autoencoder with a 2D bottleneck on MNIST, then plot every training image's code on the plane. The picture you get is informative but ugly: clusters of digits in roughly the right places, but with holes between them, irregular shapes, and arbitrary scales.

Demo 06 · Latent space of a standard autoencoder
Each point is a training image, colored by its digit class. Hover any point — that's the code the encoder assigned to one example.
The problem. If you pick a point in the white space between clusters and decode it, you don't get a clean digit — you get a blur. The encoder never placed anything there, so the decoder has no idea what should live there.

This irregular layout is a perfectly fine outcome if you only want to reconstruct or compress. But it's a problem the moment you want to generate new samples by picking a random latent point and decoding it — most random points will land in the "no man's land" between clusters and produce nonsense. Section §11 onwards solves exactly this problem.

§ 09 Useful variants: sparse, denoising, contractive

Before VAEs, the field invented several flavours of autoencoder that add structure to the latent code without going fully probabilistic.

Sparse autoencoders

Add an L1 penalty on the bottleneck activations: L = L_recon + λ · Σ |z_i|. Most bottleneck dimensions are pushed to zero for any given input, leaving only a small active subset. This is useful for feature discovery — each non-zero dimension is forced to specialise in one type of input pattern.
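As a rough sketch of that penalty in plain Python: the function name `sparse_loss`, the weight `lam`, and the two example codes are assumptions chosen for illustration. The two codes have the same squared norm, so only the L1 term tells them apart.

```python
def sparse_loss(recon_loss, z, lam=0.01):
    """Reconstruction loss plus an L1 penalty on the bottleneck code z."""
    l1_penalty = sum(abs(z_i) for z_i in z)
    return recon_loss + lam * l1_penalty


dense_code = [0.5, -0.5, 0.5, -0.5]   # every unit active
sparse_code = [0.0, 1.0, 0.0, 0.0]    # same squared norm, one active unit

print(sparse_loss(0.2, dense_code))
print(sparse_loss(0.2, sparse_code))  # smaller: the penalty prefers few active units
```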

Denoising autoencoders

Corrupt the input with noise before feeding it in, but ask the network to reconstruct the clean input. The network can't simply memorise — it has to learn what's "essential" enough to survive noise.

$$\tilde{x} = x + \varepsilon, \qquad L = \lVert x - f_\theta(h_\phi(\tilde{x})) \rVert^2$$
Network sees x̃ but is graded against x. Forces robustness to corruption.
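A minimal sketch of the corrupt-then-grade-against-clean pattern, in plain Python. Gaussian noise with `noise_std=0.3` is an assumed corruption (masking pixels is another common choice), and `identity_model` is a hypothetical stand-in for the network, used only to show that copying the input stops being a zero-loss solution.

```python
import random


def corrupt(x, noise_std=0.3, rng=random):
    """Return x_tilde = x + eps, with eps ~ N(0, noise_std^2) drawn per element."""
    return [x_i + rng.gauss(0.0, noise_std) for x_i in x]


def denoising_loss(x_clean, x_hat):
    """Squared error against the CLEAN input, not the corrupted one."""
    return sum((a - b) ** 2 for a, b in zip(x_clean, x_hat))


def identity_model(v):
    """Stand-in for decoder(encoder(v)): a 'network' that just copies its input."""
    return list(v)


rng = random.Random(0)
x = [0.0, 1.0, 1.0, 0.0]
x_tilde = corrupt(x, noise_std=0.3, rng=rng)

# The model only ever sees x_tilde, but it is graded against x.
x_hat = identity_model(x_tilde)
print(denoising_loss(x, x_hat))  # positive: copying the input no longer gives zero loss
```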

Contractive autoencoders

Penalise the Jacobian of the encoder with respect to its input. Concretely: L = L_recon + λ · ||∂h/∂x||²_F. This makes the encoder less sensitive to small input changes — nearby inputs map to nearby codes. A different way to encourage smooth latent structure.
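For a single-layer encoder the Jacobian penalty has a simple closed form, which the sketch below computes and then checks against finite differences. The one-layer `tanh` encoder and its weights are assumptions for illustration; deeper encoders need automatic differentiation instead.

```python
import math


def encode(W, b, x):
    """Single-layer encoder h = tanh(W · x + b)."""
    return [math.tanh(sum(w * x_i for w, x_i in zip(row, x)) + b_j)
            for row, b_j in zip(W, b)]


def contractive_penalty(W, b, x):
    """Squared Frobenius norm of the encoder Jacobian dh/dx.

    For h = tanh(Wx + b) the Jacobian entries are (1 - h_j^2) * W[j][i],
    so the norm factorises into a per-unit term.
    """
    h = encode(W, b, x)
    return sum((1.0 - h_j ** 2) ** 2 * sum(w ** 2 for w in row)
               for h_j, row in zip(h, W))


def finite_difference_penalty(W, b, x, step=1e-6):
    """Same quantity, estimated by nudging each input coordinate in turn."""
    h0 = encode(W, b, x)
    total = 0.0
    for i in range(len(x)):
        x_nudged = list(x)
        x_nudged[i] += step
        h1 = encode(W, b, x_nudged)
        total += sum(((a - c) / step) ** 2 for a, c in zip(h1, h0))
    return total


# Hypothetical 3 -> 2 encoder weights.
W = [[0.5, -1.0, 0.2],
     [1.5, 0.3, -0.7]]
b = [0.0, 0.1]
x = [0.2, -0.4, 1.0]

print(contractive_penalty(W, b, x), finite_difference_penalty(W, b, x))
```

The two printed numbers should agree to several decimal places; the penalty shrinks as units saturate (h close to ±1) or as the weights shrink.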

All three share

The same architecture as a standard autoencoder, plus a regularizer.

An interpretable latent code where individual dimensions or local neighbourhoods carry meaning.

None of them solve generation

Even with all these tricks, the latent space remains discrete clusters in a continuous space. There's no principled way to sample from it.

That gap — between learning a useful code and being able to generate from it — is exactly the gap variational autoencoders close.

§ 10 Why a standard autoencoder cannot generate

Three reasons, stacked on top of each other.

1. The latent space has no known distribution. You have no idea what shape it takes. So you don't know how to sample from it. Try a uniform distribution and you'll cover regions the encoder never visited; try a Gaussian and most samples will still miss the actual data manifold.

2. The latent space has gaps. Real points cluster around regions the encoder used; the in-between is undefined behaviour. Decoding a random point lands in undefined territory.

3. The latent space is not continuous in any guaranteed sense. Two nearby codes might decode to very different outputs because nothing in the loss function encouraged smoothness.

Variational autoencoders address all three of these in one move: they shape the latent space to be a known, smooth, continuous distribution by construction.

Part II
Variational autoencoders
Same skeleton — encoder, bottleneck, decoder — but the bottleneck now produces a distribution, not a point. The whole architecture is forced to play nicely with probability.

§ 11 From points to distributions

In a standard autoencoder, the encoder maps an input to one specific point z in latent space. A variational autoencoder makes the encoder output the parameters of a distribution instead — typically a Gaussian with mean μ and standard deviation σ. Then z is sampled from that distribution before being passed to the decoder.

$$\text{Standard AE:} \quad x \;\to\; z = h_\phi(x)$$
$$\text{VAE:} \quad x \;\to\; (\mu, \sigma) = h_\phi(x) \;\to\; z \sim \mathcal{N}(\mu, \sigma^2)$$
A single point is replaced by a recipe for sampling points.
Demo 07 · A point becomes a fuzzy cloud
Move the slider to switch from AE-style (single point) to VAE-style (distribution). Each tiny dot is one sample drawn from the encoder's output distribution.
Slider: σ, the spread of the distribution (shown at 0.30). At σ=0 the VAE collapses to a standard AE; at large σ the encoder is "uncertain" and one input maps to a region.
Forcing the encoder to commit to a region rather than a point is what makes the latent space smooth. Neighbours in the input map to overlapping Gaussians in latent space — and that overlap is what the decoder learns to interpolate across.

That single change cascades through everything. The loss function changes (we now need to ensure the distributions themselves are well-behaved). The forward pass changes (we need to sample, which is non-differentiable in its naive form). The interpretation changes (the encoder is now doing approximate Bayesian inference). The rest of this part walks through each of those consequences in order.

§ 12 VAE as a generative story

Before diving into the inference machinery, here is the imagined data-generating process that the VAE assumes — the story the model believes about how the data was produced.

1.   z ~ p(z) = N(0, I)   // pick a low-dim point from a standard normal
2.   x ~ pθ(x | z)   // pass it through a neural net fθ, then sample
Two steps. Sample a latent. Decode it through a parametric distribution.

Step 2 is just our decoder: fθ(z) produces the parameters of pθ(x | z) — for image data, usually the mean of a Gaussian or the probabilities of independent Bernoullis. Step 1 is the crucial piece: the latent variable is assumed to come from a standard Gaussian. That's our generative prior.
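The two-step story can be sketched in plain Python. Everything concrete here is an assumption for illustration: `toy_decoder` is a single linear layer plus sigmoid standing in for fθ, the weights are made up, and the observation model is independent Bernoullis over four "pixels".

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def toy_decoder(z, W, b):
    """Stand-in for the decoder: maps a latent z to per-pixel Bernoulli probabilities."""
    return [sigmoid(sum(w * z_k for w, z_k in zip(row, z)) + b_i)
            for row, b_i in zip(W, b)]


def generate(W, b, z_dim, rng):
    # Step 1: z ~ p(z) = N(0, I)
    z = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]
    # Step 2: x ~ p(x | z), here one independent Bernoulli per pixel
    probs = toy_decoder(z, W, b)
    x = [1 if rng.random() < p else 0 for p in probs]
    return z, probs, x


rng = random.Random(0)
# Hypothetical decoder weights: 2 latent dims -> 4 "pixels".
W = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]]
b = [0.0, 0.0, 0.0, 0.0]

z, probs, x = generate(W, b, z_dim=2, rng=rng)
print(z, probs, x)
```

In practice many implementations skip the final Bernoulli draw and display `probs` directly as the generated image.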

Why this is powerful
Even though p(z) is simple (a standard Gaussian) and pθ(x | z) may also be simple (an independent Gaussian per pixel), the non-linear mapping fθ can warp them into an arbitrarily complex pθ(x). That's how a VAE fits curvy, multi-modal distributions while only ever working with simple distributions internally.

If we believed this story exactly, training would be straightforward: just do maximum likelihood. Find the θ that maximises pθ(x1, ..., xn). The problem, as the next section shows, is that computing that probability requires solving an intractable integral. The whole machinery of variational inference exists to dodge that integral.

§ 13 The intractable posterior

Given a fixed decoder θ, two things would be nice to know for each observed x:

  1. The posterior pθ(z | x) — given this observation, what latent z probably produced it?
  2. The marginal likelihood pθ(x) — how probable is this observation under the model? (Maximising this is maximum likelihood.)

By Bayes' theorem, the posterior is:

$$p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)}$$
$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$
The denominator marginalizes z over the full latent space: a multi-dimensional integral with no closed form for any non-trivial fθ.

For a latent space of even modest dimensionality (50 dims, say), this integral has no closed form and Monte Carlo estimation would need an astronomical number of samples. This is the wall. Bayesian inference on a neural decoder is intractable by direct attack.
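For intuition, here is the naive Monte Carlo estimator on a deliberately easy case. The toy model is an assumption chosen because its marginal is known exactly: a 1-D latent with p(z) = N(0, 1) and a linear "decoder" p(x | z) = N(z, 1), which gives p(x) = N(0, 2). With one latent dimension the estimate lands close to the exact value; the argument above is that this stops being practical once z has many dimensions and fθ is a neural network.

```python
import math
import random


def normal_pdf(v, mean, var):
    return math.exp(-(v - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mc_marginal(x, num_samples, rng):
    """Naive estimate of p(x): sample z from the prior, average p(x | z)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)          # z ~ p(z) = N(0, 1)
        total += normal_pdf(x, z, 1.0)   # p(x | z) = N(x; z, 1)
    return total / num_samples


rng = random.Random(0)
x = 1.0
estimate = mc_marginal(x, 20000, rng)
exact = normal_pdf(x, 0.0, 2.0)          # closed form for this toy model only
print(estimate, exact)
```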

Variational inference is the workaround. Instead of computing the posterior, we approximate it with a simpler distribution qφ(z | x) that we deliberately design to be tractable — typically a Gaussian whose parameters are output by a neural network.

§ 14 The variational family

Pick a family of distributions Q we can sample from and evaluate easily. Inside that family, find the specific distribution that best approximates the true posterior. For VAEs, the standard choice is a diagonal-covariance Gaussian:

$$q_\phi(z \mid x) = \mathcal{N}\big(\mu_\phi(x),\, \operatorname{diag}(\sigma^2_\phi(x))\big)$$
μ and σ² are themselves outputs of a neural network, the encoder.

"Diagonal" means we assume the latent dimensions are independent given the input — a simplifying assumption that lets us describe the covariance with a single vector rather than a full matrix. It's not strictly true (latent dimensions might correlate), but the trade-off between expressiveness and tractability is worth it.

One vital subtlety: amortized inference. We could have given each data point its own φi, a separate approximate posterior per example, but we don't. We share one neural network across all data points, mapping each xi to its own (μi, σi). This is what makes inference fast and scalable; it's the whole reason VAEs are practical.

Demo 08 · Building the approximate posterior
Each input x produces its own (μ, σ²). Drag the input to see the encoder's output distribution move.
Controls: pick input x (which of 10 example digits we are encoding); encoder uncertainty (how confident the encoder is: small σ for sharp, large σ for fuzzy).
Each input maps to its own little Gaussian in latent space. Inputs that look similar end up with overlapping Gaussians — and that overlap is what eventually gives the VAE smooth interpolation.

§ 15 KL divergence, visualized

The recipe for "best approximation" needs a notion of closeness between distributions. Variational inference uses Kullback-Leibler divergence:

$$\mathrm{KL}(q \,\|\, p) = \int q(z) \log \frac{q(z)}{p(z)}\, dz = \mathbb{E}_{z \sim q}\big[\log q(z) - \log p(z)\big]$$
Asymmetric. Zero when q = p. Always non-negative. Penalizes q for putting mass where p has none.

Two facts to keep in mind. First, KL divergence is not symmetric: KL(q || p) ≠ KL(p || q). The order matters. VAEs use KL(qφ(z|x) || p(z)), which is sometimes called "mode-seeking" because it harshly penalises q for placing mass where p has none. Second, KL between two diagonal Gaussians has a clean closed-form expression — no integral required.
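Both facts can be checked with the standard closed form for two 1-D Gaussians. The helper name `kl_gauss` and the example parameters are assumptions for illustration.

```python
import math


def kl_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for 1-D Gaussians."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)


narrow_vs_wide = kl_gauss(0.0, 0.5, 0.0, 2.0)  # narrow q inside a wide p
wide_vs_narrow = kl_gauss(0.0, 2.0, 0.0, 0.5)  # wide q spilling outside a narrow p
print(narrow_vs_wide, wide_vs_narrow)
print(kl_gauss(1.0, 2.0, 1.0, 2.0))            # identical distributions
```

Swapping the arguments changes the value, and the direction in which q spreads mass into regions where p is thin is the expensive one.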

Demo 09 · KL divergence between two Gaussians
Drag the parameters of the green distribution (q) toward the dashed prior (p). KL drops as they align.
Sliders: μ of q, the mean of the variational distribution (shown at 2.0); σ of q, its standard deviation (shown at 1.5).
Readouts: KL(q || N(0,1)), closed form for two Gaussians, split into a mean penalty μ² / 2 and a variance penalty (σ² − 1 − log σ²) / 2.

For a diagonal-Gaussian q and a standard-Gaussian prior, the KL term collapses to a sum over latent dimensions:

$$\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\big) = -\tfrac{1}{2} \sum_j \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big)$$
No integral. No Monte Carlo. A simple algebraic expression in terms of the encoder's outputs.

This is the term that pulls every qφ(z|x) toward the standard normal. Without it, the encoder would happily place each input far from every other input (zero KL constraint, easy reconstruction). With it, the encoder has to share a common region of latent space — which is precisely what makes the latent space well-behaved for generation.
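The closed form is short enough to write out directly. In this sketch the function name `kl_to_standard_normal` and the example values μ = 2.0, σ = 1.5 are assumptions for illustration; the inputs are the mean and log-variance vectors the encoder would produce. The result is compared with the mean-penalty plus variance-penalty split, μ²/2 + (σ² − 1 − log σ²)/2.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m ** 2 - math.exp(lv)
                      for m, lv in zip(mu, logvar))


# Example values for a single latent dimension.
mu, sigma = 2.0, 1.5
kl = kl_to_standard_normal([mu], [math.log(sigma ** 2)])
mean_penalty = mu ** 2 / 2.0
variance_penalty = (sigma ** 2 - 1.0 - math.log(sigma ** 2)) / 2.0
print(kl, mean_penalty + variance_penalty)      # both ≈ 2.2195

# An encoder output that already matches the prior costs nothing.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
```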

§ 16 The ELBO: Evidence Lower Bound

Putting it all together. We want to maximise the log marginal likelihood log pθ(x), which we cannot compute. The trick is a chain of algebra that introduces qφ and decomposes the log-likelihood:

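$$\log p_\theta(x) = \mathrm{ELBO}(\phi, \theta) + \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$$
$$\mathrm{ELBO}(\phi, \theta) := \mathbb{E}_{z \sim q_\phi}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
Because KL is non-negative, the ELBO is a lower bound on log pθ(x). Maximize the bound, you push up the likelihood.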

Read the second line as two opposing forces. The first term rewards the model when the decoder, fed a sample from qφ(z | x), reproduces x. This is the reconstruction term. The second term penalises the model when qφ drifts away from the prior p(z). This is the regulariser.

$$L_{\text{VAE}}(x) = -\mathbb{E}_{z \sim q_\phi}\big[\log p_\theta(x \mid z)\big] + \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
The negative ELBO. Minimizing this is equivalent to maximizing the ELBO.

This is the entire VAE training objective. Every term is differentiable (after one more trick, in §18). The architecture is just two neural networks. Everything else — generative interpretation, smooth latent space, ability to sample new data — is a consequence of optimising this single loss.
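The decomposition log pθ(x) = ELBO + KL(q || posterior) can be verified numerically on a model simple enough to solve by hand. The toy model below is an assumption chosen for that reason: p(z) = N(0, 1) and p(x | z) = N(z, 1), so the exact marginal is N(0, 2) and the exact posterior is N(x/2, 1/2). A real VAE has neither in closed form, which is why only the ELBO is computed in practice.

```python
import math

LOG_2PI = math.log(2.0 * math.pi)


def log_normal_pdf(v, mean, var):
    return -0.5 * (LOG_2PI + math.log(var)) - (v - mean) ** 2 / (2.0 * var)


def kl_gauss(m_q, var_q, m_p, var_p):
    """KL( N(m_q, var_q) || N(m_p, var_p) ) in one dimension."""
    return 0.5 * (math.log(var_p / var_q) + (var_q + (m_q - m_p) ** 2) / var_p - 1.0)


def elbo(x, m, var):
    """ELBO for q(z|x) = N(m, var) under p(z) = N(0, 1), p(x|z) = N(z, 1)."""
    # E_q[log N(x; z, 1)] has a closed form because E_q[(x - z)^2] = (x - m)^2 + var.
    expected_log_lik = -0.5 * LOG_2PI - 0.5 * ((x - m) ** 2 + var)
    return expected_log_lik - kl_gauss(m, var, 0.0, 1.0)


x = 1.0
m, var = 0.3, 0.64                     # an arbitrary, deliberately imperfect q
log_px = log_normal_pdf(x, 0.0, 2.0)   # exact log marginal for this toy model
gap = kl_gauss(m, var, x / 2.0, 0.5)   # KL(q || true posterior)

print(log_px, elbo(x, m, var) + gap)   # the two numbers agree
print(elbo(x, m, var) <= log_px)       # the ELBO sits below log p(x)
print(elbo(x, x / 2.0, 0.5) - log_px)  # ≈ 0 when q equals the true posterior
```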

§ 17 Encoder outputs μ and log σ²

The encoder of a VAE has two output heads instead of one. Each head produces a vector of size z_dim — one for the mean μ, one for the log-variance log σ². (We predict log-variance, not variance, because log-space is unbounded above and below — easier to learn and numerically safer.)

Demo 10 · Two-headed encoder
A standard encoder body, then two parallel output heads. Press play to send one input through.
The encoder forks into two heads. The green head outputs the centre of the latent Gaussian; the amber head outputs how spread-out it is.
# Two output heads in PyTorch
import torch.nn as nn
import torch.nn.functional as F

class VAEEncoder(nn.Module):
    def __init__(self, x_dim, hidden_dim, z_dim):
        super().__init__()
        self.body = nn.Linear(x_dim, hidden_dim)
        self.mu_head = nn.Linear(hidden_dim, z_dim)   # predicts μ
        self.logvar = nn.Linear(hidden_dim, z_dim)    # predicts log σ²

    def forward(self, x):
        h = F.relu(self.body(x))
        return self.mu_head(h), self.logvar(h)

§ 18 The reparameterization trick

Here is the trick that makes the whole thing trainable. After the encoder produces μ and σ², we need to sample z ~ N(μ, σ²) and pass it to the decoder. But sampling is stochastic, and gradients cannot flow through stochastic operations. We'd get stuck.

The reparameterization trick "factors out" the randomness. Instead of sampling from N(μ, σ²) directly, we sample ε from a standard normal and transform it deterministically with μ and σ.

$$\text{Bad (not differentiable):} \quad z \sim \mathcal{N}(\mu, \sigma^2)$$
$$\text{Good (differentiable):} \quad \varepsilon \sim \mathcal{N}(0, I), \quad z = \mu + \sigma \cdot \varepsilon$$
The randomness is now in ε, which has no parameters. μ and σ are deterministic functions of x, so gradients flow back through them.
Demo 11 · Reparameterization, animated
Watch a sample ε drop in from a standard normal, get shifted by μ, scaled by σ, and emerge as z. This is the entire trick.
Sliders: μ, where to centre z (shown at 1.50); σ, how spread out z is (shown at 0.80). Action: draw a fresh ε and watch it transform.
Each click draws a new ε from N(0, I) — the noise drops in from above, gets shifted right by μ, stretched by σ, and lands on the green axis as z.

This isn't a numerical hack. It's a structural change to the computation graph: stochasticity is pushed outside the differentiable region. μ and σ are deterministic functions of x, so backpropagation works perfectly — gradients flow from the loss, through z, through μ and σ, back into the encoder weights, all while a single sample of ε sits placidly off to the side as input data.

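# The reparameterization trick in PyTorch: four lines (a method of the VAE module)
import torch

def reparameterize(self, mu, logvar):
    std = torch.exp(0.5 * logvar)    # σ = exp(½ · log σ²)
    eps = torch.randn_like(std)      # ε ~ N(0, I), same shape as σ
    z = mu + std * eps               # z = μ + σ · ε
    return z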

§ 19 Sampling, then decoding

Once z is drawn via reparameterization, the decoder runs exactly as in a standard autoencoder: a mirror network that expands z back into a reconstruction. The only thing the decoder "sees" is a single vector; it has no idea that vector came from a distribution rather than a fixed code.

That's a feature, not a bug. The randomness during training is what forces the decoder to be robust: it doesn't get to memorise specific codes, because the encoder will deliver slightly different codes for the same input every time. So the decoder learns to handle the neighbourhood around each input's mean code, not just the mean itself.

Demo 12 · Many samples, many reconstructions
For one fixed input, draw multiple samples of z and watch the decoder produce slightly different reconstructions.
With small σ all reconstructions look nearly identical. Increase σ and the network starts producing visible variation.

§ 20 Full forward pass, animated

Now we have all the pieces. Stitched together, a complete VAE forward pass looks like this:

Demo 13 · The whole VAE in one animation
Press play. A single input travels: encoder body → μ & log σ² → reparameterization → z → decoder → reconstruction → loss.
Each step lights up one stage of the network and explains what happens there.

This is the same diagram that lives in every VAE paper — but seeing the signal move through it is the part that usually has to live in your head.

§ 21 The two-term loss in practice

The full training loss for a single example is the negative ELBO, which we already saw:

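$$L_{\text{recon}}(x) = \lVert x - \hat{x} \rVert^2 \quad \text{(or BCE, depending on output type)}$$
$$L_{\text{KL}}(x) = -\tfrac{1}{2} \sum_j \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big)$$
$$L_{\text{VAE}}(x) = L_{\text{recon}}(x) + L_{\text{KL}}(x)$$
Two terms, both differentiable: the KL term in closed form, the reconstruction term estimated from a single sample of z.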

The two terms pull in opposite directions. The reconstruction loss wants each qφ(z | x) to be a tight, distinctive point — easy for the decoder to invert. The KL term wants every qφ(z | x) to look like the standard normal — overlapping, generic, indistinguishable. The model has to find the compromise.

Demo 14 · The two losses, fighting
Toggle which loss is active. With only the reconstruction loss you get an unregularized autoencoder; with only the KL loss every input maps to the same prior; with both you get a working VAE.
Toggles: reconstruction loss (pushes codes apart); KL loss (pulls every code toward the standard normal). Outcome with both losses active: a working VAE, in which the encoder finds a clever middle ground.

§ 22 β-VAE: the regularization knob

The standard VAE weights both loss terms equally. The β-VAE introduces a tunable knob that controls how strongly the KL regulariser is enforced:

$$L_{\beta\text{-VAE}}(x) = L_{\text{recon}}(x) + \beta \cdot L_{\text{KL}}(x)$$
β = 1 recovers the standard VAE.
Demo 15 · The β-VAE slider
Slide β to watch the trade-off.
Slider: β, the KL weight (log-scaled from 0.01 to 100, shown at 1.0). Readouts: reconstruction (lower is better); KL divergence (lower = closer to prior); latent regularity (how Gaussian the codes are).
β > 1 — stronger regularization

Latent dimensions are pushed harder toward independence. In some setups this encourages disentangled representations.

Generated samples look more on-distribution but lose fine detail.

β < 1 — weaker regularization

Reconstructions sharpen, fine-grained detail is preserved.

But the latent space starts looking like a standard AE — clustered, gappy, hard to sample from.

§ 23 Sweeping the latent space

Once a VAE is trained, the decoder can be run on any point in latent space — not just codes produced by the encoder. With a 2D latent space, you can visualise the entire learned manifold by walking a grid of z values and decoding each one.

Demo 16 · Walk the latent plane
Drag the marker on the 2D latent plane (left). The decoder produces a digit-like image at that exact location (right).
Panels: latent plane (z0, z1); decoded image fθ(z). Readouts: z0 and z1 (both shown at 0.00).
Every point in the latent plane decodes to something. Because the encoder was forced to keep all encoded inputs close to N(0, I), neighbouring points decode to similar images — the manifold is smooth.

This grid-walk is the canonical VAE visualisation. Each axis of the latent space corresponds to some aspect of variation in the training data.

§ 24 Interpolation between samples

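Pick any two real inputs x1 and x2. Encode them to z1 and z2 (for a VAE, the encoder means μ are a common choice, since the encoder outputs a distribution rather than a point). Walk a straight line between them in latent space and decode every point along the way. With a VAE, you get a smooth morph from one input to the other.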

Demo 17 · Smooth interpolation
Slide α from 0 to 1 to walk a straight line from one digit to another.
Controls: start digit (picks z1, shown at 3); end digit (picks z2, shown at 8); α (shown at 0.50), with z(α) = (1−α)·z1 + α·z2. Panels: x1; fθ(z(α)); x2.
With a VAE, every intermediate α decodes to a coherent digit-like image.
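The latent walk itself is a one-line formula. In this plain-Python sketch the codes `z_three` and `z_eight` are hypothetical 2-D values made up for illustration; in a real run they would come from the encoder, and each point on the path would be passed to the decoder.

```python
def interpolate(z1, z2, alpha):
    """z(alpha) = (1 - alpha) * z1 + alpha * z2, element-wise."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z1, z2)]


def latent_path(z1, z2, num_steps):
    """Evenly spaced codes from z1 (alpha = 0) to z2 (alpha = 1), endpoints included."""
    return [interpolate(z1, z2, k / (num_steps - 1)) for k in range(num_steps)]


# Hypothetical 2-D codes for a "3" and an "8".
z_three = [-1.0, 0.5]
z_eight = [1.0, -0.5]

for z in latent_path(z_three, z_eight, num_steps=5):
    print(z)  # each z would be decoded into one frame of the morph
```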

§ 25 VAE vs standard AE, side by side

The same architecture, the same dataset, different objectives. The right column adds the KL term; that's the only difference.

Demo 18 · The qualitative difference, in one frame
Legend: encoded inputs · random sample → decoded · unit circle (prior region)
Same encoder/decoder size. Same training set. Just adding the KL term reshapes the latent space from "useful clusters with gaps" into "useful clusters embedded in a sampleable Gaussian".
| Property | Standard AE | VAE |
| --- | --- | --- |
| Encoder output | A single point z = hφ(x) | Distribution params (μ, σ²) over z |
| Bottleneck behaviour | Deterministic | Stochastic (z is sampled) |
| Loss function | Reconstruction only | Reconstruction + KL(q ‖ p) |
| Latent space shape | Whatever shape gradient descent finds | Pulled toward N(0, I) |
| Can generate? | No: no defined sampling distribution | Yes: sample z ~ N(0, I), decode |
| Interpolation | Often produces blur or holes | Smooth, on-manifold |
| Use case | Compression, anomaly detection | Generation, smooth latent manipulation |

§ 26 A complete VAE in PyTorch

Everything we've covered, in one self-contained module. Read alongside the relevant section above as a kind of glossary.

import torch
import torch.nn as nn
import torch.nn.functional as F


class VAE(nn.Module):
    def __init__(self, x_dim=784, hidden=400, z_dim=20):
        super().__init__()
        # Encoder: shared body plus two heads (μ and log σ²)
        self.fc1 = nn.Linear(x_dim, hidden)
        self.fc_mu = nn.Linear(hidden, z_dim)
        self.fc_logvar = nn.Linear(hidden, z_dim)
        # Decoder
        self.fc3 = nn.Linear(z_dim, hidden)
        self.fc4 = nn.Linear(hidden, x_dim)

    def encode(self, x):
        h = F.relu(self.fc1(x))
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def decode(self, z):
        h = F.relu(self.fc3(z))
        return torch.sigmoid(self.fc4(h))

    def forward(self, x):
        mu, logvar = self.encode(x.view(-1, 784))
        z = self.reparameterize(mu, logvar)
        x_hat = self.decode(z)
        return x_hat, mu, logvar


def vae_loss(x_hat, x, mu, logvar, beta=1.0):
    # Reconstruction term (BCE, summed over pixels and batch)
    recon = F.binary_cross_entropy(x_hat, x.view(-1, 784), reduction='sum')
    # Closed-form KL between N(μ, diag(σ²)) and N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

§ 27 Applications & further reading

Where VAEs are used in practice

| Domain | What the VAE provides |
| --- | --- |
| Image generation | Smooth manifold of faces, digits, characters; precursor to diffusion models. |
| Latent diffusion models | Stable Diffusion runs diffusion in a compressed latent space; that space is produced by a VAE-style autoencoder. |
| Anomaly detection | Reconstruction error flags samples that don't fit the learned distribution. |
| Drug & molecule design | A VAE over SMILES strings lets you search continuously for molecules. |
| Single-cell genomics | scVI models RNA-seq counts via a VAE with a zero-inflated negative binomial likelihood. |
| Speech & music | Hierarchical VAEs underpin modern generative audio. |

VAE variants worth knowing

| Variant | Idea |
| --- | --- |
| β-VAE | Multiply the KL term by β; β > 1 encourages disentangled representations. |
| Conditional VAE | Encoder and decoder receive a conditioning signal y for controllable generation. |
| VQ-VAE | Replace the continuous latent with a discrete codebook. |
| NVAE | Hierarchical VAE that competes with GANs on sample quality. |
| IWAE | Tighter ELBO via multiple importance-weighted samples. |
| InfoVAE | Replace KL with MMD to dodge posterior collapse. |

The VAE is one of the cleanest examples in deep learning of an algorithm whose structure emerges logically from a probabilistic objective.

Part III
Reference & practice
Algorithm flows broken into numbered steps. Mathematical derivations worked out in full. Interview-style questions with concise answers.

§ 28 Algorithm flows, step by step

Every operation in an autoencoder/VAE, written out as numbered procedures. Each step is the smallest meaningful unit — read top-to-bottom and you have the whole algorithm.

ALGORITHM 1
Training a standard autoencoder
1
Initialise the two networks

Encoder hφ (input → bottleneck) and decoder fθ (bottleneck → output). Both with random weights. Bottleneck dimension is a fixed hyperparameter (e.g. 32).

2
Sample a minibatch

Draw a batch {x1, ..., xB} from the training set. Same data both as input and as target.

3
Encode

Compute zi = hφ(xi) for every example. zi is a single vector in latent space.

4
Decode

Compute i = fθ(zi) — the reconstruction.

5
Compute reconstruction loss

MSE for continuous, BCE for [0, 1] outputs:

L = (1/B) · Σi ‖xi − x̂i‖²
6
Backpropagate

Gradients flow from L backward through the decoder, then through the bottleneck, then through the encoder.

7
Update weights

φ ← φ − η · ∇φ L,  θ ← θ − η · ∇θ L. Adam, AdamW, or SGD — pick your favourite optimiser.

8
Repeat until convergence

Loop steps 2-7 over the dataset for the desired number of epochs. Stop when validation loss plateaus.
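The eight steps above can be sketched end to end in NumPy. This is a minimal illustration, not a reference implementation: the toy data, the purely linear encoder and decoder, the learning rate, and the hand-written gradients are all assumptions made so the loop fits in a few lines. A real model would add nonlinearities and use a framework's autodiff.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 512 points in 8-D that lie near a 2-D subspace.
latent_true = rng.normal(size=(512, 2))
mixing = rng.normal(size=(2, 8))
X = latent_true @ mixing + 0.01 * rng.normal(size=(512, 8))

# Step 1: initialise encoder and decoder weights (linear, no biases).
bottleneck = 2
W_enc = 0.1 * rng.normal(size=(8, bottleneck))   # plays the role of h_phi
W_dec = 0.1 * rng.normal(size=(bottleneck, 8))   # plays the role of f_theta
lr, batch_size = 0.005, 64

def mse(X, W_enc, W_dec):
    """L = (1/B) * sum_i ||x_i - x_hat_i||^2 over the rows of X."""
    return np.mean(np.sum((X @ W_enc @ W_dec - X) ** 2, axis=1))

loss_before = mse(X, W_enc, W_dec)

for epoch in range(300):                          # step 8: repeat
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        x = X[order[start:start + batch_size]]    # step 2: minibatch (input and target)
        z = x @ W_enc                             # step 3: encode
        x_hat = z @ W_dec                         # step 4: decode
        err = x_hat - x                           # step 5: residual of the MSE loss
        # Step 6: backpropagate by hand, decoder first, then encoder.
        grad_x_hat = 2.0 * err / len(x)
        grad_W_dec = z.T @ grad_x_hat
        grad_z = grad_x_hat @ W_dec.T
        grad_W_enc = x.T @ grad_z
        # Step 7: plain SGD update.
        W_enc -= lr * grad_W_enc
        W_dec -= lr * grad_W_dec

loss_after = mse(X, W_enc, W_dec)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Because the toy data is almost exactly two-dimensional, a bottleneck of 2 should be enough to drive the reconstruction error close to the noise level.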

ALGORITHM 2
Training a variational autoencoder
1
Initialise the networks

Encoder body shared, plus two output heads producing μφ and log σ²φ. Decoder fθ. Pick latent dimension z_dim.

2
Sample a minibatch

Draw {x1, ..., xB}. Both input and reconstruction target.

3
Encode to distribution parameters

For every xi the encoder outputs two vectors:

μi, log σ²i = hφ(xi)
4
Reparameterize

Sample εi ~ N(0, I) independently per example, then compute:

zi = μi + exp(½ · log σ²i) · εi

This is the only stochastic step. ε is an input, not a parameter — gradients pass through μ and σ cleanly.

5
Decode

Compute the reconstruction x̂i = fθ(zi).

6
Compute reconstruction loss
Lrecon = ‖xi − x̂i‖² (or BCE)
7
Compute the KL term

Closed-form between N(μ, σ²I) and N(0, I):

LKL = −½ · Σj (1 + log σ²j − μ²j − σ²j)
8
Combine the two losses
L = Lrecon + β · LKL

β = 1 recovers the standard VAE.

9
Backpropagate & update

Gradients flow: L → decoder → z → (μ, log σ²) → encoder body. ε sits to the side as a constant input.

10
Repeat

Loop steps 2-9. Watch Lrecon and LKL separately: Lrecon should fall, while LKL often starts near zero, rises as the encoder begins to use the latent, and then trades off against Lrecon.

ALGORITHM 3
The reparameterization trick (computational view)
1
Given μ, log σ² from the encoder

Both are differentiable functions of x and φ. The goal: sample z ~ N(μ, σ²) in a way that lets gradients flow.

2
Convert log σ² to σ
σ = exp(½ · log σ²)

Predicting log-variance is numerically safer.

3
Draw fresh noise
ε ~ N(0, I)

Same shape as z. No parameters depend on ε.

4
Affine combine
z = μ + σ · ε

By construction z ~ N(μ, σ²). The whole expression is differentiable in μ and σ.

5
Gradient flow

During backprop, ∂z/∂μ = 1 and ∂z/∂σ = ε. Both well-defined.
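A small NumPy check of Algorithm 3, assuming illustrative values for μ and log σ² (they are not from any trained model). It confirms two things empirically: the samples have the requested mean and standard deviation, and with ε held fixed, finite differences recover ∂z/∂μ = 1 and ∂z/∂σ = ε.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """Return z = mu + sigma * eps together with the noise eps that was drawn."""
    sigma = np.exp(0.5 * log_var)          # step 2: log-variance -> sigma
    eps = rng.standard_normal(mu.shape)    # step 3: fresh noise, same shape as z
    return mu + sigma * eps, eps           # step 4: affine combine

# Illustrative values (assumptions): one 2-D latent, repeated 100k times.
mu = np.tile([1.5, -0.5], (100_000, 1))
log_var = np.tile(np.log([0.25, 4.0]), (100_000, 1))

z, eps = reparameterize(mu, log_var, rng)
print("sample mean:", z.mean(axis=0))   # close to [1.5, -0.5]
print("sample std: ", z.std(axis=0))    # close to [0.5, 2.0]

# Step 5: with eps fixed, z is affine in mu and sigma, so finite
# differences give dz/dmu = 1 and dz/dsigma = eps.
h = 1e-6
sigma = np.exp(0.5 * log_var)
dz_dmu = ((mu + h) + sigma * eps - z) / h
dz_dsigma = (mu + (sigma + h) * eps - z) / h
print("max |dz/dmu - 1|      :", np.abs(dz_dmu - 1.0).max())
print("max |dz/dsigma - eps| :", np.abs(dz_dsigma - eps).max())
```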

ALGORITHM 4
Deriving the ELBO from log pθ(x)
1
Start with the log marginal likelihood
log pθ(x) = log ∫ pθ(x, z) dz

Intractable as-is because of the integral over z.

2
Introduce a variational distribution qφ(z | x)

Multiply and divide by qφ(z | x) inside the integral.

log pθ(x) = log ∫ qφ(z|x) · [pθ(x, z) / qφ(z|x)] · dz
3
Apply Jensen's inequality

log is concave, so log E[Y] ≥ E[log Y]:

log pθ(x) ≥ Eq[log pθ(x, z) − log qφ(z | x)]
4
Factorize the joint
pθ(x, z) = pθ(x | z) · p(z)
5
Regroup terms
ELBO = Eq[log pθ(x | z)] + Eq[log p(z) − log qφ(z | x)]
6
Recognize the KL divergence
ELBO = Eq[log pθ(x | z)] − KL(qφ(z | x) ‖ p(z))
7
Interpret the two terms

First term = reconstruction. Second term = regulariser. Maximising the ELBO ⇔ minimising the VAE loss, which is the negative ELBO.

ALGORITHM 5
Generating new samples from a trained VAE
1
Discard the encoder

For generation you only need fθ.

2
Sample from the prior
z ~ N(0, I)
3
Decode
x̂ = fθ(z)
4
Repeat for as many samples as you need

Each fresh z draws a fresh sample.
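Algorithm 5 in code, as a sketch. The `decode` function here is a stand-in with random weights (an assumption), so the outputs are not digit-like; with a trained fθ the same three lines of `generate` would produce real samples. The sizes z_dim = 20 and 784 pixels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
z_dim, x_dim = 20, 784

# Stand-in for a trained decoder f_theta (assumption): one linear layer
# plus a sigmoid, with random weights. Real weights would come from training.
W = 0.1 * rng.normal(size=(z_dim, x_dim))
b = np.zeros(x_dim)

def decode(z):
    """Map latent vectors [N, z_dim] to pixel intensities [N, x_dim] in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(z @ W + b)))

def generate(n_samples, rng):
    z = rng.standard_normal((n_samples, z_dim))   # step 2: z ~ N(0, I)
    return decode(z)                              # step 3: x_hat = f_theta(z)

samples = generate(16, rng)                       # step 4: as many as you need
images = samples.reshape(-1, 28, 28)
print(images.shape)   # (16, 28, 28)
```

Note that the encoder never appears: generation touches only the prior and the decoder.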

ALGORITHM 6
Computing KL(qφ(z|x) ‖ N(0, I)) in closed form
1
Inputs from the encoder

A vector of means μ ∈ ℝ^J and a vector of log-variances log σ² ∈ ℝ^J.

2
Per-dimension contribution
klj = −½ · (1 + log σ²j − μ²j − σ²j)
3
Sum across dimensions
KL = Σ_{j=1}^{J} klj
4
Average across the batch

Mean over examples in the minibatch. Add to the reconstruction loss.
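The four steps of Algorithm 6 map onto one short function. This is a sketch that assumes inputs of shape [B, J]; the function name and the example values are made up for illustration.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)) for inputs of shape [B, J].

    Returns the batch-mean KL and the per-dimension contributions.
    """
    kl_per_dim = -0.5 * (1.0 + log_var - mu**2 - np.exp(log_var))  # step 2
    kl_per_example = kl_per_dim.sum(axis=1)                         # step 3
    return kl_per_example.mean(), kl_per_dim                        # step 4

# Example: the first row matches the prior exactly, the second does not.
mu = np.array([[0.0, 0.0], [1.0, -2.0]])
log_var = np.array([[0.0, 0.0], [0.0, np.log(0.25)]])

kl_mean, kl_per_dim = kl_to_standard_normal(mu, log_var)
print("per-dimension KL:\n", kl_per_dim)
print("batch-mean KL:", kl_mean)
```

Returning the per-dimension values alongside the mean is a design choice: they are what you would log to see which latent dimensions are actually in use.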

§ 29 Mathematical derivations — Q&A

Worked-out questions that appear in textbooks, PhD qualifying exams, and "explain it on a whiteboard" interviews.

Q1
Derive the ELBO starting from log pθ(x).
Derivation

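Since log pθ(x) does not depend on z and ∫ qφ(z | x) dz = 1, we can write it as an expectation under q, then apply Bayes' rule pθ(x) = pθ(x, z) / pθ(z | x) and multiply and divide by qφ(z | x) inside the log:

log pθ(x) = ∫ qφ(z | x) · log pθ(x) · dz
= ∫ qφ(z|x) · log [ pθ(x, z) / pθ(z | x) ] · dz
= ∫ qφ(z|x) · log [ (pθ(x, z) / qφ(z|x)) · (qφ(z|x) / pθ(z | x)) ] · dz
= Eq[log pθ(x, z) − log qφ(z|x)] + KL(qφ(z|x) ‖ pθ(z|x))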

The first term is the ELBO. The second is a non-negative KL. So:

log pθ(x) = ELBO(φ, θ; x) + KL(qφ(z | x) ‖ pθ(z | x))

Consequences: (i) ELBO ≤ log pθ(x). (ii) Maximising ELBO over φ tightens the bound. (iii) Joint maximisation over (φ, θ) is reasonable approximate maximum likelihood.
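The identity log pθ(x) = ELBO + KL can be checked numerically on a model small enough that every term has a closed form. The model below is an assumption chosen for that reason: p(z) = N(0, 1) and p(x | z) = N(z, 1), which gives p(x) = N(0, 2) and a true posterior N(x/2, 1/2). The variational family is q = N(m, s²).

```python
import numpy as np

def log_evidence(x):
    # p(z) = N(0, 1), p(x|z) = N(z, 1)  =>  p(x) = N(0, 2)
    return -0.5 * np.log(2 * np.pi * 2.0) - x**2 / 4.0

def elbo(x, m, s2):
    """ELBO for q = N(m, s2): E_q[log p(x|z)] - KL(q || p(z))."""
    recon = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s2)
    kl_prior = -0.5 * (1.0 + np.log(s2) - m**2 - s2)
    return recon - kl_prior

def kl_to_posterior(x, m, s2):
    """KL(N(m, s2) || N(x/2, 1/2)), the gap between log p(x) and the ELBO."""
    post_mean, post_var = x / 2.0, 0.5
    return 0.5 * (np.log(post_var / s2) + (s2 + (m - post_mean) ** 2) / post_var - 1.0)

x = 1.3
for m, s2 in [(0.0, 1.0), (0.4, 0.8), (x / 2, 0.5)]:
    gap = log_evidence(x) - elbo(x, m, s2)
    print(f"m={m:.2f} s2={s2:.2f}  ELBO={elbo(x, m, s2):.4f}  "
          f"gap={gap:.4f}  KL(q||posterior)={kl_to_posterior(x, m, s2):.4f}")
```

The gap and the KL column should agree on every row, and both should vanish on the last row, where q equals the true posterior. That is consequences (i) and (ii) in miniature.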

Q2
Derive the closed-form KL between q = N(μ, σ²·I) and p = N(0, I).
Derivation

For univariate q = N(μ, σ²) and p = N(0, 1):

KL(q ‖ p) = Eq[log q(z)] − Eq[log p(z)]

For a Gaussian, Eq[log q(z)] = −½ · (log(2πσ²) + 1). And Eq[log p(z)] = −½·log(2π) − ½·(μ² + σ²) using E[z²] = μ² + σ². Subtract:

KL = −½ · (1 + log σ² − μ² − σ²)

For J-dimensional diagonal q, sum over dimensions:

KL = −½ · Σ_{j=1}^{J} (1 + log σ²j − μ²j − σ²j)
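For readers who want the "Subtract" step of Q2 term by term, here is the univariate algebra written out in LaTeX. Nothing new is assumed; it only expands the two expectations already stated.

```latex
\begin{aligned}
\mathrm{KL}(q \,\|\, p)
  &= \mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(z)] \\
  &= -\tfrac12\bigl(\log(2\pi\sigma^2) + 1\bigr)
     + \tfrac12\log(2\pi) + \tfrac12\bigl(\mu^2 + \sigma^2\bigr) \\
  &= -\tfrac12\log\sigma^2 - \tfrac12 + \tfrac12\mu^2 + \tfrac12\sigma^2 \\
  &= -\tfrac12\bigl(1 + \log\sigma^2 - \mu^2 - \sigma^2\bigr).
\end{aligned}
```

The log(2π) terms cancel between the second and third lines, which is why the final expression has no π in it.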
Q3
Why is sampling z ~ N(μ, σ²) directly not differentiable, but z = μ + σ · ε is?
Answer

Direct sampling: z comes from a random oracle, not a function. There's no analytical map between (μ, σ) and the observed z to differentiate.

Reparameterized: z = μ + σ·ε with ε ~ N(0, 1) is a deterministic function of (μ, σ, ε). The random oracle was called before we touched μ and σ.

∂z / ∂μ = 1,   ∂z / ∂σ = ε

Both well-defined, so gradients flow from L through z back into μ and σ, then into encoder weights φ.

Q4
Show that maximising the ELBO is equivalent to minimising KL(qφ(z|x) ‖ pθ(z|x)) when θ is fixed.
Derivation

From Q1: log pθ(x) = ELBO + KL(qφ(z|x) ‖ pθ(z|x)).

For fixed θ, the LHS is constant. So increasing ELBO by Δ must decrease the KL gap by exactly Δ. Therefore:

arg maxφ ELBO = arg minφ KL(qφ(z|x) ‖ pθ(z|x))

This is what makes the ELBO a tractable surrogate.

Q5
Why predict log σ² rather than σ?
Answer

(i) Positivity for free. Variance must be positive. log-space removes the constraint.

(ii) Numerical stability. log σ² ∈ (−∞, ∞); tiny variances become moderate negative numbers instead of vanishing floats.

(iii) Cleaner KL term. The closed-form KL contains log σ² directly.

Implementation: σ = exp(½ · log σ²) when σ is needed.

Q6
Why does the reparameterized gradient have lower variance than the score-function (REINFORCE) gradient?
Answer

Reparameterized: ∇φ L = Eε[ ∇φ ℓ(x, x̂(ε, μφ, σφ)) ]. Differentiates the loss function.

Score-function: ∇φ L = Ez ~ qφ[ ℓ(x, z) · ∇φ log qφ(z|x) ]. Multiplies the gradient of log q by the raw loss value, which is noisy.

In practice the reparameterized version has 1–3 orders of magnitude lower variance, which is why VAE training works with just one Monte Carlo sample per data point.
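A toy comparison of the two estimators, under assumptions chosen so the exact answer is known: the "loss" is ℓ(z) = z², q is N(μ, σ²) with μ = σ = 1, and we estimate the gradient with respect to μ, which is exactly 2μ. Both estimators are unbiased; the point is the spread of their per-sample values.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 1.0, 200_000

eps = rng.standard_normal(n)
z = mu + sigma * eps

# Toy loss (assumption): l(z) = z**2, so d/dmu E[l(z)] = 2 * mu exactly.
grad_reparam = 2.0 * z                       # d l(mu + sigma*eps) / d mu
grad_score = z**2 * (z - mu) / sigma**2      # l(z) * d log q(z) / d mu

print("true gradient:      ", 2 * mu)
print("reparam  mean / var:", grad_reparam.mean(), grad_reparam.var())
print("score-fn mean / var:", grad_score.mean(), grad_score.var())
```

In this one-dimensional example the variance gap is well under an order of magnitude. How large it gets depends on the loss and the dimensionality, so treat the numbers here as a demonstration of the direction of the effect, not its typical size.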

Q7
Compute E[z²] for z ~ N(μ, σ²).
Derivation
Var(z) = E[z²] − (E[z])²
⇒ E[z²] = Var(z) + (E[z])² = σ² + μ²

This identity is the workhorse of the KL derivation in Q2.

Q8
Why is KL asymmetric? What happens if we use KL(p ‖ q) instead?
Answer

KL is asymmetric because q · log(q/p) weights by q's density, not p's.

  • Reverse KL — KL(q ‖ p) — "mode-seeking". q concentrates on a single mode of p. This is what VAEs use.
  • Forward KL — KL(p ‖ q) — "mean-seeking". q spreads out to cover everything p has mass on.

Reverse KL falls out naturally from the ELBO. Using forward KL would be intractable — it would require sampling from the true posterior, which is what we don't know how to do.
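The mode-seeking versus mean-seeking behaviour can be seen by brute force. The sketch below assumes a bimodal target p (an equal mixture of N(−3, 0.5²) and N(+3, 0.5²)), fits a single Gaussian q = N(m, s²) by grid search, and evaluates both KL directions with a Riemann sum. The target, the grids, and the integration range are all illustrative choices.

```python
import numpy as np

z = np.linspace(-12.0, 12.0, 4801)      # integration grid
dz = z[1] - z[0]

def log_normal(z, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - 0.5 * ((z - m) / s) ** 2

# Target p (assumption): equal mixture of N(-3, 0.5^2) and N(+3, 0.5^2).
log_p = np.logaddexp(log_normal(z, -3.0, 0.5), log_normal(z, 3.0, 0.5)) + np.log(0.5)

def kl(log_a, log_b):
    """Riemann-sum estimate of KL(a || b) on the grid."""
    return np.sum(np.exp(log_a) * (log_a - log_b)) * dz

# Grid search over the mean m and standard deviation s of q.
best = {"reverse": (np.inf, None), "forward": (np.inf, None)}
for m in np.linspace(-4.0, 4.0, 81):
    for s in np.linspace(0.2, 4.0, 39):
        log_q = log_normal(z, m, s)
        candidates = (("reverse", kl(log_q, log_p)), ("forward", kl(log_p, log_q)))
        for name, value in candidates:
            if value < best[name][0]:
                best[name] = (value, (round(float(m), 2), round(float(s), 2)))

print("reverse KL(q||p) minimiser (m, s):", best["reverse"][1])
print("forward KL(p||q) minimiser (m, s):", best["forward"][1])
```

The reverse-KL fit should sit on one of the two modes with a width close to 0.5, while the forward-KL fit should centre near 0 with a width of roughly 3, covering both modes and the empty gap between them.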

Q9
If the decoder distribution is Gaussian with fixed variance, what does the reconstruction loss reduce to?
Derivation

For pθ(x | z) = N(fθ(z), σ²dec · I):

log pθ(x | z) = const − (1 / (2σ²dec)) · ‖x − fθ(z)‖²

So arg max log pθ(x | z) = arg min ‖x − fθ(z)‖², i.e. mean-squared error.

σ²dec becomes an implicit β-like weight between reconstruction and KL.
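To make the "β-like weight" explicit, rescale the negative ELBO by 2σ²dec. The label β_eff is introduced here for illustration and is not part of the original notation; it assumes the reconstruction term is the unscaled sum of squared errors.

```latex
\begin{aligned}
-\mathrm{ELBO}
  &= \frac{1}{2\sigma_{\mathrm{dec}}^2}\,
     \mathbb{E}_{q}\,\lVert x - f_\theta(z)\rVert^2
     + \mathrm{KL}\bigl(q_\phi(z \mid x)\,\|\,p(z)\bigr) + \text{const} \\
2\sigma_{\mathrm{dec}}^2 \cdot \bigl(-\mathrm{ELBO}\bigr)
  &= \mathbb{E}_{q}\,\lVert x - f_\theta(z)\rVert^2
     + \underbrace{2\sigma_{\mathrm{dec}}^2}_{\beta_{\mathrm{eff}}}\,
       \mathrm{KL}\bigl(q_\phi(z \mid x)\,\|\,p(z)\bigr) + \text{const}
\end{aligned}
```

A small decoder variance therefore acts like a small β (reconstruction dominates), and a large one acts like a large β (the KL term dominates).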

Q10
For binary pixel data, derive the BCE form of the reconstruction loss.
Derivation

For pθ(x | z) = ∏i Bernoulli(xi; πi) with π = fθ(z):

log pθ(x | z) = Σi [ xi · log πi + (1 − xi) · log(1 − πi) ]

Negating gives binary cross-entropy:

LBCE = −Σi [ xi · log πi + (1 − xi) · log(1 − πi) ]
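A direct NumPy transcription of LBCE, as a sketch. The clipping constant `eps` is an assumption added to avoid log(0) when the decoder saturates; it is not part of the derivation.

```python
import numpy as np

def bernoulli_nll(x, pi, eps=1e-7):
    """Binary cross-entropy summed over pixels; x and pi have shape [B, D]."""
    pi = np.clip(pi, eps, 1.0 - eps)    # guard against log(0); eps is an assumption
    return -np.sum(x * np.log(pi) + (1.0 - x) * np.log(1.0 - pi), axis=1)

x = np.array([[1.0, 0.0, 1.0]])         # three binary "pixels"
pi = np.array([[0.9, 0.2, 0.6]])        # decoder probabilities for those pixels
print(bernoulli_nll(x, pi))             # equals -(log 0.9 + log 0.8 + log 0.6)
```

Summing over pixels, not averaging, keeps the reconstruction term on the same scale as the summed KL term, which matters for the balance between the two.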
Q11
What happens to the ELBO at β = 0 and at β → ∞?
Answer

β = 0: KL disappears. Encoder collapses to a Dirac delta. Model becomes a standard autoencoder. Generation from N(0, I) produces garbage.

β → ∞: Only KL matters. Optimal q is the prior, regardless of x — posterior collapse. Reconstruction drops to the level of decoding pure noise.

Sweet spot: β ∈ [0.1, 4] depending on dataset and decoder capacity.

Q12
Express the per-dimension KL contribution at μ = 0, σ = 1.
Derivation
kl = −½ · (1 + log 1 − 0 − 1) = 0

Zero when q matches the prior exactly — the attractor that the regulariser pulls every q(z|x) toward.

§ 30 Interview questions — quick answers

The conceptual questions interviewers ask. Short, complete answers — the kind you'd give in 60 seconds at a whiteboard.

i.01
In one sentence, what's the difference between an autoencoder and a variational autoencoder?

An autoencoder maps each input to a point in latent space; a VAE maps each input to a distribution over latent space and adds a KL regulariser that pulls those distributions toward a fixed prior — which is what makes the latent space sampleable for generation.

i.02
Why can a VAE generate but a standard AE cannot?

Because the VAE's KL term forces every q(z|x) to look like a known distribution — typically N(0, I). So you can sample fresh z ~ N(0, I) at test time and feed it to the decoder, and that z will land in the same region of latent space the decoder was trained on. With a standard AE you have no idea what distribution to sample from.

i.03
What is amortized inference, and why does the VAE use it?

Instead of learning a separate posterior qi(z) for each data point, a single shared neural network qφ(z | x) is trained to output posterior parameters for any input. You "amortize" the cost of inference over the whole dataset by paying once to train one network.

i.04
What is posterior collapse? How do you detect it and fix it?

What: the encoder collapses to q(z|x) ≈ p(z) for every x — KL ≈ 0, reconstruction stays bad.

Detect: log per-dimension KL during training. If near-zero with poor reconstruction, that's collapse.

Fix: KL annealing; free bits; weaker decoder; InfoVAE/MMD; richer posterior (normalizing flows).
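The detection step and the free-bits fix can both be written on top of the per-dimension KL. This is a sketch: the function names, the 0.01-nat threshold, and the 0.5-nat floor are illustrative assumptions, and the encoder outputs are synthetic. The free-bits variant shown applies the floor to the batch-averaged KL of each dimension, which is one common formulation.

```python
import numpy as np

def per_dim_kl(mu, log_var):
    """Batch-averaged KL contribution of each latent dimension; inputs are [B, J]."""
    return (-0.5 * (1.0 + log_var - mu**2 - np.exp(log_var))).mean(axis=0)

def collapsed_dims(mu, log_var, threshold=0.01):
    """Indices of dimensions whose average KL is below `threshold` nats."""
    return np.flatnonzero(per_dim_kl(mu, log_var) < threshold)

def free_bits_kl(mu, log_var, free_nats=0.5):
    """KL term with a per-dimension floor: dims under the floor contribute a constant."""
    return np.maximum(per_dim_kl(mu, log_var), free_nats).sum()

rng = np.random.default_rng(0)
B = 256
# Synthetic encoder outputs (assumption): dims 0-1 carry information,
# dims 2-3 sit exactly at the prior.
mu = np.concatenate([rng.normal(scale=2.0, size=(B, 2)), np.zeros((B, 2))], axis=1)
log_var = np.concatenate([np.full((B, 2), -2.0), np.zeros((B, 2))], axis=1)

print("per-dim KL:  ", per_dim_kl(mu, log_var).round(3))
print("collapsed:   ", collapsed_dims(mu, log_var))   # [2 3]
print("free-bits KL:", free_bits_kl(mu, log_var))
```

Because a dimension below the floor contributes a constant, the optimiser gets no KL gradient pushing it further toward the prior, which leaves the reconstruction term free to start using it.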

i.05
Why are VAE samples often blurry?

Two reasons. (i) Pixel-wise Gaussian/Bernoulli decoders optimise mode-averaging losses — the optimal MSE answer when multiple valid reconstructions exist is the pixel-wise mean (blurry). (ii) Reparameterized sampling adds noise; the decoder smooths to be robust. Fixes: richer output distributions (PixelCNN), perceptual/adversarial losses, VQ-VAE with autoregressive priors.

i.06
VAE vs GAN — when do you prefer each?

VAE: meaningful latent space (interpolation, anomaly detection), probabilistic interpretation, easy training, posterior access. GAN: maximum visual quality, no explicit likelihood needed. Modern hybrids (VQ-VAE-2, VQ-GAN) take the best of both.

i.07
VAE vs Diffusion models — how are they related?

A diffusion model can be viewed as a hierarchical VAE with a fixed encoder (forward noising) and a learned decoder (denoising network) at every timestep. Both optimise a variational lower bound. Stable Diffusion uses a VAE to compress to a small latent space, then runs diffusion inside it.

i.08
Why is the diagonal-covariance Gaussian assumption okay for q(z | x)?

It assumes latent dimensions are conditionally independent given x. Loss in expressiveness is small for typical data, gain in tractability is huge: covariance is a vector (J floats) instead of a matrix (J²). For richer posteriors, switch to normalizing flows or full covariance.

i.09
What does a Conditional VAE add?

Both encoder and decoder receive an extra conditioning input y (e.g. a class label). Loss: L = Lrecon(x, x̂ | y) + KL(q(z | x, y) ‖ p(z | y)). Sample z ~ p(z | y) at generation time and decode with the desired y — controllable generation.

i.10
If I increase the latent dimension z_dim from 2 to 100, what changes?

Reconstruction improves (more channels through the bottleneck). Total KL tends to grow, since it sums over more dimensions. Visualisation gets harder. Latent dimensions are used unevenly: some encode structure, others stay at the prior with near-zero KL.

i.11
Can VAEs work with discrete latent variables?

Not directly with reparameterization. Workarounds: Gumbel-Softmax (continuous relaxation); VQ-VAE (vector-quantize z, straight-through gradients); REINFORCE (high variance). VQ-VAE became dominant.

i.12
How would you debug a VAE that produces only the average of all training images?

Likely cause: posterior collapse driven by a strong KL term. Checklist: (1) Inspect per-dim KL; if every dimension is near zero, lower β or use annealing. (2) Check the decoder's z-weights; if they are near zero, the decoder ignores z. (3) Reduce decoder capacity. (4) Use free bits. (5) Try a richer posterior.

i.13
What if we chose a different prior than N(0, I)?

Mixture-of-Gaussians prior: better for multi-modal data. VampPrior: learned mixture, often outperforms standard. Normalizing flow prior: arbitrarily flexible. Standard normal is the default because closed-form KL is free.

i.14
Walk through a single training step at the tensor level.

Batch x: [B, 784]. (1) Encoder body: [B, 784] → [B, 256]. (2) Two heads: [B, 256] → [B, 20] each. (3) Sample ε: [B, 20]. (4) z = μ + exp(½·log σ²)·ε. (5) Decoder: [B, 20] → [B, 784], sigmoid. (6) BCE summed across pixels, KL summed across dims, both mean over batch. (7) Backward, step.
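The walkthrough above, written as a NumPy forward pass so the shapes can be printed. Several details are assumptions not fixed by the text: the batch size of 32, the tanh activation, single-layer encoder body and decoder, random weights, and a random binary batch standing in for real images. Step (7), the backward pass, is left out because NumPy has no autodiff; in a framework it is one call.

```python
import numpy as np

rng = np.random.default_rng(0)
B, x_dim, h_dim, z_dim = 32, 784, 256, 20

def linear(n_in, n_out):
    """Random weight matrix and zero bias for one dense layer."""
    return rng.normal(scale=n_in ** -0.5, size=(n_in, n_out)), np.zeros(n_out)

W_body, b_body = linear(x_dim, h_dim)
W_mu, b_mu = linear(h_dim, z_dim)
W_lv, b_lv = linear(h_dim, z_dim)
W_dec, b_dec = linear(z_dim, x_dim)

x = (rng.random((B, x_dim)) > 0.5).astype(float)   # stand-in batch of binary "images"

h = np.tanh(x @ W_body + b_body)                   # (1) [B, 784] -> [B, 256]
mu = h @ W_mu + b_mu                               # (2) [B, 256] -> [B, 20]
log_var = h @ W_lv + b_lv                          # (2) [B, 256] -> [B, 20]
eps = rng.standard_normal((B, z_dim))              # (3) [B, 20]
z = mu + np.exp(0.5 * log_var) * eps               # (4) reparameterize
pi = 1.0 / (1.0 + np.exp(-(z @ W_dec + b_dec)))    # (5) [B, 20] -> [B, 784], sigmoid

pi_safe = np.clip(pi, 1e-7, 1 - 1e-7)              # avoid log(0)
bce = -np.sum(x * np.log(pi_safe) + (1 - x) * np.log(1 - pi_safe), axis=1)  # (6) sum over pixels
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1)           # (6) sum over dims
loss = np.mean(bce + kl)                                                    # (6) mean over batch

for name, t in [("h", h), ("mu", mu), ("log_var", log_var), ("z", z), ("pi", pi)]:
    print(f"{name:8s}{t.shape}")
print("loss:", loss)
```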

i.15
What is the "evidence" in "evidence lower bound"?

The evidence is the marginal likelihood of the data: pθ(x) = ∫ pθ(x, z) dz. ELBO is a tractable function that is a lower bound on the log of that evidence.

i.16
How would you use a VAE for anomaly detection?

Train on normal data only. At test time, flag an input as anomalous when it shows (i) high reconstruction loss; (ii) low likelihood under the prior (encoded μ far from the origin); or (iii) high KL. Combine the scores, or use (i) alone for a simple threshold. Applications include fraud, defect inspection, and network intrusion.