Diffusion Models: Learning to Create by Learning to Destroy

Nish · July 2, 2026

In 2015, a paper borrowed an idea from nonequilibrium thermodynamics: if you slowly destroy structure in data, you can learn to rebuild it, and a machine that can rebuild structure from nothing is a machine that can create. The paper was largely ignored for five years. Then a 2020 follow-up made the recipe work, and within four years it was generating photorealistic images, minute-long videos, protein structures, and robot trajectories. This post builds diffusion models from first principles: the forward process that destroys, the reverse process that creates, the surprisingly simple loss that trains it, and then the ten-year evolution that turned a thousand-step curiosity into the engine behind Stable Diffusion, Sora, and AlphaFold 3.

TL;DR

  • A diffusion model turns the impossible problem of “generate an image from nothing” into a thousand easy problems: remove a little noise. Each step is plain supervised regression, which is why training is so stable compared to GANs.
  • The forward (noising) process is fixed and has a closed form: you can jump to any noise level \(t\) directly via \(\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon\). This one property makes training tractable.
  • The entire training objective collapses to “predict the noise”: \(\lVert \boldsymbol\epsilon - \boldsymbol\epsilon_\theta(\mathbf{x}_t, t) \rVert^2\). Predicting the noise, predicting the clean data, and predicting the score (the gradient of the log-density) are the same thing in different coordinates.
  • The field’s evolution is a sequence of named pain points and fixes: too slow to sample (DDIM, distillation, consistency models), no control over outputs (classifier-free guidance), too expensive in pixel space (latent diffusion), architecture that would not scale (diffusion transformers).
  • The modern formulation, flow matching / rectified flow, drops the thermodynamics entirely: draw a straight line from noise to data and learn the velocity field along it. Stable Diffusion 3 and FLUX train this way; diffusion turns out to be the curved-path special case.
  • The recipe generalizes to anything you can corrupt gradually: video, audio, molecules, robot actions, and even text, where diffusion language models generate all tokens in parallel and refine them together.

The problem: sampling from a distribution you cannot write down

A generative model has one job: given examples drawn from some unknown distribution \(p(\mathbf{x})\) (photos, songs, protein structures), produce new samples that could plausibly have come from it. The distribution of natural images over, say, a 512×512 grid of pixels is a monstrously complicated object living in a 786,432-dimensional space. Nobody can write down its density. All we have is samples.

By 2019 the main neural approaches each paid a different price:

  • GANs produced the sharpest images, but training is an adversarial game between two networks with no explicit likelihood anywhere. They are notoriously unstable and prone to mode collapse, where the generator finds a few crowd-pleasing outputs and ignores the rest of the distribution.
  • VAEs train stably by maximizing a likelihood bound, but their one-shot decoders tended to produce blurry samples.
  • Normalizing flows give exact likelihoods, but require every layer to be invertible with a tractable Jacobian determinant, which severely constrains the architecture.
  • Autoregressive models (the PixelCNN family) generate images pixel by pixel with tractable likelihoods, but sampling is glacially sequential and long-range coherence suffers.

Notice the burden the first three share: each asks a network to transform noise into a finished sample in one shot. One forward pass must simultaneously decide global composition, local texture, lighting, and every correlation between distant pixels. Diffusion models make a different wager. Like autoregressive models, they decompose generation into many small steps, but each step refines the whole sample at a progressively finer level of detail instead of committing to one pixel at a time, and each step is so small that it is individually easy to learn. The cost, as we will see, is paying for a network call at every one of those steps, and roughly a decade of research has been spent clawing that cost back down.

Destroy slowly, then learn to reverse

The word diffusion comes from thermodynamics: drop ink into water and molecular motion disperses it until nothing but uniform grey remains. Physics happily describes this forward direction. What it cannot do is run the film backwards for you, and that is exactly the part a diffusion model learns.

Sohl-Dickstein et al. (2015) turned this into an algorithm with two halves:

  1. Forward process: gradually add Gaussian noise to a data point over \(T\) steps (a fixed, hand-designed corruption; nothing is learned here) until nothing recognizable survives.
  2. Reverse process: train a network to undo one small step of that corruption, then chain it backwards from pure noise to a brand-new sample.

Five versions of the same photograph at increasing noise levels, from clean at x_0 to indistinguishable static at x_T, with blue forward arrows labelled q and purple reverse arrows labelled p-theta.
The two halves of a diffusion model, shown on a real photo noised with the DDPM schedule (T=1000). The forward process q (blue) is fixed and just mixes in Gaussian noise; by t=600 only 3% of the signal variance survives. The reverse process p (purple) is the learned part: a network trained to walk the chain right to left. Diagram layout follows Figure 2 of Ho et al. (2020). Photo: official U.S. Navy portrait of Grace Hopper (public domain).

The forward process, precisely

Start with a data point \(\mathbf{x}_0\) (an image, scaled to \([-1, 1]\)). Each forward step samples from a Gaussian centered on a slightly shrunk version of the previous state:

\[q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\; \beta_t \mathbf{I}\right)\]

where \(\beta_1, \dots, \beta_T\) is a pre-chosen noise schedule of small numbers (DDPM used \(T = 1000\) with \(\beta\) ramping linearly from \(10^{-4}\) to \(0.02\)). The \(\sqrt{1-\beta_t}\) shrink factor is not decoration: it keeps the overall variance from blowing up, so the process converges to a standard Gaussian rather than an ever-expanding cloud. You can check that if \(\mathbf{x}_{t-1}\) has unit variance, then \((1-\beta_t) + \beta_t = 1\) means \(\mathbf{x}_t\) does too. This is called the variance-preserving formulation.

Applied step by step this would be a nuisance: to get a training example at noise level \(t = 700\) you would have to simulate 700 successive noising steps. The property that makes diffusion trainable at all is that Gaussians compose. Define \(\alpha_t = 1 - \beta_t\) and \(\bar\alpha_t = \prod_{s=1}^{t} \alpha_s\), and the whole chain collapses into a single jump:

\[q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{\bar\alpha_t}\,\mathbf{x}_0,\; (1-\bar\alpha_t)\,\mathbf{I}\right) \quad\Longleftrightarrow\quad \mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon,\;\; \boldsymbol\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\]

Read \(\bar\alpha_t\) as a signal dial that slides from 1 (all data) to 0 (all noise). Every noisy image is just a weighted blend of the original and one draw of Gaussian noise, and you can manufacture a training example at any noise level with one multiply-add. No simulation, no chain.
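The single-jump claim is easy to check numerically. The sketch below is a NumPy illustration, not code from any of the papers: it assumes the DDPM linear schedule quoted above, uses a toy "dataset" in which every point equals 0.8, and the names `q_sample` and `q_sample_stepwise` are made up for this example. It reaches \(t = 700\) both ways and compares the resulting mean and variance with the closed-form prediction.

```python
import numpy as np

T = 1000
# DDPM linear schedule. NumPy arrays are 0-based, so index i holds beta_{i+1}.
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)            # abar[i] = alpha-bar_{i+1}


def q_sample(x0, t, rng):
    """Jump straight to noise level t (1-based) with the closed form."""
    eps = rng.standard_normal(x0.shape)
    a = abar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps


def q_sample_stepwise(x0, t, rng):
    """Reach the same noise level the slow way: t single Gaussian steps."""
    x = x0.copy()
    for s in range(t):
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(alphas[s]) * x + np.sqrt(betas[s]) * noise
    return x


rng = np.random.default_rng(0)
x0 = np.full(50_000, 0.8)            # toy data: every point sits at 0.8
t = 700
x_jump, _ = q_sample(x0, t, rng)
x_slow = q_sample_stepwise(x0, t, rng)

# The two routes should agree in distribution, not sample by sample.
print("closed form:", x_jump.mean(), x_jump.var())
print("700 steps  :", x_slow.mean(), x_slow.var())
print("predicted  :", np.sqrt(abar[t - 1]) * 0.8, 1.0 - abar[t - 1])
```

The stepwise route costs 700 rounds of random draws per example; the closed form costs one, which is the whole reason training can pick a fresh random \(t\) for every example.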

Why end at pure noise? Because \(\mathcal{N}(\mathbf{0}, \mathbf{I})\) is a distribution we can sample from trivially. The forward process is a bridge between “the distribution we want but cannot sample” and “a distribution we can sample but do not want”. Generation walks the bridge in reverse.

How fast you turn the signal dial matters more than it first appears. DDPM’s linear schedule destroys most of the image surprisingly early; Nichol & Dhariwal (2021) showed the final ~20% of its steps operate on essentially pure noise, wasting model capacity, and proposed a cosine schedule that spends more time at informative noise levels. The cleanest lens is the signal-to-noise ratio \(\text{SNR}(t) = \bar\alpha_t / (1 - \bar\alpha_t)\), which later work (Kingma et al., 2021) made the primary design object.

Two line charts comparing linear and cosine noise schedules: the fraction of signal remaining over time, and log signal-to-noise ratio over time.
Two views of the noise schedule (T=1000). Left: the linear DDPM schedule (blue) crushes the signal early, spending its last few hundred steps on images that are already indistinguishable from noise; the cosine schedule (green) decays more evenly (comparison after Figure 5 of Nichol & Dhariwal, 2021). Right: the same schedules as log SNR, the quantity later work treats as the real design choice. The dashed line marks the point where signal and noise have equal power.
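For readers who want to reproduce the two curves, here is a minimal sketch of both schedules and their log SNR. It assumes the linear endpoints quoted earlier and the cosine form with offset \(s = 0.008\) from Nichol & Dhariwal; the function names are illustrative, and the clipping of very large \(\beta_t\) that the paper applies near \(t = T\) is left out.

```python
import numpy as np

T = 1000


def linear_abar(T, beta_start=1e-4, beta_end=0.02):
    """Signal fraction alpha-bar_t (t = 1..T) for the linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)


def cosine_abar(T, s=0.008):
    """Signal fraction alpha-bar_t (t = 1..T) for the cosine schedule."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return (f / f[0])[1:]            # normalise so alpha-bar_0 = 1, drop t = 0


def log_snr(abar):
    """log SNR(t) = log(alpha-bar_t / (1 - alpha-bar_t))."""
    return np.log(abar) - np.log1p(-abar)


lin, cos = linear_abar(T), cosine_abar(T)
for t in (250, 500, 600, 750):
    print(f"t={t:4d}  linear abar={lin[t - 1]:.4f}  cosine abar={cos[t - 1]:.4f}  "
          f"log-SNR linear={log_snr(lin)[t - 1]:+.2f}  cosine={log_snr(cos)[t - 1]:+.2f}")
```

The log SNR crosses zero exactly where \(\bar\alpha_t = 1/2\), which is the dashed line in the right-hand panel.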

Learning to reverse

Generation needs \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\): given a noisy image, what did it look like one step less noisy? Bayes’ rule says this exists but depends on the unknown data distribution, which is the whole problem. Two facts rescue us.

Fact 1: the reverse of a small Gaussian step is also Gaussian. When \(\beta_t\) is small enough, \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\) is well-approximated by a Gaussian whose mean and variance we can hope to learn. This is the payoff for taking a thousand tiny steps instead of one big one: each backward hop is a simple, unimodal move, and all the multimodal complexity of “which image is this becoming?” emerges from composing them. So we posit a learned reverse chain

\[p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\; \boldsymbol\mu_\theta(\mathbf{x}_t, t),\; \sigma_t^2 \mathbf{I}\right)\]

with one network, conditioned on the timestep, serving every noise level.

Fact 2: the reverse step becomes exactly computable if you are told the answer. Condition on the clean image \(\mathbf{x}_0\) and the posterior is a known Gaussian with mean

\[\tilde{\boldsymbol\mu}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,\mathbf{x}_t\]

which is just a weighted average of the noisy input and the clean answer. So the only thing the network is missing is \(\mathbf{x}_0\) itself, and during training we have it. The learning problem quietly became supervised: here is a noisy image, here is the clean one, fill in the blank.

Ho, Jain & Abbeel (2020) took one more reparameterization step that made everything click. Since \(\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon\), knowing the noise \(\boldsymbol\epsilon\) is the same as knowing \(\mathbf{x}_0\). Substituting that into \(\tilde{\boldsymbol\mu}_t\) and simplifying, the ideal reverse mean becomes

\[\tilde{\boldsymbol\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\boldsymbol\epsilon\right)\]

so the network’s whole job reduces to predicting the noise that was mixed in: \(\boldsymbol\epsilon_\theta(\mathbf{x}_t, t)\). Train it with the most pedestrian loss in machine learning:

\[\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\, \mathbf{x}_0,\, \boldsymbol\epsilon}\left[\left\lVert \boldsymbol\epsilon - \boldsymbol\epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon,\; t\right) \right\rVert^2\right]\]

This is not a heuristic pulled from thin air: it is the variational lower bound (the same ELBO machinery that trains a VAE), with its per-timestep weights deliberately dropped because the unweighted version trains better in practice. The full derivation is worth seeing once, so here it is.

The ELBO derivation, from scratch

We want to maximize the likelihood \(p_\theta(\mathbf{x}_0)\), which is intractable because it marginalizes over all reverse paths. As with VAEs, we bound it using the forward process \(q\) as the inference distribution:

\[-\log p_\theta(\mathbf{x}_0) \le \mathbb{E}_q\!\left[-\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\right] =: \mathcal{L}_{\text{VLB}}\]

Expanding both chains (the forward one factorizes as \(\prod_t q(\mathbf{x}_t \mid \mathbf{x}_{t-1})\), the reverse one as \(p(\mathbf{x}_T)\prod_t p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\)), rewriting each forward factor with Bayes’ rule conditioned on \(\mathbf{x}_0\), and telescoping, the bound splits into per-step terms:

\[\mathcal{L}_{\text{VLB}} = \underbrace{D_{\text{KL}}\big(q(\mathbf{x}_T \mid \mathbf{x}_0)\,\Vert\, p(\mathbf{x}_T)\big)}_{L_T:\ \approx 0\ \text{by design}} + \sum_{t=2}^{T} \underbrace{D_{\text{KL}}\big(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\,\Vert\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\big)}_{L_{t-1}} + \underbrace{\big({-\mathbb{E}_q \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}\big)}_{L_0}\]

\(L_T\) has no parameters: the schedule is chosen so the forward process actually lands on \(\mathcal{N}(\mathbf{0},\mathbf{I})\). Each \(L_{t-1}\) is a KL between two Gaussians, which is a closed-form expression. The comparison target is the forward posterior

\[q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_{t-1};\; \tilde{\boldsymbol\mu}_t(\mathbf{x}_t, \mathbf{x}_0),\; \tilde\beta_t \mathbf{I}\big), \qquad \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t\]

obtained by multiplying the two Gaussians that Bayes’ rule demands and completing the square. With fixed variances, each KL reduces to a scaled squared distance between means:

\[L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\big\lVert \tilde{\boldsymbol\mu}_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol\mu_\theta(\mathbf{x}_t, t) \big\rVert^2\right] + \text{const}\]

Now parameterize both means in terms of noise. Substituting \(\mathbf{x}_0 = \big(\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon\big)/\sqrt{\bar\alpha_t}\) into \(\tilde{\boldsymbol\mu}_t\) gives the expression in the main text, and choosing the same functional form for \(\boldsymbol\mu_\theta\) (with \(\boldsymbol\epsilon_\theta\) in place of \(\boldsymbol\epsilon\)) makes everything cancel except

\[L_{t-1} = \mathbb{E}\!\left[\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)}\big\lVert \boldsymbol\epsilon - \boldsymbol\epsilon_\theta(\mathbf{x}_t, t) \big\rVert^2\right]\]

Dropping the ugly time-dependent weight (equivalently: reweighting the ELBO so every noise level counts equally) yields \(\mathcal{L}_{\text{simple}}\). Ho et al. found the unweighted version produces better samples, and later analysis showed why: the dropped weights over-emphasize barely-noisy timesteps whose denoising task is nearly trivial.
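Two steps of this derivation are worth a numerical sanity check: that the \(\mathbf{x}_0\)-form and the \(\boldsymbol\epsilon\)-form of the posterior mean really are the same quantity, and what the dropped weight looks like across timesteps. The sketch below assumes the linear DDPM schedule and the choice \(\sigma_t^2 = \beta_t\); the helper names are illustrative only.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)
abar_prev = np.concatenate([[1.0], abar[:-1]])   # alpha-bar_{t-1}, with alpha-bar_0 = 1


def posterior_mean_from_x0(x_t, x0, t):
    """Weighted average of the clean answer and the noisy input (t is 1-based)."""
    i = t - 1
    c0 = np.sqrt(abar_prev[i]) * betas[i] / (1.0 - abar[i])
    ct = np.sqrt(alphas[i]) * (1.0 - abar_prev[i]) / (1.0 - abar[i])
    return c0 * x0 + ct * x_t


def posterior_mean_from_eps(x_t, eps, t):
    """The same mean, rewritten in terms of the noise that was mixed in."""
    i = t - 1
    return (x_t - betas[i] / np.sqrt(1.0 - abar[i]) * eps) / np.sqrt(alphas[i])


rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=8)
eps = rng.standard_normal(8)
t = 400
x_t = np.sqrt(abar[t - 1]) * x0 + np.sqrt(1.0 - abar[t - 1]) * eps

gap = np.max(np.abs(posterior_mean_from_x0(x_t, x0, t)
                    - posterior_mean_from_eps(x_t, eps, t)))
print("largest disagreement between the two forms:", gap)

# Per-timestep ELBO weight beta_t^2 / (2 sigma_t^2 alpha_t (1 - abar_t)),
# evaluated with sigma_t^2 = beta_t (one of the two usual choices).
elbo_weight = betas**2 / (2.0 * betas * alphas * (1.0 - abar))
for step in (1, 10, 100, 500, 1000):
    print(f"t={step:4d}  weight={elbo_weight[step - 1]:.4f}")
```

Under these assumptions the weight is largest at the very first, barely-noisy timestep, which is the over-emphasis the paragraph above refers to.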

The resulting algorithms are almost anticlimactic. Training:

def training_step(x0):                      # x0: batch of clean data
    t = randint(1, T)                       # random noise level
    eps = randn_like(x0)                    # fresh Gaussian noise
    x_t = sqrt(abar[t]) * x0 + sqrt(1 - abar[t]) * eps
    return mse_loss(eps_pred(x_t, t), eps)  # predict the noise

And generation, walking the chain backwards with the learned mean plus a fresh dash of noise at each step (the variance \(\sigma_t^2\) is typically fixed to \(\beta_t\) or \(\tilde\beta_t\)):

def sample():
    x = randn(shape)                        # start from pure noise, t = T
    for t in reversed(range(1, T + 1)):
        eps = eps_pred(x, t)
        mean = (x - beta[t] / sqrt(1 - abar[t]) * eps) / sqrt(alpha[t])
        z = randn_like(x) if t > 1 else 0
        x = mean + sigma[t] * z             # one denoising step
    return x
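The two routines above can be run end to end without training anything if we pick a data distribution simple enough that the ideal noise predictor has a closed form. The sketch below does that with NumPy under loudly stated assumptions: the data is a 1D Gaussian \(\mathcal{N}(0.5, 0.25^2)\), `eps_exact` is the conditional expectation a perfectly trained network would output for that data, and \(\sigma_t^2 = \beta_t\). It is a stand-in for a real model, not a recipe for one.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)             # abar[t - 1] = alpha-bar_t

MU, S = 0.5, 0.25                     # toy data: x0 ~ N(MU, S^2) (assumption of this sketch)


def eps_exact(x, t):
    """Ideal noise prediction E[eps | x_t] for the Gaussian toy data; t is 1-based."""
    a = abar[t - 1]
    var_t = a * S**2 + (1.0 - a)      # variance of x_t
    return np.sqrt(1.0 - a) * (x - np.sqrt(a) * MU) / var_t


def simple_loss(eps_model, x0, rng):
    """Monte Carlo estimate of L_simple, with one random t per example."""
    t = rng.integers(1, T + 1, size=x0.shape[0])      # uniform over {1, ..., T}
    eps = rng.standard_normal(x0.shape)
    a = abar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)


def sample(eps_model, n, rng):
    """Ancestral sampling from pure noise, using sigma_t^2 = beta_t."""
    x = rng.standard_normal(n)                        # t = T
    for t in range(T, 0, -1):
        i = t - 1
        eps = eps_model(x, t)
        mean = (x - betas[i] / np.sqrt(1.0 - abar[i]) * eps) / np.sqrt(alphas[i])
        z = rng.standard_normal(n) if t > 1 else 0.0  # no noise on the last step
        x = mean + np.sqrt(betas[i]) * z
    return x


rng = np.random.default_rng(0)
x0 = MU + S * rng.standard_normal(20_000)
loss_exact = simple_loss(eps_exact, x0, rng)
loss_zero = simple_loss(lambda x, t: np.zeros_like(x), x0, rng)
samples = sample(eps_exact, 20_000, rng)

print("L_simple, ideal predictor :", loss_exact)
print("L_simple, always-zero     :", loss_zero)
print("generated mean / std      :", samples.mean(), samples.std())
```

Even the ideal predictor has a nonzero loss, because the noise cannot be fully recovered from a single noisy observation; what matters is that the chain driven by it lands back on the data distribution.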

Here is the entire pipeline running on a 2D toy dataset, where we can watch both directions at once:

Two rows of five scatter plots. Top row: a two-moons point cloud progressively dissolving into an isotropic Gaussian blob as t increases. Bottom row: the reverse process starting from a Gaussian blob and progressively reforming the two-moons shape.
The full loop on a 2D dataset, echoing the swiss-roll figure that opened Sohl-Dickstein et al. (2015). Top (blue): the fixed forward process dissolves the two-moons distribution into a standard Gaussian. Bottom (purple): ancestral sampling runs the learned chain backwards, reassembling the distribution from pure noise. Every panel is a real simulation (the reverse chain uses the exact score of the mixture, i.e. what a perfectly trained network would learn), not an artist’s impression.

It is worth pausing on why this training recipe is so much more pleasant than a GAN’s. There is no adversary, no minimax game, no mode collapse: just a single network minimizing mean squared error against a fixed, well-defined target at every noise level. Averaged over the dataset, the optimal \(\boldsymbol\epsilon_\theta\) at each point in space is a plain conditional expectation. Stability is the default, not an achievement.

What does the network actually look like? DDPM used a U-Net: a convolutional encoder-decoder with skip connections, natural for image-to-image prediction, with the timestep injected via learned embeddings. Remember that choice; it becomes a bottleneck later in the story.

Four ways to read the same model

The noise-prediction view is standard, but the same trained network admits four equivalent readings, and fluency in switching between them is most of what it takes to read the diffusion literature.

1. Noise prediction. \(\boldsymbol\epsilon_\theta(\mathbf{x}_t, t)\): what noise was added?

2. Clean-data prediction. Invert the mixing equation: \(\hat{\mathbf{x}}_0 = \big(\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon_\theta\big)/\sqrt{\bar\alpha_t}\). Every denoising step secretly computes a full guess of the final image and then only trusts it a tiny bit. (At high noise the guess is a blurry average of many plausible images; the early steps of generation are choosing which of those images to commit to.)

3. Score prediction. The score function is the gradient of the log-density, \(\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\): at every point in image space, an arrow pointing “uphill” toward more plausible data. For the Gaussian-blurred distribution at noise level \(t\), a short calculation (or Tweedie’s formula, below) gives

\[\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) = -\frac{\boldsymbol\epsilon_\theta(\mathbf{x}_t, t)}{\sqrt{1-\bar\alpha_t}}\]

Predicting the noise is estimating the score, up to a known rescaling. This one line is the bridge between the two research traditions in the next section.

4. Velocity prediction. Defining \(\mathbf{v} = \sqrt{\bar\alpha_t}\,\boldsymbol\epsilon - \sqrt{1-\bar\alpha_t}\,\mathbf{x}_0\) (Salimans & Ho, 2022) gives a target that stays well-scaled at all noise levels, where \(\boldsymbol\epsilon\)-prediction degenerates near \(t = T\) (predicting the noise from almost-pure noise says nothing about the image). This numerical detail matters enough that v-prediction became standard in distillation and in models like Stable Diffusion 2’s 768-v checkpoint and Imagen Video.

Tying 2 and 3 together is a lovely classical result, Tweedie’s formula: the best guess of the clean signal given a Gaussian-corrupted observation is the observation nudged by the score,

\[\mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t] = \frac{\mathbf{x}_t + (1-\bar\alpha_t)\,\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t)}{\sqrt{\bar\alpha_t}}\]

Denoising and knowing the shape of the distribution are the same knowledge. That is the deep reason a denoiser can be a generative model at all.
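Since the four readings differ only by algebra, switching between them is a handful of one-line functions. The sketch below writes those conversions in NumPy and checks Tweedie’s formula on a 1D Gaussian toy distribution where the exact score and the exact posterior mean are both known. The function names and the toy parameters are assumptions made for illustration.

```python
import numpy as np


def x0_from_eps(x_t, eps, abar_t):
    """Reading 2: invert the mixing equation."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)


def score_from_eps(eps, abar_t):
    """Reading 3: the score is the noise prediction, rescaled and sign-flipped."""
    return -eps / np.sqrt(1.0 - abar_t)


def v_from_eps(x_t, eps, abar_t):
    """Reading 4: v = sqrt(abar) * eps - sqrt(1 - abar) * x0."""
    x0 = x0_from_eps(x_t, eps, abar_t)
    return np.sqrt(abar_t) * eps - np.sqrt(1.0 - abar_t) * x0


def eps_from_v(x_t, v, abar_t):
    """Back from v to eps (solve the two linear equations for eps)."""
    return np.sqrt(1.0 - abar_t) * x_t + np.sqrt(abar_t) * v


def tweedie_x0(x_t, score, abar_t):
    """E[x0 | x_t] from the score, via Tweedie's formula."""
    return (x_t + (1.0 - abar_t) * score) / np.sqrt(abar_t)


# Round trip on random inputs.
rng = np.random.default_rng(0)
abar_t = 0.3
x0 = rng.uniform(-1.0, 1.0, size=5)
eps = rng.standard_normal(5)
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
v = v_from_eps(x_t, eps, abar_t)
print("eps recovered from v:", np.allclose(eps_from_v(x_t, v, abar_t), eps))

# Tweedie check on toy data x0 ~ N(MU, S^2), where everything is closed form.
MU, S = 0.5, 0.25
var_t = abar_t * S**2 + (1.0 - abar_t)
exact_score = -(x_t - np.sqrt(abar_t) * MU) / var_t
exact_posterior_mean = MU + np.sqrt(abar_t) * S**2 / var_t * (x_t - np.sqrt(abar_t) * MU)
print("Tweedie matches posterior mean:",
      np.allclose(tweedie_x0(x_t, exact_score, abar_t), exact_posterior_mean))
```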

The score-based view: the same idea through another door

Here the story forks, because a second research line arrived at the same place from a completely different direction, and its framing is the one that ultimately unified everything.

Song & Ermon (2019) asked: why not model the score directly? The score neatly sidesteps the curse of likelihood-based models: since the normalization constant \(Z\) of a density does not depend on \(\mathbf{x}\), it vanishes under the gradient, so a network can output score fields without any architectural constraints. And once you have a score field, Langevin dynamics turns it into a sampler: start anywhere and repeatedly take a small step uphill plus a controlled amount of noise,

\[\mathbf{x}_{i+1} = \mathbf{x}_i + \eta\,\nabla_{\mathbf{x}}\log p(\mathbf{x}_i) + \sqrt{2\eta}\,\mathbf{z}_i, \qquad \mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\]

Run long enough with small enough steps, the chain provably samples from \(p\). Elegant, and it fails in practice, for an instructive reason: in high dimensions, real data lives near thin manifolds. A learned score is only accurate where training data was dense, and a Langevin chain initialized randomly starts in the vast emptiness between modes, where the estimated arrows are garbage and mixing between modes takes forever.
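The mixing half of that failure shows up even in one dimension with a perfect score. The sketch below is a toy, not a reproduction of Song & Ermon’s experiments: it runs the Langevin update on a mixture of two narrow, well-separated Gaussians with weights 0.8 and 0.2, using the exact mixture score. The mixture parameters, step size, and step count are all arbitrary choices for illustration.

```python
import numpy as np

WEIGHTS = np.array([0.8, 0.2])       # true mass of the left and right mode
MEANS = np.array([-4.0, 4.0])
SIGMA = 0.3


def mixture_score(x):
    """Exact d/dx log p(x) for the two-component Gaussian mixture."""
    d = x[:, None] - MEANS[None, :]
    logits = np.log(WEIGHTS)[None, :] - d**2 / (2.0 * SIGMA**2)
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)           # responsibility of each mode
    return (resp * (-d)).sum(axis=1) / SIGMA**2


def langevin(x, score_fn, step, n_steps, rng):
    """x <- x + step * score(x) + sqrt(2 * step) * z, repeated n_steps times."""
    for _ in range(n_steps):
        x = x + step * score_fn(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    return x


rng = np.random.default_rng(0)
x_init = rng.standard_normal(10_000)                  # start between the modes
x_final = langevin(x_init, mixture_score, step=1e-3, n_steps=2_000, rng=rng)

frac_left = np.mean(x_final < 0)
print("true mass of left mode      :", WEIGHTS[0])
print("fraction of chains on left  :", frac_left)
print("spread inside the left mode :", x_final[x_final < 0].std())
```

Each chain settles into whichever mode it started closest to and reproduces that mode’s local shape, but the split between modes reflects the initialization rather than the true 0.8 / 0.2 weights, because crossing the empty region between them would take far longer than any practical run.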

The fix should sound familiar: add noise, at many scales. Smearing the data with Gaussian noise inflates those thin manifolds into fat clouds that cover space, giving the score signal everywhere; heavily-noised versions guide the chain from anywhere toward the data region, lightly-noised versions sharpen the final details. Song & Ermon trained one network to estimate the score at every noise level and annealed through them from coarse to fine.

Three panels showing a three-mode 2D distribution at increasing noise levels, each overlaid with purple arrows depicting the score vector field pointing toward the modes; at high noise the three modes have merged into one broad basin.
The score field (purple arrows, capped in length for legibility) of the same three-mode distribution at three noise levels, computed exactly. At low noise the arrows give precise guidance near the modes but the modes are far apart; at high noise the distribution melts into one broad basin whose arrows guide a sampler from anywhere in space. Annealing from right to left is what makes score-based sampling work. Visualization style after Yang Song’s score-based modeling blog.

Squint at this and it is a diffusion model: many noise levels, one network that denoises at all of them, sampling that works its way from heavy noise to none. The DDPM paper itself noted its objective was equivalent to denoising score matching, and via the bridge equation above the two camps’ networks learn the same function in different units.

Song et al. (2021) then delivered the unification. Send the number of steps to infinity and the forward process becomes a stochastic differential equation (SDE), \(\mathrm{d}\mathbf{x} = f(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}\). A classical result (Anderson, 1982) says this can be reversed, and the reverse SDE needs exactly one unknown ingredient: the score. Better still, there exists a deterministic ODE, the probability-flow ODE, that transports the same distributions with no randomness at all:

\[\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} = f(\mathbf{x}, t) - \tfrac{1}{2}\,g(t)^2\,\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\]

DDPM’s step-by-step sampling and annealed Langevin dynamics are just two discretizations of the same continuous object. As Yang Song put it, this is the field’s wave-mechanics-versus-matrix-mechanics moment: two formalisms, one theory. And recasting sampling as solving a differential equation handed the field a whole toolbox it was about to need badly.

Pain point #1: sampling is agonizingly slow

The original DDPM needed a full network forward pass for each of its 1000 steps: minutes per image, versus a single pass for a GAN. The first great wave of diffusion research is best read as an assault on that number.

Three panels showing sampling trajectories from noise to a three-mode 2D distribution: DDPM ancestral sampling paths are erratic random walks, probability-flow ODE paths are smooth curves, rectified flow paths are nearly straight lines. Three highlighted trajectories start from the same noise points in every panel.
How three generations of samplers travel from noise (dark dots) to data (blue cloud), simulated exactly on the same toy distribution. The three highlighted paths start from identical noise draws in each panel. DDPM’s stochastic chain (left) wanders like a drunk with a compass and needs ~1000 steps. The probability-flow ODE that DDIM follows (middle) glides deterministically along curved paths, solvable in tens of steps. Rectified flow (right, met later in the post) learns nearly straight paths, inching toward the one-step ideal. The straight-vs-curved comparison is the central visual device of Liu et al. (2022).

DDIM (Song, Meng & Ermon, 2020) made the first breakthrough, and it is a conceptual one: the training objective only ever depends on the marginals \(q(\mathbf{x}_t \mid \mathbf{x}_0)\), never on the Markov chain connecting them. So after training you are free to swap in a different, non-Markovian generative process with the same marginals, including a fully deterministic one:

\[\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{\mathbf{x}}_0 + \sqrt{1-\bar\alpha_{t-1}}\,\boldsymbol\epsilon_\theta(\mathbf{x}_t, t)\]

(predict the clean image, then re-noise it to level \(t-1\), deterministically). Because there is no noise injection to average out, you can take big strides across the schedule: 50 steps yield near-full quality, and 10 to 20 are usable. Determinism also gave diffusion something GAN users took for granted: a meaningful latent space, where the initial noise \(\mathbf{x}_T\) identifies the output and can be interpolated or inverted for editing.
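Here is the deterministic update as a minimal NumPy sketch that strides over an evenly spaced subset of timesteps. As a stand-in for a trained network it uses the closed-form ideal noise predictor for 1D Gaussian toy data \(\mathcal{N}(0.5, 0.25^2)\); that toy, the even spacing, and the function names are assumptions of this example rather than anything prescribed by the DDIM paper.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1.0 - betas)        # abar[t - 1] = alpha-bar_t

MU, S = 0.5, 0.25                     # toy data: x0 ~ N(MU, S^2) (assumption of this sketch)


def eps_exact(x, t):
    """Ideal noise prediction E[eps | x_t] for the Gaussian toy data; t is 1-based."""
    a = abar[t - 1]
    return np.sqrt(1.0 - a) * (x - np.sqrt(a) * MU) / (a * S**2 + (1.0 - a))


def ddim_sample(eps_model, x_T, n_steps):
    """Deterministic DDIM over n_steps evenly spaced timesteps from T down to 1."""
    ts = np.linspace(T, 1, n_steps).round().astype(int)
    x = x_T
    for k, t in enumerate(ts):
        a_t = abar[t - 1]
        # Noise level we are stepping to; after the last step alpha-bar_0 = 1.
        a_prev = abar[ts[k + 1] - 1] if k + 1 < len(ts) else 1.0
        eps = eps_model(x, t)
        x0_hat = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)   # predict clean data
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps  # re-noise, no randomness
    return x


rng = np.random.default_rng(0)
x_T = rng.standard_normal(20_000)

for n_steps in (20, 50, 200):
    out = ddim_sample(eps_exact, x_T, n_steps)
    print(f"{n_steps:4d} steps: mean={out.mean():.4f}  std={out.std():.4f}")

# Same starting noise, same output: x_T acts as a latent code.
print("deterministic:", np.array_equal(ddim_sample(eps_exact, x_T, 50),
                                        ddim_sample(eps_exact, x_T, 50)))
```

The only source of randomness is the initial draw of `x_T`; the step count is a pure accuracy-versus-cost knob that can be changed after training.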

The SDE view explains why this works, since DDIM turns out to be an integrator of the probability-flow ODE, and it opened the door to importing decades of numerical methods: higher-order solvers like DPM-Solver (2022) cut high-quality sampling to ~10 steps with no retraining, and Karras et al.’s EDM (2022) tidied the whole design space (schedules, parameterizations, solvers) into a modular framework whose choices are still defaults today.

Below ten steps, better solvers hit a wall (the ODE’s curvature, visible in the middle panel above, is the binding constraint), and the field switched strategies: distill the slow model into a fast one. Progressive distillation (2022) trains a student to match two teacher steps in one, then repeats, halving 1024 steps down to 4. Consistency models (Song et al., 2023) train a network to map any point on an ODE trajectory directly to its endpoint, enabling one-step generation with a few-step refinement option. Latent consistency models and adversarially-distilled variants like SDXL-Turbo (both 2023) brought real-time generation to production models. The thousand-step tax has been, for most practical purposes, repealed.

Pain point #2: control

A perfect unconditional model is a slot machine: pull the lever, get a random plausible image. Almost every application instead wants “a photo of an astronaut riding a horse”. The question is how to steer.

The score view makes conditioning look easy. Bayes’ rule, hit with \(\nabla_{\mathbf{x}}\log\), turns multiplication into addition:

\[\nabla_{\mathbf{x}}\log p(\mathbf{x} \mid c) = \nabla_{\mathbf{x}}\log p(\mathbf{x}) + \nabla_{\mathbf{x}}\log p(c \mid \mathbf{x})\]

Scores of composable pieces just add. Classifier guidance (Dhariwal & Nichol, 2021) implemented exactly this: train a classifier on noisy images, add its gradient (scaled up for stronger effect) to the unconditional score during sampling. Combined with architecture tuning, it produced the “diffusion models beat GANs” moment on ImageNet. But it needs a separate classifier robust to every noise level, and gradients through a classifier are a noisy steering signal.

Classifier-free guidance (Ho & Salimans, 2021) is the trick that ate the world. During training, randomly drop the conditioning signal (replace the caption with a null token ∅ maybe 10% of the time), so a single network learns both the conditional and unconditional scores. At sampling time, evaluate both and extrapolate past the conditional prediction, away from the unconditional one:

\[\tilde{\boldsymbol\epsilon} = \boldsymbol\epsilon_\theta(\mathbf{x}_t, \varnothing) + w\,\big(\boldsymbol\epsilon_\theta(\mathbf{x}_t, c) - \boldsymbol\epsilon_\theta(\mathbf{x}_t, \varnothing)\big)\]

The difference vector is an implicit classifier (it points in the direction that makes the image more like the prompt), and \(w\) turns that knob. With \(w = 1\) you get plain conditional sampling; production text-to-image systems run \(w\) around 5 to 8, deliberately over-sharpening toward the prompt at a known cost: diversity shrinks and images drift toward over-saturated, archetypal renderings. That fidelity-diversity dial is visible even in two dimensions:

Three scatter plots over the same four-cluster density contours. Unconditional samples cover all four clusters; conditional samples concentrate on the two left clusters with some spillover toward the boundary; strongly guided samples pull tightly into the left clusters and away from the class boundary.
Classifier-free guidance simulated exactly on a 2D mixture with two classes (left clusters = class A, right = class B; grey contours show the unconditional density), in the spirit of Figure 1 of Ho & Salimans (2021). At w=0 sampling ignores the label. At w=1 samples follow the true class-A conditional, including its fuzzy boundary with B. At w=4 the extrapolated score pushes samples away from anything B-like: cleaner class identity, but the samples huddle around the class cores, visibly under-covering the true conditional. That is exactly the fidelity-versus-diversity tradeoff of high guidance scales in image models.

A convention trap when reading papers. The original paper writes guidance as \((1+w)\,\boldsymbol\epsilon_c - w\,\boldsymbol\epsilon_\varnothing\), where \(w=0\) means plain conditional sampling with no extra guidance; most codebases (and the formula above) use the scale where \(w=1\) means the same thing. Stable Diffusion’s default “CFG scale 7.5” is the second convention, equivalent to \(w = 6.5\) in the paper’s notation. Same algebra, off-by-one bookkeeping.
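Both conventions fit in a few lines, which makes the off-by-one easy to see. In the sketch below the two noise predictions are random placeholder arrays; in a real sampler they would come from two forward passes of the same network, one with the prompt and one with the null token. The function names are illustrative.

```python
import numpy as np


def cfg_codebase(eps_uncond, eps_cond, scale):
    """Codebase convention: scale = 1 is plain conditional sampling."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def cfg_paper(eps_uncond, eps_cond, w):
    """Ho & Salimans convention: w = 0 is plain conditional sampling."""
    return (1.0 + w) * eps_cond - w * eps_uncond


rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 8))   # placeholder for eps_theta(x_t, null)
eps_cond = rng.standard_normal((4, 8))     # placeholder for eps_theta(x_t, c)

guided = cfg_codebase(eps_uncond, eps_cond, 7.5)
print("scale 7.5 equals paper w 6.5:",
      np.allclose(guided, cfg_paper(eps_uncond, eps_cond, 6.5)))
print("scale 1 is the conditional prediction:",
      np.allclose(cfg_codebase(eps_uncond, eps_cond, 1.0), eps_cond))
print("scale 0 is the unconditional prediction:",
      np.allclose(cfg_codebase(eps_uncond, eps_cond, 0.0), eps_uncond))
```

Note the price tag hidden in the signature: guidance needs both predictions at every denoising step, so it roughly doubles the network evaluations per sample unless the two passes are batched together.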

With guidance in hand, text-to-image exploded: GLIDE (December 2021), then in 2022 DALL·E 2 (a diffusion decoder driven by CLIP embeddings) and Imagen (a frozen large language model as text encoder, feeding a cascade of pixel-space diffusion models). Diffusion stopped being a research curiosity and became a consumer product.

Pain point #3: pixels are expensive

Every denoising step in DDPM-style models runs the full network at full image resolution, over and over. At 512×512×3 that is 786k values being pushed through a U-Net dozens to hundreds of times, and most of that capacity is spent modeling imperceptible high-frequency texture. Two ideas broke the deadlock, and together they define the modern stack.

Latent diffusion (Rombach et al., 2021) split the job in two: first train an autoencoder that compresses images into a perceptually-equivalent latent space, then run the entire diffusion process on those latents. A 512×512×3 image becomes a 64×64×4 latent: 48× fewer values for the denoiser to process at every single step. The autoencoder handles “make it look like a crisp photograph”; diffusion handles the part it is uniquely good at, the global semantic composition. Text conditioning enters through cross-attention layers in the denoiser. This architecture, trained at scale and released with open weights, is Stable Diffusion, and its August 2022 release put frontier-quality image generation on consumer GPUs and spawned an entire ecosystem (DreamBooth, ControlNet, LoRA fine-tunes) almost overnight.

Schematic comparing pixel-space diffusion, where the denoising network runs on the full 512 by 512 image at every step, with latent diffusion, where an encoder compresses the image 48 times before the diffusion loop and a decoder reconstructs it afterwards; a label marks where the text prompt enters via cross-attention.
Where the expensive loop runs. Pixel-space diffusion (top) pays for full resolution on every denoising step. Latent diffusion (bottom) pays the encoder/decoder cost once and runs the many-step loop in a 48× smaller space, with the prompt injected into the denoiser via cross-attention. This is the architecture of Stable Diffusion, simplified from Figure 3 of Rombach et al. (2021).
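
The 48× figure is plain arithmetic on tensor sizes, spelled out below. The step count is a hypothetical value chosen only for illustration, and counting values is a rough proxy: real network cost does not scale exactly linearly with input size.

```python
def tensor_size(height, width, channels):
    """Number of scalar values in an image or latent tensor."""
    return height * width * channels

pixel_values = tensor_size(512, 512, 3)    # 786,432
latent_values = tensor_size(64, 64, 4)     # 16,384
compression = pixel_values // latent_values
spatial_downsample = 512 // 64             # each side shrinks by this factor

steps = 50  # hypothetical number of denoising steps
pixel_work = steps * pixel_values    # values touched by a pixel-space sampler
latent_work = steps * latent_values  # values touched by a latent sampler

print(pixel_values, latent_values, compression, spatial_downsample)
```

The ratio between `pixel_work` and `latent_work` stays at 48 regardless of the step count, which is why the saving compounds over the sampling loop while the encoder and decoder are paid for only once.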

Diffusion transformers (DiT) (Peebles & Xie, 2022) then replaced the venerable U-Net itself. Chop the latent into patches, feed them to a transformer as tokens, condition on the timestep via adaptive layer norm, and something important happens: image generation inherits the transformer’s clean, predictable scaling behavior. Compute in, quality out, no architectural hand-tuning. That property is why DiT became the backbone of the next generation, including Stable Diffusion 3 and, as it later emerged, Sora.
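
To make “chop the latent into patches” and “condition via adaptive layer norm” concrete, here is a minimal NumPy sketch. It is a simplification under stated assumptions, not the DiT implementation: the patch size, the weight matrices `w_shift` and `w_scale`, and the conditioning vector are hypothetical, and a real model would first project tokens to a hidden width and derive the shift and scale from an MLP on the timestep embedding.

```python
import numpy as np

def patchify(latent, patch):
    """Split a (C, H, W) latent into (num_patches, patch*patch*C) tokens."""
    c, h, w = latent.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    x = latent.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 2, 4, 0)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def layer_norm(x, eps=1e-6):
    """Normalize each token over its feature axis, with no learned affine."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln_modulate(tokens, cond, w_shift, w_scale):
    """Adaptive layer norm: the shift and scale come from the conditioning."""
    shift = cond @ w_shift  # (D,)
    scale = cond @ w_scale  # (D,)
    return layer_norm(tokens) * (1.0 + scale) + shift

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 64, 64))  # a 64x64x4 latent, channels first
tokens = patchify(latent, patch=2)         # 32*32 tokens of width 2*2*4
cond = rng.standard_normal(32)             # stand-in timestep embedding
w_shift = 0.02 * rng.standard_normal((32, tokens.shape[1]))
w_scale = 0.02 * rng.standard_normal((32, tokens.shape[1]))
out = adaln_modulate(tokens, cond, w_shift, w_scale)
print(tokens.shape, out.shape)
```

Because the timestep enters only through a per-feature shift and scale, the transformer blocks themselves stay unchanged; with both weight matrices at zero the layer reduces to an ordinary layer norm.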

The reframing: flow matching

By 2022 the theory stack had grown baroque: forward chains, ELBOs, posteriors, SDE reversals. Two nearly simultaneous papers, rectified flow (Liu et al., Sep 2022) and flow matching (Lipman et al., Oct 2022), asked the impolite question: what if we skip all of it?

Their recipe fits in three lines. Draw a noise sample \(\mathbf{x}_0 \sim \mathcal{N}(\mathbf{0},\mathbf{I})\) and a data sample \(\mathbf{x}_1\). Connect them with a straight line, \(\mathbf{x}_t = (1-t)\,\mathbf{x}_0 + t\,\mathbf{x}_1\). Train a network to predict the velocity along it:

\[\mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\,\mathbf{x}_0,\,\mathbf{x}_1}\Big[\big\lVert \mathbf{v}_\theta(\mathbf{x}_t, t) - (\mathbf{x}_1 - \mathbf{x}_0) \big\rVert^2\Big]\]

(Note the time convention flips here: flow matching runs \(t\) from 0 at noise to 1 at data, the reverse of diffusion notation. Nearly every reader trips on this once.) To generate, integrate the learned ODE \(\mathrm{d}\mathbf{x}/\mathrm{d}t = \mathbf{v}_\theta(\mathbf{x}, t)\) from noise to data. That’s the whole framework: no Markov chain, no ELBO, no score, no thermodynamics. The sampling paths that result are not perfectly straight: the network learns the expected velocity over all conditional lines passing through a point, so where those lines cross, the averaged path bends. They are still far straighter than diffusion’s, as the right-hand panel of the sampler figure shows, and straighter paths tolerate coarser integration steps at no extra cost. Rectified flow adds a “reflow” step that iteratively straightens them further, marching toward one-step generation.
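
The recipe above can be sketched in a few lines of NumPy. This is a toy under loud assumptions: instead of training a network, it uses a degenerate dataset consisting of a single point `c`, for which the exact velocity field is known in closed form, \(\mathbf{v}(\mathbf{x}, t) = (\mathbf{c} - \mathbf{x})/(1-t)\). The function names are illustrative.

```python
import numpy as np

def fm_batch(x1, rng):
    """Build one flow-matching batch: points on the line and their targets."""
    x0 = rng.standard_normal(x1.shape)                  # noise endpoint
    t = rng.uniform(0.0, 1.0, size=(x1.shape[0], 1))    # one time per sample
    xt = (1.0 - t) * x0 + t * x1                        # straight-line interpolation
    target = x1 - x0                                    # velocity along that line
    return xt, t, target

def fm_loss(velocity_fn, xt, t, target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean(np.sum((velocity_fn(xt, t) - target) ** 2, axis=1)))

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data)."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = np.full((x.shape[0], 1), k * dt)
        x = x + dt * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
c = np.array([2.0, -1.0])        # toy dataset: every data sample equals c
x1 = np.tile(c, (256, 1))

def oracle(x, t):
    """Exact velocity field for the point-mass dataset (toy assumption)."""
    return (c - x) / (1.0 - t)

xt, t, target = fm_batch(x1, rng)
loss = fm_loss(oracle, xt, t, target)
one_step = euler_sample(oracle, rng.standard_normal((8, 2)), num_steps=1)
four_step = euler_sample(oracle, rng.standard_normal((8, 2)), num_steps=4)
print(loss < 1e-9, np.allclose(one_step, c), np.allclose(four_step, c))
```

In this toy the paths are exactly straight, so a single Euler step already lands on the data point. With a real dataset the conditional lines cross, the learned field bends, and more steps are needed, which is the gap that reflow tries to close.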

The punchline is that this is not a rival theory. Diffusion’s noising process is recovered as one particular (curved) choice of interpolation path, and the velocity, score, noise and clean-data parameterizations remain affine translations of one another. Flow matching is diffusion with the historical scaffolding removed, and the field voted with its feet: Stable Diffusion 3 and FLUX are rectified-flow transformers, and flow matching is now the default formulation in most new work, image or otherwise.
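
On the straight-line path these translations can be written out directly. The sketch below assumes the flow-matching convention used above (\(\mathbf{x}_0\) is noise, \(\mathbf{x}_1\) is data) and checks the conversions per sample against known endpoints; the function names are illustrative. The score conversion relies on the Gaussian conditional path and is only defined for \(t < 1\).

```python
import numpy as np

def velocity_to_data(xt, t, v):
    """Clean-data estimate: x1 = xt + (1 - t) * v."""
    return xt + (1.0 - t) * v

def velocity_to_noise(xt, t, v):
    """Noise estimate: x0 = xt - t * v."""
    return xt - t * v

def noise_to_score(eps, t):
    """Score for the path xt = (1 - t) * x0 + t * x1, valid for t < 1."""
    return -eps / (1.0 - t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((5, 3))   # noise endpoint
x1 = rng.standard_normal((5, 3))   # stand-in data endpoint
t = 0.3
xt = (1.0 - t) * x0 + t * x1
v = x1 - x0                        # the per-sample velocity target

data_hat = velocity_to_data(xt, t, v)
noise_hat = velocity_to_noise(xt, t, v)
score_hat = noise_to_score(noise_hat, t)
print(np.allclose(data_hat, x1), np.allclose(noise_hat, x0))
```

Each conversion is affine in the prediction once \(\mathbf{x}_t\) and \(t\) are known, so a network trained in one parameterization can be read out in any of the others without retraining.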

There is a satisfying irony in where this landed. Diffusion earned its slot-machine reputation on stochastic sampling, yet the mature form of the field is deterministic transport: learn a velocity field, solve an ODE. What survived from the original thermodynamic story is not the randomness but the two ideas underneath it: corrupt data gradually toward a simple distribution, and learn only the local rule for going back.

Everywhere at once

Track the whole arc on one map before the final stops:

Timeline from 2015 to 2025 with five lanes: foundations and reframings, faster sampling, control and guidance, scale and architecture, and new domains. Milestones include DDPM in 2020, DDIM and score SDEs, classifier-free guidance and latent diffusion in 2021, the text-to-image wave and flow matching in 2022, consistency models in 2023, Sora, SD3 and AlphaFold 3 in 2024, and diffusion LLMs in 2025.
A decade of diffusion in five lanes. Larger dots mark field-defining moments. Reading down any vertical slice shows how the threads fed each other: the 2020-21 foundations enabled the 2022 scale-up, whose cost pressures drove the sampling-speed lane, while flow matching quietly replaced the theory underneath the whole stack.

Notice how the “new domains” lane fills up precisely as the core technology stabilizes. The recipe (choose a corruption process, learn to undo it step by step, guide it with whatever conditioning you have) turns out to care very little about what is being corrupted:

  • Video. Video Diffusion Models (2022) extended the U-Net across time; Sora (Feb 2024) made the leap to a DiT over spacetime patches of compressed video latents, producing minute-long coherent clips, and Veo 3 (2025) added synchronized native audio. The framing that video generation might serve as a “world simulator” traces straight back to these models.
  • Science. AlphaFold 3 (2024) replaced its predecessor’s structure module with a diffusion module that denoises raw 3D atom coordinates, extending structure prediction beyond proteins to DNA, RNA and ligand complexes. Molecule and materials generation use the same pattern on atomic graphs and coordinates.
  • Robotics. Diffusion Policy (2023) generates robot action sequences conditioned on observations. The properties that matter here are exactly diffusion’s strengths: multimodality (two valid ways around an obstacle should not average into the obstacle) and stable training on small datasets.
  • Text. The odd one out, since tokens are discrete and language modeling has been dominated by autoregressive models. But masked-diffusion language models (LLaDA, 2025) and commercial systems (Mercury, Gemini Diffusion, both 2025) generate whole sequences in parallel and refine them over a few steps, trading the left-to-right bottleneck for reported thousand-plus tokens-per-second decoding and the ability to edit globally mid-generation. Whether this challenges autoregression at frontier quality is genuinely open as of this writing; the interesting part is why it might: the iterative-refinement principle transfers even when “noise” means masked tokens rather than Gaussians.
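
To see what “noise means masked tokens” can look like, here is a minimal sketch of a masking corruption process. It assumes one common choice, replacing each token independently with a mask symbol with probability \(t\); the `MASK` id and the function name are hypothetical, and this is not presented as the exact procedure of any system named above.

```python
import numpy as np

MASK = -1  # hypothetical id reserved for the mask symbol

def mask_tokens(tokens, t, rng):
    """Corrupt a token sequence: each position is masked with probability t."""
    tokens = np.asarray(tokens)
    is_masked = rng.random(tokens.shape) < t
    return np.where(is_masked, MASK, tokens), is_masked

rng = np.random.default_rng(0)
sentence = np.array([12, 7, 431, 9, 88, 5])  # stand-in token ids

clean, _ = mask_tokens(sentence, 0.0, rng)        # t = 0: nothing is destroyed
half, half_mask = mask_tokens(sentence, 0.5, rng)
blank, _ = mask_tokens(sentence, 1.0, rng)        # t = 1: everything is masked
print(clean, half, blank)
```

The analogy to the Gaussian case is direct: \(t=0\) leaves the data intact, \(t=1\) reaches a fixed, structureless state, and a model trained to fill in masked positions plays the role of the denoiser.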

What to take away

If the details fade, keep these five compressions:

  1. Diffusion converts generation into supervised learning. One impossible leap becomes a thousand regressions with known targets. Stability is structural, not lucky.
  2. One network, four costumes. Noise, clean data, score, velocity: affine translations of the same learned function. Most “new” formulations in this field are a change of variables.
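  3. Denoising is distribution knowledge. Tweedie’s formula ties the optimal denoiser to the score of the noised data distribution: the optimal denoised estimate is, up to the known signal scale, the noisy input plus the noise variance times that score. That identity, not any architecture, is the load-bearing wall.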
  4. The history is a list of named pain points. Slow sampling → DDIM, solvers, distillation, consistency. No control → guidance. Pixel cost → latents. Scaling → transformers. If you remember the problems, the papers arrange themselves.
  5. The mature form is transport. Flow matching restates a decade of machinery as: draw a line from noise to data, learn the velocity, solve the ODE. Expect new domains to adopt that formulation directly and skip the thermodynamics.

A useful mental model to leave with: a diffusion model is a sculptor who was never taught to sculpt, only to look at a block and say what does not belong. Applied once, that skill is trivial; applied a thousand times in sequence, it is indistinguishable from creation.

Sources and further reading

Citation Information

If you find this content useful & plan on using it, please consider citing it using the following format:

@misc{nish-blog,
  title = {Diffusion Models: Learning to Create by Learning to Destroy},
  author = {Nish},
  howpublished = {\url{https://www.nishbhana.com/Diffusion-Models/}},
  note = {[Online; accessed]},
  year = {2026}
}
