Yumo Xu

Oct 31 2025

Diffusion models now underpin many of the most compelling GenAI applications, spanning images (Stable Diffusion), video (Sora), audio (Stable Audio), text (Gemini Diffusion), and increasingly multimodal experiences. Recent advances can make the landscape feel extremely fast‑moving, but the core intuition still fits on a page. This post is a minimal‑math guide to the essentials: how forward noising works, how the reverse one‑step target is set up, and why the $\epsilon$‑MSE objective appears. If you’re comfortable with Gaussians and basic variance identities, you have all you need.

Series note: This is Part 1 of a short series on diffusion. It serves as a primer on the core mechanics rather than a survey of all variants. Future posts will build on these basics, exploring more advanced topics such as latent diffusion, text conditioning/guidance, faster samplers, DiT, and consistency/flow-matching methods.

1. Overview

Figure: Forward vs. Reverse. We corrupt data with a controlled noise schedule (forward) and train a model to undo one step of corruption at a time. Training uses one-shot noising to create supervision; generation applies the learned one-step denoiser repeatedly from pure noise back to data. Most modern image systems use latent diffusion (diffusing in a compressed latent space rather than pixels) for speed and quality; the illustrated mechanics still apply.

It’s straightforward to add a bit of noise to any given sample (e.g., an image), and to do so repeatedly. In the context of diffusion models, we call this noising chain the forward process and model each forward step with $q(x_{t}|x_{t-1})$, $t\in[1, T]$. Specifically, we introduce a variance schedule $\beta_t$ and define each step as:

$$ q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I) $$

Intuition: Think of $\beta_t$ as a volume knob for noise. Early steps (when $t$ is small) keep structure with high signal‑to‑noise ratio (SNR). Late steps (when $t$ is large) wash it out, leading to low SNR.
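
To make the schedule concrete, here is a minimal sketch of one forward step, assuming PyTorch, a linear schedule, and the commonly used values $\beta_1 = 10^{-4}$, $\beta_T = 0.02$, $T = 1000$ (all illustrative choices, not requirements):

```python
import torch

T = 1000                                  # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)     # linear variance schedule (illustrative values)

def forward_step(x_prev: torch.Tensor, t: int) -> torch.Tensor:
    """One noising step: q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    beta_t = betas[t]
    eps = torch.randn_like(x_prev)        # fresh Gaussian noise
    return torch.sqrt(1.0 - beta_t) * x_prev + torch.sqrt(beta_t) * eps

# Noise a stand-in "image" step by step; early steps barely change it, late steps wash it out.
x = torch.randn(1, 3, 32, 32)             # placeholder for a real data sample x_0
for t in range(T):
    x = forward_step(x, t)
```

The $\sqrt{1-\beta_t}$ factor slightly shrinks the existing signal while $\beta_t$ worth of fresh noise is mixed in, which is exactly what pushes the SNR down as $t$ grows.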

Our end goal is to create a meaningful sample $x_0$ from pure noise $x_{T}$. This requires modeling the reverse process $x_t \rightarrow x_{t-1}$. However, unlike the forward process, the true reverse conditional $q(x_{t-1}|x_t)$ is intractable. To address this, we learn a reverse model $p_\theta(x_{t-1}|x_t)$, parameterized by $\theta$, to denoise any given input:

$$ p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2I) $$

At inference time, we use the learned model to step from pure noise $x_T \sim \mathcal{N}(0, I)$ back down to $x_0$.
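
In code, one reverse step is just a Gaussian draw around the model’s predicted mean. Below is a minimal sketch, where `model` stands for a hypothetical trained network that outputs $\mu_\theta(x_t, t)$ directly and `sigmas` holds the per-step standard deviations $\sigma_t$; both names are assumptions for illustration (in practice the mean is usually computed from an $\epsilon$-prediction, which we get to later):

```python
import torch

def reverse_step(model, x_t: torch.Tensor, t: int, sigmas: torch.Tensor) -> torch.Tensor:
    """One denoising step: sample x_{t-1} ~ N(mu_theta(x_t, t), sigma_t^2 * I)."""
    mu = model(x_t, t)                    # model's estimate of the one-step denoised mean
    if t == 0:
        return mu                         # convention: no fresh noise on the final step
    z = torch.randn_like(x_t)             # z ~ N(0, I)
    return mu + sigmas[t] * z
```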

Intuition: Why diffusion at all? In the forward process, we purposely corrupt data step by step, so that reconstruction becomes a sequence of small denoising moves. At step $t$, the model answers: “Given this partly noisy sample and the noise level, what tiny correction can I make to raise the signal‑to‑noise ratio a bit?” Over many steps these nudges add up to a clean sample. This staged problem is easier to learn than mapping pure noise straight to data in one jump.
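
Putting the nudges together, generation is just the one-step denoiser applied in a loop from $t = T$ down to $t = 1$. Here is a self-contained sketch; the stub network, the noise scales, and the tensor shape are all placeholder assumptions, not trained components:

```python
import torch

T = 1000
sigmas = torch.linspace(1e-4, 2e-2, T)    # placeholder per-step noise scales sigma_t

def stub_mu(x_t: torch.Tensor, t: int) -> torch.Tensor:
    """Stand-in for a trained network predicting mu_theta(x_t, t); it just shrinks x_t a little."""
    return 0.99 * x_t

@torch.no_grad()
def sample(shape=(1, 3, 32, 32)) -> torch.Tensor:
    x = torch.randn(shape)                # start from pure noise x_T ~ N(0, I)
    for t in reversed(range(T)):          # t = T-1, ..., 0
        mu = stub_mu(x, t)
        x = mu if t == 0 else mu + sigmas[t] * torch.randn_like(x)
    return x                              # many small denoising moves accumulate into x_0

x_0 = sample()
```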

Next, we will introduce approaches that use these processes efficiently:

  1. Creating training data: sample $x_t$ in one shot from $x_0$ via the forward process.
  2. Generating samples: apply the learned one-step denoiser $p_\theta$ repeatedly from $x_T$ back to $x_0$ via the reverse process.

2. Forward process: data creation