Yumo Xu | Website | LinkedIn | X
Nov 13 2025
Stable Diffusion succeeds by performing diffusion in a latent space, not pixels. Latents act as the interface between pixels and semantics — a compact space where denoising becomes both cheaper and easier to steer with language. This post explains why that design won, how the pieces fit (VAE → denoiser → guidance), and what actually changes from pixel-space DDPM.
Series note: This is Part 2 of a short series on diffusion. We assume you are comfortable with the forward/reverse basics of diffusion models, which are covered in Part 1: From Noise to Data: A Minimal-Math Guide to Diffusion Models.
Framework of latent diffusion (Rombach et al., 2022). The image encoder and decoder are first pre-trained with VAE objectives. They are then frozen during diffusion training, where the denoiser is trained jointly with the text encoder for conditioning. During inference, the first row (image encoder and forward diffusion) is discarded; the system denoises from a pure-noise latent z_T guided by a text prompt y.
Stable Diffusion is built on top of latent diffusion. As shown in the figure above, there are three major components in latent diffusion:

- a VAE encoder/decoder that maps between pixel space and latent space,
- a denoiser (a U-Net) that runs the diffusion process in latent space,
- a text encoder that conditions the denoiser on the prompt.
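To make the division of labor concrete, here is a toy sketch of the inference path (the second row of the figure): encode the prompt, start from a pure-noise latent z_T, iteratively denoise, then decode once with the VAE. The placeholder networks and the schematic update rule are illustrative stand-ins, not Stable Diffusion's actual architecture or sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(prompt: str, dim: int = 16) -> np.ndarray:
    """Placeholder text encoder: map the prompt to a fixed-size vector."""
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

def denoiser(z_t: np.ndarray, t: int, cond: np.ndarray) -> np.ndarray:
    """Placeholder noise predictor eps(z_t, t, y); the real model is a conditional U-Net."""
    return 0.1 * z_t  # pretend the noise estimate is proportional to the latent

def vae_decode(z: np.ndarray) -> np.ndarray:
    """Placeholder VAE decoder: upsample a 4x64x64 latent to 3x512x512 pixels."""
    up = z[:3].repeat(8, axis=1).repeat(8, axis=2)  # nearest-neighbor x8 per spatial axis
    return np.clip(up, -1.0, 1.0)

# Inference: denoise in latent space, decode to pixels only at the end.
T = 50
z = rng.standard_normal((4, 64, 64))   # z_T ~ N(0, I), pure-noise latent
cond = encode_text("a photo of a corgi")
for t in range(T, 0, -1):
    eps = denoiser(z, t, cond)         # predict noise in latent space
    z = z - eps / T                    # schematic update; real samplers follow DDPM/DDIM rules
image = vae_decode(z)                  # single decoder pass back to pixel space
print(image.shape)  # (3, 512, 512)
```

Note that every denoising step touches only the 4×64×64 latent; the expensive pixel-resolution decoder runs exactly once.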
Why latents? Pixel-space diffusion models (e.g., DDPMs on 3×512×512 images) worked, but were costly and struggled to align global semantics: every pixel was treated equally, even when the text only affected high-level content. Suppose we represent a 3×512×512 image with a 4×64×64 latent (×8 spatial downsampling, 4 channels). That is a ~48× smaller tensor with 64× fewer spatial positions, with corresponding compute/memory savings in the attention and U-Net blocks, while the VAE preserves semantic structure that is easier to denoise and condition. This table provides a high-level comparison between the two:
| Aspect | Pixel space | Latent space |
|---|---|---|
| Tensor dimension | 3×512×512 | 4×64×64 |
| Conditioning | Low-level textures | High-level semantics |
| Step cost | High | Low |
| Sampling speed | Slow | Fast |
Pixels → latents → pixels

Say you have a 3×512×512 RGB image. To compress it into a smaller latent representation (say, 4×64×64), the VAE goes through the following stages: