Yumo Xu | Website | LinkedIn | X

Nov 13 2025

Stable Diffusion succeeds by performing diffusion in a latent space, not pixels. Latents act as the interface between pixels and semantics — a compact space where denoising becomes both cheaper and easier to steer with language. This post explains why that design won, how the pieces fit (VAE → denoiser → guidance), and what actually changes from pixel-space DDPM.

Series note: This is Part 2 of a short series on diffusion. We assume you're comfortable with the forward/reverse basics of diffusion models, covered in Part 1: From Noise to Data: A Minimal-Math Guide to Diffusion Models.

1. Why latents won

Framework of latent diffusion (Rombach et al., 2022). The image encoder and decoder are first pre-trained with VAE objectives. They are then frozen during diffusion training, where the denoiser is trained jointly with the text encoder for conditioning. During inference, the first row (image encoder and forward diffusion) is discarded; the system denoises from a pure-noise latent z_T and a text prompt y.

Stable Diffusion is built on top of latent diffusion. As shown in the figure above, there are three major components in latent diffusion:

  1. VAE: the semantic bottleneck that maps from the pixel space to the semantic space
  2. U-Net: the denoiser that produces the next, less noisy latent from the current noisy latent
  3. Conditioning & guidance: the controller that programs the generation with a prompt

Why latents? Pixel-space diffusion models (e.g., DDPMs on 3×512×512 images) worked, but they were costly and struggled to align global semantics: every pixel was treated equally, even though text mostly affects high-level content. Suppose we represent a 3×512×512 image with a 4×64×64 latent (×8 spatial downsampling, 4 channels). That's 64× fewer spatial positions and ~48× fewer values overall, with comparable compute/memory savings for attention and U-Net blocks, while the VAE preserves semantic structure that's easier to denoise and condition. This table provides a high-level comparison between the two:

| Aspect | Pixel space | Latent space |
| --- | --- | --- |
| Tensor dimension | 3×512×512 | 4×64×64 |
| Conditioning | Low-level textures | High-level semantics |
| Step cost | High | Low |
| Sampling speed | Slow | Fast |
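
As a quick sanity check on the savings, the factors follow directly from the example shapes above (the exact numbers depend on the VAE's downsampling ratio and channel count):

```python
# Element counts for the 3×512×512 pixel tensor vs. the 4×64×64 latent
pixel_values = 3 * 512 * 512    # 786,432 values
latent_values = 4 * 64 * 64     # 16,384 values

print(pixel_values / latent_values)        # 48.0   -> ~48x fewer values overall
print((512 * 512) / (64 * 64))             # 64.0   -> 64x fewer spatial positions
print(((512 * 512) / (64 * 64)) ** 2)      # 4096.0 -> self-attention cost scales quadratically in positions
```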

2. VAE: the semantic bottleneck

**Pixels → latents → pixels.** Say you have a 3×512×512 RGB image. To compress it into a smaller latent representation (say, 4×64×64), the VAE goes through the following stages: