An interactive exploration of how Stable Diffusion transforms random noise into stunning images through iterative denoising in latent space.
Stable Diffusion is a latent text-to-image diffusion model. Unlike earlier diffusion models that operated on full-resolution pixel space, it compresses images into a lower-dimensional latent space using a Variational Autoencoder (VAE), then learns to denoise in that compressed representation — making it fast enough to run on consumer GPUs.
Instead of 512×512×3 pixel space, Stable Diffusion works in 64×64×4 latent space — a 48× compression. The VAE encoder/decoder handles the translation.
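Here is a shape round-trip sketch using diffusers' `AutoencoderKL`. The `stabilityai/sd-vae-ft-mse` checkpoint is one public example, and the input here is random noise standing in for a real image:

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

image = torch.randn(1, 3, 512, 512)   # stand-in for a real 512x512 RGB image
latent = vae.encode(image).latent_dist.sample()
print(latent.shape)                   # torch.Size([1, 4, 64, 64])

recon = vae.decode(latent).sample
print(recon.shape)                    # torch.Size([1, 3, 512, 512])
```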
The forward process gradually adds Gaussian noise over T timesteps. The model learns to reverse this — predicting the noise to subtract at each step.
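The forward process has a convenient closed form: x_t = √(ᾱ_t)·x_0 + √(1−ᾱ_t)·ε, where ᾱ_t is the cumulative product of (1−β). A minimal sketch, assuming DDPM's default linear β schedule (exact values vary by model):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # noise schedule beta_1..beta_T
alphas_bar = torch.cumprod(1.0 - betas, dim=0) # cumulative signal retention

def add_noise(x0, t):
    """Jump straight from clean x0 to noisy x_t at timestep t."""
    eps = torch.randn_like(x0)
    a = alphas_bar[t]
    xt = a.sqrt() * x0 + (1 - a).sqrt() * eps
    return xt, eps                             # eps is the training target

x0 = torch.randn(1, 4, 64, 64)                 # pretend latent
xt, eps = add_noise(x0, t=500)
```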
CLIP's text encoder transforms prompts into embeddings. Cross-attention layers in the U-Net attend to these embeddings, steering the denoising toward the described scene.
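A toy single-head cross-attention step shows the mechanism: queries come from the flattened spatial latent, keys and values from the text embeddings. Dimensions are illustrative (320 channels is one U-Net level in SD 1.x; 768 is CLIP's width):

```python
import torch
import torch.nn.functional as F

d = 320                                 # channel dim at one U-Net level
spatial = torch.randn(1, 64 * 64, d)    # flattened image features (4096 tokens)
text = torch.randn(1, 77, 768)          # CLIP token embeddings

to_q = torch.nn.Linear(d, d, bias=False)
to_k = torch.nn.Linear(768, d, bias=False)
to_v = torch.nn.Linear(768, d, bias=False)

q, k, v = to_q(spatial), to_k(text), to_v(text)
out = F.scaled_dot_product_attention(q, k, v)   # [1, 4096, 320]
```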
At inference, the model runs both conditional and unconditional predictions. The guided output = unconditional + scale × (conditional − unconditional), amplifying prompt adherence.
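As a sketch of that formula in code: the `unet` call mirrors diffusers' `UNet2DConditionModel`, but the helper function and its argument names are illustrative:

```python
import torch

def guided_noise(unet, z_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Classifier-free guidance: run both branches in one batched pass."""
    emb = torch.cat([uncond_emb, cond_emb])
    eps = unet(torch.cat([z_t, z_t]), t, encoder_hidden_states=emb).sample
    eps_uncond, eps_cond = eps.chunk(2)
    # guided = unconditional + scale * (conditional - unconditional)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

At scale 1 this collapses to the plain conditional prediction; typical values sit around 7 to 8.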
The core insight: destroying structure is easy (add noise), learning to restore it is the hard part. The model is trained to predict ε — the noise that was added at each timestep.
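In code, the objective is just an MSE between the true and predicted noise. A sketch reusing `T` and `add_noise` from the forward-process example above; `model` and `cond` are placeholders for the U-Net and text conditioning, not a real API:

```python
import torch
import torch.nn.functional as F

def training_loss(model, x0, cond):
    t = torch.randint(0, T, (1,)).item()   # uniform random timestep
    xt, eps = add_noise(x0, t)             # noise x0 in one jump
    eps_pred = model(xt, t, cond)          # model predicts the added noise
    return F.mse_loss(eps_pred, eps)       # DDPM's simplified loss
```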
Drag the slider to move through the diffusion timesteps. At t=T (right), you see pure noise. As t→0 (left), structure emerges from the chaos.
Simulated 2D denoising of a geometric pattern
Click each stage to explore how text becomes image.
Your prompt is tokenized and passed through CLIP's transformer (ViT-L/14 in SD 1.x, OpenCLIP ViT-H in SD 2.x, dual CLIP+OpenCLIP in SDXL). The output is a sequence of 77 token embeddings of dimension 768 (or 1024/2048 for later versions). These embeddings serve as the conditioning signal — injected into the U-Net via cross-attention at multiple resolution levels. Negative prompts create a second embedding set used as the unconditional branch in classifier-free guidance.
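A minimal encoding sketch with Hugging Face transformers, using the ViT-L/14 checkpoint that SD 1.x ships with:

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    "a photo of an astronaut riding a horse",
    padding="max_length", max_length=77, truncation=True,
    return_tensors="pt",
)
emb = encoder(tokens.input_ids).last_hidden_state
print(emb.shape)   # torch.Size([1, 77, 768])
```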
A tensor of shape [1, 4, 64, 64] is filled with samples from N(0, 1). This is your starting point — pure Gaussian noise in latent space. The seed controls the RNG state, so the same seed + prompt + settings = deterministic output. Some techniques like img2img start from a partially noised version of an encoded input image instead, controlling the strength of transformation.
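As a sketch (the seed value is arbitrary):

```python
import torch

generator = torch.Generator().manual_seed(42)   # the seed fixes the RNG state
latents = torch.randn(1, 4, 64, 64, generator=generator)
```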
The U-Net predicts the noise component ε at each timestep. Architecture: encoder–decoder with skip connections, ResNet blocks, and cross-attention layers. The model takes three inputs: the noisy latent z_t, the timestep t (sinusoidal embedding), and the text conditioning c. Self-attention captures spatial relationships; cross-attention attends to the text embeddings. In SDXL, the U-Net has ~2.6B parameters with additional micro-conditioning for resolution and crop coordinates.
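A single-prediction sketch with diffusers' `UNet2DConditionModel`; the checkpoint ID and the random stand-ins for z_t and the text embeddings are illustrative:

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)
latents = torch.randn(1, 4, 64, 64)     # z_t
text_emb = torch.randn(1, 77, 768)      # stand-in for CLIP output
with torch.no_grad():
    noise_pred = unet(latents, timestep=500,
                      encoder_hidden_states=text_emb).sample
print(noise_pred.shape)                 # torch.Size([1, 4, 64, 64])
```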
The scheduler orchestrates the step-by-step denoising. It defines the noise schedule (β values across timesteps) and computes the updated latent after each U-Net prediction. Common schedulers: DDPM (1000 steps, slow), DDIM (deterministic, skip steps), Euler/Euler Ancestral (fast, simple), DPM++ 2M Karras (high quality at 20-30 steps), UniPC (fastest convergence). The scheduler is independent of the trained model — you can swap them freely at inference.
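A sketch of the loop, using diffusers' `DPMSolverMultistepScheduler` as one example; `predict_noise` is a placeholder for the U-Net plus classifier-free guidance from the earlier sketches:

```python
import torch
from diffusers import DPMSolverMultistepScheduler

scheduler = DPMSolverMultistepScheduler.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="scheduler"
)
scheduler.set_timesteps(25)   # 25 inference steps instead of 1000

latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma
for t in scheduler.timesteps:
    latent_in = scheduler.scale_model_input(latents, t)
    noise_pred = predict_noise(latent_in, t)   # placeholder: U-Net + CFG
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```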
The denoised latent z_0 (shape [1,4,64,64]) is passed through the VAE decoder to produce a [1,3,512,512] pixel image. The decoder is a CNN with upsampling layers, trained alongside the encoder to minimize reconstruction loss + KL divergence. The VAE is the bottleneck for fine detail — which is why community "fine-tuned VAEs" exist to improve faces and text rendering. The decode step happens once at the very end.
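A decode sketch, assuming SD 1.x's latent scaling factor (0.18215) and using `stabilityai/sd-vae-ft-mse` as one example of a community fine-tuned VAE; `latents` is the final z_0 from the sampling loop:

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor).sample
image = (image / 2 + 0.5).clamp(0, 1)   # map [-1, 1] to [0, 1]
```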
Drag across to see the transition from pure noise to structured output.
Different samplers trade off speed, quality, and determinism. Click to see how each one converges.
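Because the scheduler is independent of the trained weights, swapping samplers in diffusers is a one-liner; the pipeline checkpoint here is illustrative:

```python
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
```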
Ho et al.'s DDPM paper showed diffusion models can match GANs in image quality. 1000 steps, pixel space, very slow.
OpenAI's CLIP bridged text and images, enabling the conditioning mechanism used in text-to-image generation.
Rombach et al. (CompVis) moved diffusion to latent space via a pretrained autoencoder, making training and inference dramatically more efficient.
Stability AI open-sourced the weights. First high-quality text-to-image model accessible to everyone. Trained on LAION-5B subset.
Switched to OpenCLIP ViT-H. Higher base resolution (768px), depth-to-image, inpainting model. Community reception was mixed due to aggressive NSFW filtering of the training data and the resulting style changes.
Dual text encoders (CLIP + OpenCLIP), 2.6B param U-Net, native 1024px generation, refiner model. Major quality leap.
Replaced U-Net with a Multimodal Diffusion Transformer (MMDiT). Triple text encoder (CLIP × 2 + T5-XXL). Flow matching instead of ε-prediction.