An interactive exploration of how Stable Diffusion transforms random noise into stunning images through iterative denoising in latent space.
Stable Diffusion is a latent text-to-image diffusion model. Unlike earlier diffusion models that operated on full-resolution pixel space, it compresses images into a lower-dimensional latent space using a Variational Autoencoder (VAE), then learns to denoise in that compressed representation — making it fast enough to run on consumer GPUs.
Instead of 512×512×3 pixel space, Stable Diffusion works in 64×64×4 latent space — a 48× compression. The VAE encoder/decoder handles the translation.
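Here is a shape round-trip sketch using diffusers' `AutoencoderKL`. The `stabilityai/sd-vae-ft-mse` checkpoint is one public example, and the input here is random noise standing in for a real image:

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

image = torch.randn(1, 3, 512, 512)   # stand-in for a real 512x512 RGB image
latent = vae.encode(image).latent_dist.sample()
print(latent.shape)                   # torch.Size([1, 4, 64, 64])

recon = vae.decode(latent).sample
print(recon.shape)                    # torch.Size([1, 3, 512, 512])
```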
The forward process gradually adds Gaussian noise over T timesteps. The model learns to reverse this — predicting the noise to subtract at each step.
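The forward process has a convenient closed form: x_t = √(ᾱ_t)·x_0 + √(1−ᾱ_t)·ε, where ᾱ_t is the cumulative product of (1−β). A minimal sketch, assuming DDPM's default linear β schedule (exact values vary by model):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # noise schedule beta_1..beta_T
alphas_bar = torch.cumprod(1.0 - betas, dim=0) # cumulative signal retention

def add_noise(x0, t):
    """Jump straight from clean x0 to noisy x_t at timestep t."""
    eps = torch.randn_like(x0)
    a = alphas_bar[t]
    xt = a.sqrt() * x0 + (1 - a).sqrt() * eps
    return xt, eps                             # eps is the training target

x0 = torch.randn(1, 4, 64, 64)                 # pretend latent
xt, eps = add_noise(x0, t=500)
```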
CLIP's text encoder transforms prompts into embeddings. Cross-attention layers in the U-Net attend to these embeddings, steering the denoising toward the described scene.
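A toy single-head cross-attention step shows the mechanism: queries come from the flattened spatial latent, keys and values from the text embeddings. Dimensions are illustrative (320 channels is one U-Net level in SD 1.x; 768 is CLIP's width):

```python
import torch
import torch.nn.functional as F

d = 320                                 # channel dim at one U-Net level
spatial = torch.randn(1, 64 * 64, d)    # flattened image features (4096 tokens)
text = torch.randn(1, 77, 768)          # CLIP token embeddings

to_q = torch.nn.Linear(d, d, bias=False)
to_k = torch.nn.Linear(768, d, bias=False)
to_v = torch.nn.Linear(768, d, bias=False)

q, k, v = to_q(spatial), to_k(text), to_v(text)
out = F.scaled_dot_product_attention(q, k, v)   # [1, 4096, 320]
```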
At inference, the model runs both conditional and unconditional predictions. The guided output = unconditional + scale × (conditional − unconditional), amplifying prompt adherence.
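As a sketch of that formula in code: the `unet` call mirrors diffusers' `UNet2DConditionModel`, but the helper function and its argument names are illustrative:

```python
import torch

def guided_noise(unet, z_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Classifier-free guidance: run both branches in one batched pass."""
    emb = torch.cat([uncond_emb, cond_emb])
    eps = unet(torch.cat([z_t, z_t]), t, encoder_hidden_states=emb).sample
    eps_uncond, eps_cond = eps.chunk(2)
    # guided = unconditional + scale * (conditional - unconditional)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

At scale 1 this collapses to the plain conditional prediction; typical values sit around 7 to 8.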
The core insight: destroying structure is easy (add noise), learning to restore it is the hard part. The model is trained to predict ε — the noise that was added at each timestep.
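In code, the objective is just an MSE between the true and predicted noise. A sketch reusing `T` and `add_noise` from the forward-process example above; `model` and `cond` are placeholders for the U-Net and text conditioning, not a real API:

```python
import torch
import torch.nn.functional as F

def training_loss(model, x0, cond):
    t = torch.randint(0, T, (1,)).item()   # uniform random timestep
    xt, eps = add_noise(x0, t)             # noise x0 in one jump
    eps_pred = model(xt, t, cond)          # model predicts the added noise
    return F.mse_loss(eps_pred, eps)       # DDPM's simplified loss
```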
Drag the slider to move through the diffusion timesteps. At t=T (right), you see pure noise. As t→0 (left), structure emerges from the chaos.
Simulated 2D denoising of a geometric pattern
Click each stage to explore how text becomes image.
Your prompt is tokenized and passed through CLIP's transformer (ViT-L/14 in SD 1.x, OpenCLIP ViT-H in SD 2.x, dual CLIP+OpenCLIP in SDXL). The output is a sequence of 77 token embeddings of dimension 768 (or 1024/2048 for later versions). These embeddings serve as the conditioning signal — injected into the U-Net via cross-attention at multiple resolution levels. Negative prompts create a second embedding set used as the unconditional branch in classifier-free guidance.
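A minimal encoding sketch with Hugging Face transformers, using the ViT-L/14 checkpoint that SD 1.x ships with:

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    "a photo of an astronaut riding a horse",
    padding="max_length", max_length=77, truncation=True,
    return_tensors="pt",
)
emb = encoder(tokens.input_ids).last_hidden_state
print(emb.shape)   # torch.Size([1, 77, 768])
```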
A tensor of shape [1, 4, 64, 64] is filled with samples from N(0, 1). This is your starting point — pure Gaussian noise in latent space. The seed controls the RNG state, so the same seed + prompt + settings = deterministic output. Some techniques like img2img start from a partially noised version of an encoded input image instead, controlling the strength of transformation.
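As a sketch (the seed value is arbitrary):

```python
import torch

generator = torch.Generator().manual_seed(42)   # the seed fixes the RNG state
latents = torch.randn(1, 4, 64, 64, generator=generator)
```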
The U-Net predicts the noise component ε at each timestep. Architecture: encoder–decoder with skip connections, ResNet blocks, and cross-attention layers. The model takes three inputs: the noisy latent z_t, the timestep t (sinusoidal embedding), and the text conditioning c. Self-attention captures spatial relationships; cross-attention attends to the text embeddings. In SDXL, the U-Net has ~2.6B parameters with additional micro-conditioning for resolution and crop coordinates.
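A single-prediction sketch with diffusers' `UNet2DConditionModel`; the checkpoint ID and the random stand-ins for z_t and the text embeddings are illustrative:

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)
latents = torch.randn(1, 4, 64, 64)     # z_t
text_emb = torch.randn(1, 77, 768)      # stand-in for CLIP output
with torch.no_grad():
    noise_pred = unet(latents, timestep=500,
                      encoder_hidden_states=text_emb).sample
print(noise_pred.shape)                 # torch.Size([1, 4, 64, 64])
```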
The scheduler orchestrates the step-by-step denoising. It defines the noise schedule (β values across timesteps) and computes the updated latent after each U-Net prediction. Common schedulers: DDPM (1000 steps, slow), DDIM (deterministic, skip steps), Euler/Euler Ancestral (fast, simple), DPM++ 2M Karras (high quality at 20-30 steps), UniPC (fastest convergence). The scheduler is independent of the trained model — you can swap them freely at inference.
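A sketch of the loop, using diffusers' `DPMSolverMultistepScheduler` as one example; `predict_noise` is a placeholder for the U-Net plus classifier-free guidance from the earlier sketches:

```python
import torch
from diffusers import DPMSolverMultistepScheduler

scheduler = DPMSolverMultistepScheduler.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="scheduler"
)
scheduler.set_timesteps(25)   # 25 inference steps instead of 1000

latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma
for t in scheduler.timesteps:
    latent_in = scheduler.scale_model_input(latents, t)
    noise_pred = predict_noise(latent_in, t)   # placeholder: U-Net + CFG
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```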
The denoised latent z_0 (shape [1,4,64,64]) is passed through the VAE decoder to produce a [1,3,512,512] pixel image. The decoder is a CNN with upsampling layers, trained alongside the encoder to minimize reconstruction loss + KL divergence. The VAE is the bottleneck for fine detail — which is why community "fine-tuned VAEs" exist to improve faces and text rendering. The decode step happens once at the very end.
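A decode sketch, assuming SD 1.x's latent scaling factor (0.18215) and using `stabilityai/sd-vae-ft-mse` as one example of a community fine-tuned VAE; `latents` is the final z_0 from the sampling loop:

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor).sample
image = (image / 2 + 0.5).clamp(0, 1)   # map [-1, 1] to [0, 1]
```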
Drag across to see the transition from pure noise to structured output.
Different samplers trade off speed, quality, and determinism. Click to see how each one converges.
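Because the scheduler is independent of the trained weights, swapping samplers in diffusers is a one-liner; the pipeline checkpoint here is illustrative:

```python
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
```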
Ho et al.'s DDPM paper showed diffusion models can match GANs in image quality. 1000 steps, pixel space, very slow.
OpenAI's CLIP bridged text and images, enabling the conditioning mechanism used in text-to-image generation.
Rombach et al. (CompVis) moved diffusion to latent space via a pretrained autoencoder, making training and inference dramatically more efficient.
Stability AI open-sourced the weights. First high-quality text-to-image model accessible to everyone. Trained on LAION-5B subset.
Switched to OpenCLIP ViT-H. Higher base resolution (768px), depth-to-image, inpainting model. Community reception was mixed due to aggressive NSFW filtering of the training data and the resulting style changes.
Dual text encoders (CLIP + OpenCLIP), 2.6B param U-Net, native 1024px generation, refiner model. Major quality leap.
Replaced U-Net with a Multimodal Diffusion Transformer (MMDiT). Triple text encoder (CLIP × 2 + T5-XXL). Flow matching instead of ε-prediction.