The Art of
Controlled Noise

An interactive exploration of how Stable Diffusion transforms random noise into stunning images through iterative denoising in latent space.


What is Stable Diffusion?

Stable Diffusion is a latent text-to-image diffusion model. Unlike earlier diffusion models that operated on full-resolution pixel space, it compresses images into a lower-dimensional latent space using a Variational Autoencoder, then learns to denoise in that compressed representation — making it fast enough to run on consumer GPUs.

🧊

Latent Space

Instead of 512×512×3 pixel space, Stable Diffusion works in 64×64×4 latent space — a 48× compression. The VAE encoder/decoder handles the translation.
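The 48× figure follows directly from the tensor sizes; a quick sanity check:

```python
# Arithmetic behind the 48x compression claim: ratio of element counts
# between the RGB pixel tensor and the SD 1.x latent tensor.
pixel_elems = 512 * 512 * 3    # 786,432 values per image
latent_elems = 64 * 64 * 4     # 16,384 values per latent
ratio = pixel_elems // latent_elems
print(ratio)  # 48
```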

🔊

Diffusion Process

The forward process gradually adds Gaussian noise over T timesteps. The model learns to reverse this — predicting the noise to subtract at each step.

📝

Text Conditioning

CLIP's text encoder transforms prompts into embeddings. Cross-attention layers in the U-Net attend to these embeddings, steering the denoising toward the described scene.

Classifier-Free Guidance

At inference, the model runs both conditional and unconditional predictions. The guided output = unconditional + scale × (conditional − unconditional), amplifying prompt adherence.

Forward & Reverse Process

The core insight: destroying structure is easy (add noise), learning to restore it is the hard part. The model is trained to predict ε — the noise that was added at each timestep.

Forward: q(x_t | x_{t-1}) = N(x_t; √(1-β_t) · x_{t-1}, β_t · I)
Reverse: p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))
Loss: L = E[‖ε − ε_θ(x_t, t)‖²] — simple MSE on noise prediction
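Composing the forward step over t timesteps gives a useful closed form: x_t = √(ᾱ_t)·x_0 + √(1−ᾱ_t)·ε, where ᾱ_t = ∏(1−β_s). The sketch below assumes the linear β schedule from the DDPM paper (10⁻⁴ to 0.02 over 1000 steps) and plain Python lists in place of tensors:

```python
import math
import random

# Hedged sketch of the closed-form forward process:
# x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,
# with a linear beta schedule as in the DDPM paper.

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

alpha_bar = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bar.append(prod)  # running product of (1 - beta)

def noise_to_t(x0, t, rng):
    """Jump straight to timestep t: scale the signal, add scaled noise."""
    a = alpha_bar[t]
    return [math.sqrt(a) * x + math.sqrt(1 - a) * rng.gauss(0, 1) for x in x0]

rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]
xt = noise_to_t(x0, T - 1, rng)
print(alpha_bar[-1])  # ~4e-5: at t=T almost no signal remains
```

This closed form is what makes training tractable: the model can be shown a random timestep directly, without simulating all the intermediate noising steps.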

Watch Denoising Happen

Drag the slider to move through the diffusion timesteps. At t=T (right), you see pure noise. As t→0 (left), structure emerges from the chaos.

Latent Denoising Visualization

Simulated 2D denoising of a geometric pattern

t = 1000

The Pipeline

Click each stage to explore how text becomes image.

📝
Stage 01
Text Encoder
🎲
Stage 02
Noise Init
🧠
Stage 03
U-Net Denoise
📐
Stage 04
Scheduler
🖼️
Stage 05
VAE Decode

CLIP Text Encoder

Your prompt is tokenized and passed through CLIP's transformer (ViT-L/14 in SD 1.x, OpenCLIP ViT-H in SD 2.x, dual CLIP+OpenCLIP in SDXL). The output is a sequence of 77 token embeddings of dimension 768 (or 1024/2048 for later versions). These embeddings serve as the conditioning signal — injected into the U-Net via cross-attention at multiple resolution levels. Negative prompts create a second embedding set used as the unconditional branch in classifier-free guidance.

Latent Noise Initialization

A tensor of shape [1, 4, 64, 64] is sampled from N(0,1). This is your starting point — pure Gaussian noise in latent space. The seed controls the RNG state, so the same seed + prompt + settings yields a deterministic output. Techniques like img2img instead start from a partially noised encoding of an input image; the noising strength controls how far the result departs from the original.
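Seed determinism can be sketched with the standard library alone. This is not the pipeline's actual sampling code (real implementations draw tensors with a framework RNG); it just demonstrates the seed → identical noise property and the SD 1.x latent element count:

```python
import random

# Hedged sketch: a fixed seed makes the Gaussian draw reproducible,
# which is why seed + prompt + settings pins down the output image.
# Shape is the SD 1.x latent layout [batch=1, channels=4, 64, 64].

def init_latent(seed, shape=(1, 4, 64, 64)):
    rng = random.Random(seed)
    n = 1
    for d in shape:
        n *= d
    return [rng.gauss(0, 1) for _ in range(n)]  # flat list; a framework would reshape

a = init_latent(42)
b = init_latent(42)
assert a == b            # same seed -> identical starting noise
assert len(a) == 16384   # 1 * 4 * 64 * 64 latent elements
```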

U-Net Denoising

The U-Net predicts the noise component ε at each timestep. Architecture: encoder–decoder with skip connections, ResNet blocks, and cross-attention layers. The model takes three inputs: the noisy latent z_t, the timestep t (sinusoidal embedding), and the text conditioning c. Self-attention captures spatial relationships; cross-attention attends to the text embeddings. In SDXL, the U-Net has ~2.6B parameters with additional micro-conditioning for resolution and crop coordinates.
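The sinusoidal timestep embedding mentioned above can be sketched in a few lines. This is the Transformer-style encoding; `dim=8` is purely illustrative (real models use a few hundred dimensions and pass the result through a small MLP before it reaches the ResNet blocks):

```python
import math

# Hedged sketch of a sinusoidal timestep embedding: a bank of sin/cos
# features at geometrically spaced frequencies, so every timestep t
# gets a distinct, smoothly varying vector.

def timestep_embedding(t, dim=8, max_period=10000.0):
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = timestep_embedding(500)
assert len(emb) == 8
# t = 0 gives all-zero sines and all-one cosines
assert timestep_embedding(0) == [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
```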

Noise Scheduler

The scheduler orchestrates the step-by-step denoising. It defines the noise schedule (β values across timesteps) and computes the updated latent after each U-Net prediction. Common schedulers: DDPM (1000 steps, slow), DDIM (deterministic, skip steps), Euler/Euler Ancestral (fast, simple), DPM++ 2M Karras (high quality at 20-30 steps), UniPC (fastest convergence). The scheduler is independent of the trained model — you can swap them freely at inference.
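A single deterministic DDIM update illustrates what a scheduler step does with the U-Net's output. The sketch below assumes scalar lists in place of tensors and the η = 0 (fully deterministic) variant: the noise prediction is first converted into an x₀ estimate, then re-noised to the earlier timestep — which is exactly what lets DDIM skip timesteps.

```python
import math

# Hedged sketch of one deterministic DDIM step (eta = 0):
#   x0_pred = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)
#   x_prev  = sqrt(abar_prev) * x0_pred + sqrt(1 - abar_prev) * eps

def ddim_step(x_t, eps, abar_t, abar_prev):
    x0_pred = [(x - math.sqrt(1 - abar_t) * e) / math.sqrt(abar_t)
               for x, e in zip(x_t, eps)]
    return [math.sqrt(abar_prev) * x0 + math.sqrt(1 - abar_prev) * e
            for x0, e in zip(x0_pred, eps)]

# With a perfect noise prediction, stepping to abar_prev = 1 recovers the signal.
x0 = [1.0, -0.5]
eps = [0.3, 0.2]
abar_t = 0.5
x_t = [math.sqrt(abar_t) * x + math.sqrt(1 - abar_t) * e for x, e in zip(x0, eps)]
out = ddim_step(x_t, eps, abar_t, abar_prev=1.0)
assert all(abs(o - x) < 1e-9 for o, x in zip(out, x0))
```

Because the step only needs ᾱ at the two endpoints, nothing forces consecutive timesteps — hence 20–50 step sampling from a model trained on 1000.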

VAE Decoder

The denoised latent z_0 (shape [1,4,64,64]) is passed through the VAE decoder to produce a [1,3,512,512] pixel image. The decoder is a CNN with upsampling layers, trained alongside the encoder with reconstruction and perceptual losses plus a small KL regularization term. The VAE is the bottleneck for fine detail — which is why community "fine-tuned VAEs" exist to improve faces and text rendering. The decode step happens once, at the very end.

Noise vs Signal

Drag across to see the transition from pure noise to structured output.

← Pure Noise (t=T) Denoised Output (t=0) →

Exploring Samplers

Different samplers trade off speed, quality, and determinism. Click to see how each one converges.

Euler
Fast / Simple
Euler A
Stochastic
DPM++ 2M
High Quality
DDIM
Deterministic
LMS
Multi-step
Heun
Accurate / 2× cost
Euler — First-order ODE solver. Takes a single function evaluation per step along the probability flow ODE. Fast and often "good enough" at 20-30 steps. Can produce artifacts at very low step counts. The simplest possible sampler: each iteration takes one straight-line step along the estimated denoising trajectory.
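One Euler step is short enough to write out. The sketch below uses the sigma parameterization common in k-diffusion-style implementations (an assumption, not the only formulation): the derivative d = (x − denoised) / σ points along the probability flow ODE, and the update is plain first-order Euler toward the next sigma.

```python
# Hedged sketch of one Euler sampler step in the sigma parameterization.
# `denoised` stands in for the model's clean-image estimate at this sigma.

def euler_step(x, denoised, sigma, sigma_next):
    d = [(xi - di) / sigma for xi, di in zip(x, denoised)]   # ODE derivative
    return [xi + (sigma_next - sigma) * di for xi, di in zip(x, d)]

x = [2.0, -1.0]
denoised = [0.5, 0.5]
out = euler_step(x, denoised, sigma=1.0, sigma_next=0.0)
assert out == denoised  # one full step to sigma = 0 lands on the estimate
```

Higher-order samplers like Heun or DPM++ refine this by evaluating the derivative more than once per step (or reusing past evaluations), trading function calls for accuracy.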

A Brief History

2020

DDPM — Denoising Diffusion Probabilistic Models

Ho et al. showed diffusion models can match GANs in image quality. 1000 steps, pixel space, very slow.

2021

CLIP — Contrastive Language-Image Pretraining

OpenAI's CLIP bridged text and images, enabling the conditioning mechanism used in text-to-image generation.

2021

Latent Diffusion — The Key Insight

Rombach et al. (CompVis) moved diffusion to latent space via a pretrained autoencoder, making training and inference dramatically more efficient.

August 2022

Stable Diffusion 1.4 Released

Stability AI open-sourced the weights, making high-quality text-to-image generation broadly accessible for the first time. Trained on an aesthetics-filtered subset of LAION-5B.

November 2022

Stable Diffusion 2.0

Switched to OpenCLIP ViT-H, added a 768px model variant, depth-to-image, and a dedicated inpainting model. Community reception was mixed due to aggressive dataset filtering and the resulting style changes.

July 2023

SDXL 1.0

Dual text encoders (CLIP + OpenCLIP), 2.6B param U-Net, native 1024px generation, refiner model. Major quality leap.

February 2024

Stable Diffusion 3

Replaced the U-Net with a Multimodal Diffusion Transformer (MMDiT). Triple text encoder (CLIP × 2 + T5-XXL). Trained with flow matching (rectified flow) instead of ε-prediction; open weights followed later in 2024.