How do diffusion models generate images?
Core Mechanism
System Pipeline (ASCII Diagram)
TRAINING PHASE
══════════════════════════════════════════════════════════════════
      ▲ Variational Lower Bound Reweighting
      │
┌─────┴──────────────────────────────────────────────────┐
│ Neural Network (Score / Noise Estimator)               │
│ Learns p_θ(xₜ₋₁|xₜ) at each step                       │
└────────────────────────────────────────────────────────┘
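As a rough illustration of the training phase above, the following Python/NumPy sketch shows a DDPM-style forward noising step and the noise-prediction loss that such an estimator is commonly trained on. The linear beta schedule, its endpoint values, and the names `make_schedule`, `q_sample`, and `noise_prediction_loss` are assumptions chosen for illustration and do not come from the cited sources. The plain mean-squared error on the noise is one widely used reweighted form of the variational lower bound.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule; returns betas and cumulative alpha products."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Forward process: draw x_t given x_0 in one step, and return the noise used."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def noise_prediction_loss(predict_noise, x0, t, alpha_bars, rng):
    """Mean-squared error between the predicted and the true injected noise."""
    x_t, eps = q_sample(x0, t, alpha_bars, rng)
    return float(np.mean((predict_noise(x_t, t) - eps) ** 2))
```

Here `predict_noise` stands in for the neural network: any callable that takes `(x_t, t)` and returns an array shaped like `x_t`. In practice the loss would be averaged over random timesteps and training images and minimized by gradient descent.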
GENERATION PHASE (Reverse Process)
══════════════════════════════════════════════════════════════════

WHAT EMERGES AT EACH STAGE:
────────────────────────────────────────────────────────────────
Early steps (t near T)           Late steps (t near 0)
┌──────────────────────┐         ┌──────────────────────┐
│ High-variance scene  │  ────►  │ Low-variance fine    │
│ features: layout,    │         │ details, textures,   │
│ global structure     │         │ sharpness            │
│ ("outline first")    │         │ ("details later")    │
└──────────────────────┘         └──────────────────────┘

TRAJECTORY GEOMETRY (per Wang & Vastola):
────────────────────────────────────────────────────────────────
 Image Manifold
      │
┌─────▼────────────────────────────────────────────┐
│                                                  │
│  xT ──(rotation)──► x_mid ──(rotation)──► x₀     │
│                                                  │
│  Trajectories are LOW-DIMENSIONAL and            │
│  resemble 2D ROTATIONS toward a target           │
└──────────────────────────────────────────────────┘
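The reverse process in the generation phase above can be sketched as a loop that starts from pure Gaussian noise and repeatedly subtracts the predicted noise. This is a minimal DDPM-style ancestral sampler written under assumptions: the per-step variance is set to `betas[t]` (one common choice), and `point_mass_oracle` is a hypothetical stand-in for a trained network that returns the exact noise when every training image equals a single `target`. Neither function is taken from the cited sources.

```python
import numpy as np

def ddpm_sample(predict_noise, shape, betas, rng):
    """Run the reverse process from x_T ~ N(0, I) down to x_0."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # x_T: pure Gaussian noise
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = predict_noise(x, t)
        # Mean of p(x_{t-1} | x_t) implied by the noise estimate
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # Fresh noise is added at every step except the last one
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

def point_mass_oracle(target, betas):
    """Hypothetical exact noise predictor for data concentrated at `target`."""
    alpha_bars = np.cumprod(1.0 - betas)
    def predict_noise(x_t, t):
        return (x_t - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])
    return predict_noise
```

With the oracle, the sampler lands on `target` regardless of the starting noise, which makes the loop easy to sanity-check before swapping in a learned estimator.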
OPTIONAL: TEXT-CONDITIONED GENERATION (T2I)
══════════════════════════════════════════════════════════════════
  Text Prompt
  ┌──────────┐
  │ "a cat   │
  │  on      │───────────────────────────────────┐
  │  a mat"  │                                   ▼
  └──────────┘                      ┌─────────────────────────┐
                                    │ Conditioned Denoising   │
  Pure Noise                        │ Process (novel          │
  ┌──────────┐                      │ conditions injected     │
  │  xT ~ N  │────────────────────► │ into denoising steps)   │
  └──────────┘                      └────────────┬────────────┘
                                                 │
                                                 ▼
                                           ┌──────────┐
                                           │ Generated│
                                           │  Image   │
                                           └──────────┘
Key Properties of the Generation Process
Wang & Vastola (2023) identified three core properties of the reverse diffusion process across multiple pretrained models (including latent-space models like Stable Diffusion):
- Low-dimensional trajectories: Individual generation trajectories tend to be low-dimensional and resemble 2D rotations.
- Coarse-to-fine generation: High-variance scene features like layout emerge earlier in the reverse process, while low-variance fine details emerge later, an "outline first, details later" pattern.
- Early perturbation sensitivity: Perturbations applied early in the reverse process have a greater impact on final image content than later ones.
Wang & Vastola (2023) further derive a closed-form solution to the probability flow ODE for a Gaussian distribution, showing the reverse diffusion state rotates toward a gradually-specified target on the image manifold. They note this solution can in principle be used to make generation more efficient by skipping reverse diffusion steps.
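To make the idea of a closed-form solution concrete, here is a small Python/NumPy sketch for Gaussian data with mean `mu` and covariance `U diag(lam) U^T`. It assumes a variance-exploding parameterization in which the noisy marginal at noise level `sigma` is `N(mu, cov + sigma^2 I)` and the probability flow ODE is `dx/dsigma = sigma * (cov + sigma^2 I)^{-1} (x - mu)`. That parameterization, and the function names, are assumptions made here and may differ from the one used in the paper, so this sketch does not reproduce the rotation picture itself. It only illustrates that the state at any noise level can be computed in one step, and that high-variance directions settle first.

```python
import numpy as np

def gaussian_pf_ode_state(x_start, sigma_start, sigma, mu, U, lam):
    """State at noise level `sigma`, starting from `x_start` at `sigma_start`.

    U is an orthogonal matrix of covariance eigenvectors (columns) and
    lam holds the matching positive eigenvalues (data variances).
    """
    c_start = U.T @ (x_start - mu)  # coordinates along each eigenvector
    scale = np.sqrt((lam + sigma**2) / (lam + sigma_start**2))
    return mu + U @ (scale * c_start)

def remaining_scale(lam, sigma):
    """Ratio of a coordinate at noise level `sigma` to its final (sigma = 0) value."""
    return np.sqrt(1.0 + sigma**2 / lam)
```

Because `gaussian_pf_ode_state` jumps directly from `sigma_start` to any `sigma`, intermediate steps can be skipped in this idealized setting. At `sigma = 1`, `remaining_scale(100.0, 1.0)` is about 1.005 while `remaining_scale(0.01, 1.0)` is about 10.05: the high-variance direction is already close to its final value while the low-variance direction is still far from it, consistent with the coarse-to-fine pattern described above.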
Conditional Extensions
Cao et al. (2024) survey how text-to-image diffusion models extend the base mechanism so that novel conditions (beyond text) can be introduced into the denoising process, acknowledging that text conditioning alone does not fully cater to the varied requirements of different applications.
Dennis et al. (2025) note that the physics-inspired family of generative models, which includes denoising diffusion probabilistic models, score-based diffusion models, and Poisson flow generative models, shares an emphasis on accuracy, robustness, and acceleration as active research directions.
Coverage note: The evidence directly supports the forward/reverse process, trajectory geometry, and coarse-to-fine dynamics. Architectural internals of the neural network (e.g., U-Net structure, attention layers) are not addressed in the retrieved evidence and cannot be described here.