Diffusion Model

Diffusion models are a class of generative models that learn to reverse a gradual noising process, so that new data samples can be produced starting from random noise.

Diffusion models are inspired by non-equilibrium thermodynamics: data is gradually corrupted with noise until it is indistinguishable from pure random noise, and a model learns to reverse that corruption in order to generate new data. The forward process is fixed rather than learned: it incrementally adds Gaussian noise to an input sample over a series of timesteps, eventually destroying its structure. The model is trained to predict the noise added at each step, which amounts to learning the reverse (denoising) trajectory.
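As a rough illustration of the fixed forward process, the sketch below uses the closed form from Ho et al. (2020), x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, which jumps directly to any timestep without simulating the intermediate ones. The linear schedule from 1e-4 to 0.02 over 1000 steps follows the DDPM paper; the function names and the toy 4x4 input are assumptions made for illustration.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T (linear schedule)."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form; t is a 0-based step index."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

betas = linear_beta_schedule()
# alpha_bar[t] is the cumulative product of (1 - beta): how much signal survives t steps.
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = np.ones((4, 4))                                 # toy stand-in for an image
x_early, _ = forward_noise(x0, 10, alpha_bar, rng)   # still mostly signal
x_late, _ = forward_noise(x0, 999, alpha_bar, rng)   # almost pure Gaussian noise
```

Because alpha_bar shrinks monotonically toward zero, the signal term fades and the noise term dominates as t grows, which is the sense in which the forward process destroys structure.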

The foundational formulation by Sohl-Dickstein et al. (2015) introduced the concept of diffusion probabilistic models. However, the approach gained widespread attention with the work of Ho et al. (2020) on Denoising Diffusion Probabilistic Models (DDPM), which simplified the training objective and demonstrated state-of-the-art image generation quality. In DDPM, the model is trained to predict the noise component of a noisy image at a given timestep, using a simple mean-squared error loss. Sampling involves starting from pure noise and iteratively applying the learned denoising steps.
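The simplified DDPM objective can be sketched as follows: pick a random timestep per sample, noise the clean data to that timestep, and score the predicted noise against the true noise with mean-squared error. This is a minimal NumPy sketch, not a training loop; `predict_noise` is a hypothetical callable standing in for the neural network, and `zero_predictor` is a deliberately useless placeholder used only to exercise the function.

```python
import numpy as np

def ddpm_loss(predict_noise, x0_batch, alpha_bar, rng):
    """Monte Carlo estimate of the simplified DDPM loss for one batch."""
    batch_size = x0_batch.shape[0]
    # One uniformly random timestep (0-based index) per sample.
    t = rng.integers(0, len(alpha_bar), size=batch_size)
    eps = rng.standard_normal(x0_batch.shape)
    # Reshape so alpha_bar[t] broadcasts over the non-batch dimensions.
    ab = alpha_bar[t].reshape((-1,) + (1,) * (x0_batch.ndim - 1))
    x_t = np.sqrt(ab) * x0_batch + np.sqrt(1.0 - ab) * eps
    eps_hat = predict_noise(x_t, t)
    return np.mean((eps_hat - eps) ** 2)

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0_batch = rng.standard_normal((64, 8, 8))  # stand-in for a batch of images

# Hypothetical placeholder for a network: always predicts zero noise.
def zero_predictor(x_t, t):
    return np.zeros_like(x_t)

loss = ddpm_loss(zero_predictor, x0_batch, alpha_bar, rng)
```

With the zero predictor the loss comes out close to 1, the variance of the standard Gaussian noise. A trained network would have to push the loss below that baseline.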

A key advantage of diffusion models is their training stability compared to generative adversarial networks (GANs), as they do not require adversarial training. They also offer high sample quality and good mode coverage. However, sampling is typically slower than with GANs because it requires many sequential denoising steps. Later work has reduced this cost: denoising diffusion implicit models (DDIM) sample with far fewer steps, and latent diffusion models run the diffusion process in a compressed latent space. These advances have enabled practical applications such as text-to-image generation (e.g., DALL-E 2, Stable Diffusion).
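To show how a DDIM-style sampler shortens generation, the sketch below runs the deterministic (eta = 0) update on a strided subset of 50 out of 1000 timesteps. It is an illustration under stated assumptions rather than a reference implementation: `predict_noise` is a hypothetical stand-in for a trained network, and the `oracle` predictor is exact only because the toy "dataset" here is a single known point, `target`.

```python
import numpy as np

def ddim_sample(predict_noise, shape, alpha_bar, num_steps, rng):
    """Deterministic DDIM-style sampling (eta = 0) on a strided timestep subset."""
    T = len(alpha_bar)
    # Evenly spaced, descending timestep indices from T-1 down to 0.
    timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for i, t in enumerate(timesteps):
        ab_t = alpha_bar[t]
        # After the last step no noise remains, so treat alpha_bar as 1.
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps_hat = predict_noise(x, t)
        # Current estimate of the clean sample implied by the predicted noise.
        x0_hat = (x - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)
        # Re-noise that estimate to the next (earlier) timestep, reusing eps_hat.
        x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_hat
    return x

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
target = np.full((2, 2), 3.0)  # toy dataset consisting of one point

# Oracle noise predictor, valid only for this single-point toy dataset.
def oracle(x_t, t):
    return (x_t - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])

sample = ddim_sample(oracle, target.shape, alpha_bar, 50, np.random.default_rng(0))
```

The loop calls the predictor 50 times instead of 1000. With a real network, fewer steps generally trade some sample quality for speed.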

Why it matters

Diffusion models have become a cornerstone of modern generative AI, powering many state-of-the-art systems for image, audio, and video synthesis. Their ability to produce high-fidelity, diverse samples without adversarial training makes them particularly valuable for creative tools, scientific simulations, and data augmentation. They also support controllable generation by conditioning on inputs such as text prompts, which extends their use in design, entertainment, and research.

First appeared

Sohl-Dickstein et al. (2015); DDPM: Ho et al. (2020).

FAQ

How does it work?

Diffusion models work by first defining a forward process that gradually adds Gaussian noise to data over many timesteps until it becomes pure noise. A neural network is then trained to reverse this process, predicting the noise added at each step. During generation, the model starts from random noise and iteratively denoises it to produce a new sample.
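The generation loop described above can be sketched as DDPM-style ancestral sampling: at each step, subtract the predicted noise to get the mean of the previous state, then add fresh noise everywhere except the final step. This is a hedged sketch, not a production sampler; `predict_noise` is a hypothetical stand-in for a trained network, the variance choice sigma_t^2 = beta_t is one common option, and the `oracle` predictor is exact only because the toy "dataset" is a single known point, `target`.

```python
import numpy as np

def ddpm_sample(predict_noise, shape, betas, rng):
    """Ancestral sampling: walk from t = T-1 down to 0, denoising one step at a time."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps_hat = predict_noise(x, t)
        # Mean of the reverse step given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        # Fresh noise at every step except the last one.
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise  # assumes sigma_t^2 = beta_t
    return x

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
target = np.full((2, 2), 3.0)  # toy dataset consisting of one point

# Oracle noise predictor, valid only for this single-point toy dataset.
def oracle(x_t, t):
    return (x_t - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])

sample = ddpm_sample(oracle, target.shape, betas, np.random.default_rng(0))
```

Each of the 1000 iterations requires one call to the predictor, which is why sampling cost scales with the number of timesteps.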

How is a diffusion model different from a GAN?

Unlike GANs, which use a generator and discriminator in adversarial training, diffusion models learn a denoising process directly. This makes diffusion models more stable to train and less prone to mode collapse, but sampling is typically slower because it requires many sequential steps. GANs can generate samples in one forward pass but are harder to train.

What are the main limitations of diffusion models?

The primary limitation is slow sampling, because generation requires many iterative denoising steps. This can make real-time generation challenging. Additionally, the models are computationally expensive to train and require large datasets. Recent methods have mitigated some of these issues: DDIM reduces the number of sampling steps, and latent diffusion operates in a compressed latent space, which lowers the cost of each step.