
Diffusion

Understanding the diffusion process will help you use InvokeAI more effectively.

Stable Diffusion works with two representations of an image: pixels in image space and latents in latent space.

Image Space

Represents images as the pixels you look at. This is the final visual output.

Latent Space

Represents images in a compressed form. Stable Diffusion does its processing in latent space.
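As a rough illustration of how much smaller latent space is, the sketch below computes the latent tensor shape for a given image size. It assumes the 8x spatial downsampling and 4 latent channels used by the VAE in Stable Diffusion 1.x; other model families may differ. `latent_shape` and `compression_ratio` are hypothetical helpers for illustration, not InvokeAI functions.

```python
def latent_shape(height: int, width: int, downscale: int = 8, channels: int = 4) -> tuple[int, int, int]:
    """Return (channels, height, width) of the latent for an image of the given size.

    Assumption: an SD 1.x-style VAE with 8x spatial downsampling and 4 latent channels.
    """
    if height % downscale or width % downscale:
        raise ValueError(f"height and width must be multiples of {downscale}")
    return (channels, height // downscale, width // downscale)


def compression_ratio(height: int, width: int) -> float:
    """Number of RGB pixel values divided by the number of latent values."""
    c, h, w = latent_shape(height, width)
    return (3 * height * width) / (c * h * w)


shape = latent_shape(512, 512)       # latent shape for a 512x512 image
ratio = compression_ratio(512, 512)  # how many pixel values per latent value
```

Under these assumptions a 512x512 RGB image is represented by a 4x64x64 latent, which is why denoising in latent space is much cheaper than denoising pixels directly.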

To fully understand the diffusion process, we need to understand a few more terms: U-Net, CLIP, and conditioning.

U-Net

A model trained on a large number of latent images to which known amounts of random noise were added. Given a slightly noisy latent image, the U-Net predicts the pattern of noise that must be subtracted to recover the original.
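The idea of adding a known amount of noise and then subtracting a predicted noise pattern can be sketched in a few lines. This is a minimal illustration that assumes a DDPM-style mixing rule (signal scaled by sqrt(alpha_bar), noise scaled by sqrt(1 - alpha_bar)); the function names and the four-value "latent" are hypothetical, and the noise prediction is simply the true noise, standing in for a perfectly trained U-Net.

```python
import math
import random


def add_noise(clean, noise, alpha_bar):
    """Mix a clean latent with noise; alpha_bar in (0, 1] is the share of signal kept."""
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [s * c + n * e for c, e in zip(clean, noise)]


def remove_noise(noisy, predicted_noise, alpha_bar):
    """Invert add_noise, given a prediction of the noise that was added."""
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - n * e) / s for x, e in zip(noisy, predicted_noise)]


rng = random.Random(0)
clean = [0.5, -1.0, 0.25, 2.0]                     # stand-in for a latent image
noise = [rng.gauss(0.0, 1.0) for _ in clean]       # the known noise used in training
noisy = add_noise(clean, noise, alpha_bar=0.9)     # a "slightly noisy" latent

# An ideal predictor would output exactly `noise`; subtracting it recovers the original.
recovered = remove_noise(noisy, noise, alpha_bar=0.9)
```

A real U-Net only approximates the noise, so in practice the recovery is imperfect and is spread over many small steps.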

CLIP & Conditioning

CLIP is a model that tokenizes and encodes text into conditioning. This conditioning guides the model during the denoising steps to produce a new image.

The U-Net and the conditioning produced by CLIP work together at each denoising step of image generation. The U-Net removes noise so that the result resembles images in its training set, while the conditioning steers the U-Net towards images that best match your prompt.

When you generate an image using text-to-image, multiple steps occur in latent space:

  1. Noise Generation: Random noise is generated at the chosen height and width. The noise’s characteristics are dictated by the seed. This noise tensor is passed into latent space. We’ll call this noise A.
  2. Noise Prediction: Using a model’s U-Net, a noise predictor examines noise A and the conditioning (your prompt, tokenized and encoded by CLIP). It produces its own noise tensor, an estimate of the noise that must be removed to reveal a latent image matching the prompt. We’ll call this noise B.
  3. Subtraction: Noise B is subtracted from noise A in an attempt to create a latent image consistent with the prompt. Steps 2 and 3 are then repeated on the resulting latent for the number of sampler steps chosen.
  4. Decoding: The VAE decodes the final latent image from latent space into image space.
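The four steps above can be sketched as a toy loop. This is not InvokeAI's API and not a real diffusion model: `toy_predict_noise` and `toy_decode` are hypothetical stand-ins for the U-Net and the VAE, the "conditioning" is just a short list of numbers, and a fixed fraction of the predicted noise is removed per step instead of following a real scheduler.

```python
import random


def make_noise(seed, size):
    """Step 1: seeded Gaussian noise ("noise A"). The same seed gives the same noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def toy_predict_noise(latent, conditioning):
    """Step 2: toy stand-in for the U-Net ("noise B").

    Treats everything that differs from the conditioning as noise. A real
    U-Net is a trained neural network, not a subtraction.
    """
    return [x - c for x, c in zip(latent, conditioning)]


def toy_denoise(seed, conditioning, steps, fraction=0.5):
    """Steps 1-3: start from noise, then repeatedly subtract part of the predicted noise."""
    latent = make_noise(seed, len(conditioning))
    for _ in range(steps):
        predicted = toy_predict_noise(latent, conditioning)
        latent = [x - fraction * p for x, p in zip(latent, predicted)]
    return latent


def toy_decode(latent):
    """Step 4: toy stand-in for the VAE decoder, mapping [-1, 1] to 0-255 pixel values."""
    return [max(0, min(255, round(127.5 * (x + 1.0)))) for x in latent]


conditioning = [-1.0, -0.5, 0.2, 1.0]  # hypothetical "encoded prompt"
image = toy_decode(toy_denoise(seed=42, conditioning=conditioning, steps=20))
```

Even in this toy version, the same seed reproduces the same result, a different seed starts from different noise, and more steps leave less residual noise in the final latent.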

Putting it all together

  • A Model provides the CLIP prompt tokenizer, the VAE, and a U-Net (where noise prediction occurs given a prompt and initial noise tensor).
  • A Noise Scheduler (e.g. DPM++ 2M Karras) schedules the subtraction of noise from the latent image across the sampler steps chosen. Less noise is usually subtracted in the later sampler steps.
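To make the scheduler's role concrete, here is a minimal sketch of the noise-level spacing proposed by Karras et al. (2022), which the "Karras" scheduler variants are named after. The default `sigma_min` and `sigma_max` values are assumptions (figures commonly quoted for Stable Diffusion 1.x), `karras_sigmas` is a hypothetical helper rather than an InvokeAI function, and real implementations usually append a final noise level of zero.

```python
def karras_sigmas(steps, sigma_min=0.0292, sigma_max=14.6146, rho=7.0):
    """Noise levels from sigma_max down to sigma_min, spaced as in Karras et al. (2022).

    sigma_min and sigma_max are assumed values; rho=7 is the value suggested in the paper.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    max_inv = sigma_max ** (1.0 / rho)
    min_inv = sigma_min ** (1.0 / rho)
    return [(max_inv + i / (steps - 1) * (min_inv - max_inv)) ** rho for i in range(steps)]


sigmas = karras_sigmas(10)
# Drop in noise level between consecutive steps: large early on, small near the end.
removed_per_step = [a - b for a, b in zip(sigmas, sigmas[1:])]
```

With this spacing, each step lowers the noise level by less than the step before it, so the early steps settle the overall composition and the later steps refine detail.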