Diffusion

Taking the time to understand the diffusion process will help you use InvokeAI more effectively.

Image Space vs. Latent Space

Stable Diffusion works with two representations of an image: pixels in image space and latents in latent space.

Image Space

Image space represents images as pixels. This is the final visual output you see.

Latent Space

Latent space represents images in a compressed form. Stable Diffusion does its processing in latent space.
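To give a feel for how much smaller the latent representation is, here is a minimal sketch. It assumes a Stable Diffusion 1.x-style VAE, which shrinks each side by a factor of 8 and uses 4 latent channels; other model families may use different values. The function names are hypothetical and not part of InvokeAI.

```python
def latent_shape(height, width, downscale=8, channels=4):
    """Return the (channels, height, width) shape of the latent for an image.

    The downscale factor and channel count are assumptions based on
    Stable Diffusion 1.x-style VAEs.
    """
    if height % downscale or width % downscale:
        raise ValueError("height and width must be multiples of the downscale factor")
    return (channels, height // downscale, width // downscale)


def compression_ratio(height, width, downscale=8, channels=4):
    """Compare the number of values in an RGB image with its latent."""
    image_values = 3 * height * width  # three colour channels per pixel
    c, h, w = latent_shape(height, width, downscale, channels)
    return image_values / (c * h * w)


print(latent_shape(512, 512))       # (4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

Under these assumptions, a 512x512 image is processed as a 64x64 latent with roughly 48 times fewer values, which is why working in latent space is so much cheaper than working on pixels.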

Core Components

To fully understand the diffusion process, we need to understand a few more terms: U-Net, CLIP, and conditioning.

U-Net

A model trained on a large number of latent images with known amounts of random noise added. Given a slightly noisy latent image, the U-Net predicts the pattern of noise that must be subtracted from it to recover the original.
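The idea of adding a known amount of noise and then subtracting a predicted noise pattern can be sketched in a few lines. This sketch assumes the common DDPM-style mixing rule, where `alpha_bar` is the fraction of signal kept; the exact formulation depends on the model and scheduler. The "prediction" here is simply the true noise, standing in for a perfect U-Net.

```python
import math
import random


def add_noise(latent, noise, alpha_bar):
    """Mix a clean latent with noise; alpha_bar in (0, 1] is the share of signal kept."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * e for x, e in zip(latent, noise)]


def remove_noise(noisy, predicted_noise, alpha_bar):
    """Invert add_noise, given a prediction of the noise that was added."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [(y - noise_scale * e) / signal_scale for y, e in zip(noisy, predicted_noise)]


rng = random.Random(0)
clean = [0.5, -1.0, 0.25, 2.0]                 # a tiny made-up "latent image"
noise = [rng.gauss(0.0, 1.0) for _ in clean]   # the known noise added in training
noisy = add_noise(clean, noise, alpha_bar=0.9)

# With a perfect noise prediction, the original latent is recovered.
recovered = remove_noise(noisy, noise, alpha_bar=0.9)
```

A real U-Net only approximates the noise, which is one reason generation proceeds in many small steps rather than a single subtraction.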

CLIP & Conditioning

CLIP is a model that tokenizes and encodes text into conditioning. This conditioning guides the model during the denoising steps to produce a new image.

The U-Net and CLIP work together during the image generation process at each denoising step. The U-Net removes noise so that the result is similar to images in its training set, while CLIP guides the U-Net towards creating images that are most similar to your prompt.

When you generate an image using text-to-image, multiple steps occur in latent space:

Noise Generation: Random noise is generated at the chosen height and width. The seed determines the noise’s characteristics, so the same seed always produces the same noise. This noise tensor is the starting point in latent space. We’ll call this noise A.
Noise Prediction: Using the model’s U-Net, a noise predictor examines noise A and the conditioning that CLIP produced from your prompt. It generates its own noise tensor, an estimate of the noise to remove so that the result resembles the prompt. We’ll call this noise B.
Subtraction: Noise B is subtracted from noise A to move toward a latent image consistent with the prompt. Noise prediction and subtraction are repeated for the number of sampler steps chosen.
Decoding: The VAE decodes the final latent image from latent space into image space.
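The first three steps above can be sketched as a toy loop. Everything here is a simplified stand-in: `toy_noise_predictor` is a hypothetical replacement for the U-Net, the conditioning is reduced to a single target value, and VAE decoding is omitted. It is meant only to show the shape of the process, not how InvokeAI implements it.

```python
import random


def make_noise(seed, height, width):
    """Noise generation: the seed fully determines the noise (noise A)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]


def toy_noise_predictor(latent, conditioning):
    """Hypothetical stand-in for the U-Net: treats everything that differs
    from the conditioning target as noise (noise B)."""
    return [[value - conditioning for value in row] for row in latent]


def generate(seed, height, width, conditioning, steps):
    latent = make_noise(seed, height, width)
    for step in range(steps):
        predicted = toy_noise_predictor(latent, conditioning)
        # Subtract a share of the predicted noise; the share grows so that
        # the final step removes whatever noise remains.
        fraction = 1.0 / (steps - step)
        latent = [
            [value - fraction * noise for value, noise in zip(row, noise_row)]
            for row, noise_row in zip(latent, predicted)
        ]
    return latent  # a real pipeline would now decode this with the VAE


first = generate(seed=42, height=4, width=4, conditioning=0.5, steps=10)
second = generate(seed=42, height=4, width=4, conditioning=0.5, steps=10)
```

Because the seed fixes the starting noise, `first` and `second` are identical, which mirrors why reusing a seed with the same settings reproduces the same image.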

Image-to-image follows a similar process; only the first step differs:

Encoding & Adding Noise: The input image is encoded from image space into latent space by the VAE. Noise is then added to the input latent image.
- Denoising Strength determines how many steps of noise are added and how much noise is added at each step.
- A strength of 0 means there are 0 steps and no noise is added, resulting in an unchanged image.
- A strength of 1 completely replaces the image with noise, and a full set of denoising steps is performed.
Noise Prediction: Using the model’s U-Net, a noise predictor examines the noisy latent image and the conditioning from your prompt. It generates its own noise tensor, an estimate of the noise to remove so that the result resembles the prompt.
Subtraction: The predicted noise is subtracted from the noisy latent image to move toward a latent image consistent with the prompt. Noise prediction and subtraction are repeated for the remaining sampler steps.
Decoding: The VAE decodes the final latent image from latent space into image space.
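One common way to turn denoising strength into a number of steps is sketched below. This mapping is an assumption for illustration; InvokeAI's own calculation may differ in its details. The function name is hypothetical.

```python
def img2img_steps(total_steps, strength):
    """Return (first_step, steps_to_run) for a given denoising strength.

    Assumed mapping: strength scales how many of the scheduled steps are
    actually run, starting part-way through the schedule.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_to_run = min(int(total_steps * strength), total_steps)
    first_step = total_steps - steps_to_run
    return first_step, steps_to_run


print(img2img_steps(30, 0.0))  # (30, 0)  no noise added, image unchanged
print(img2img_steps(30, 0.5))  # (15, 15) start half-way through the schedule
print(img2img_steps(30, 1.0))  # (0, 30)  full noise, full set of steps
```

Under this assumption, a strength of 0.5 with 30 sampler steps adds half the schedule's worth of noise and then runs the last 15 denoising steps, so the result keeps the broad composition of the input image.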

Summary

Putting it all together

A Model provides the CLIP prompt tokenizer, the VAE, and a U-Net (where noise prediction occurs given a prompt and initial noise tensor).
A Noise Scheduler (e.g. DPM++ 2M Karras) schedules the subtraction of noise from the latent image across the sampler steps chosen. Less noise is usually subtracted at later sampler steps.
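To illustrate how a schedule spreads noise removal across steps, here is a sketch of the noise-level spacing commonly referred to as the "Karras" schedule. The default values for `sigma_min`, `sigma_max`, and `rho` are assumptions chosen for illustration, and the function is not taken from InvokeAI.

```python
def karras_sigmas(steps, sigma_min=0.03, sigma_max=14.6, rho=7.0):
    """Return noise levels from sigma_max down to sigma_min.

    Levels are spaced evenly in sigma ** (1 / rho), which places large
    drops early and small drops late.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    max_inv = sigma_max ** (1.0 / rho)
    min_inv = sigma_min ** (1.0 / rho)
    return [
        (max_inv + i / (steps - 1) * (min_inv - max_inv)) ** rho
        for i in range(steps)
    ]


sigmas = karras_sigmas(10)

# How much the noise level falls between consecutive steps.
drops = [current - following for current, following in zip(sigmas, sigmas[1:])]
```

In this sketch every entry in `drops` is smaller than the one before it: the early steps remove most of the noise and settle the overall composition, while the later steps make small refinements.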
