Taking the time to understand the diffusion process will help you use InvokeAI more effectively.
Stable Diffusion works with two main representations of an image: images and latents.
Image Space
Represents images in pixel form. This is the final visual output you see.
Latent Space
Represents compressed inputs. It’s in latent space that Stable Diffusion processes images.
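To make the compression concrete: in standard Stable Diffusion 1.x (shapes assumed here as an illustration), a 512×512 RGB image is encoded into a 64×64 latent with 4 channels, so the model works on far fewer values:

```python
# Stable Diffusion 1.x shapes (illustrative): 512x512 RGB image -> 64x64x4 latent.
pixel_values = 512 * 512 * 3    # values in image space
latent_values = 64 * 64 * 4     # values in latent space
compression = pixel_values / latent_values
print(compression)  # 48.0 -> the diffusion model processes ~48x fewer values
```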
To fully understand the diffusion process, we need to understand a few more terms: U-Net, CLIP, and conditioning.
U-Net
A model trained on a large number of latent images with known amounts of random noise added. Given a slightly noisy latent image, the U-Net predicts the pattern of noise that must be subtracted to recover the original.
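The idea can be sketched with a toy example, where a single number stands in for a latent image and a perfect oracle stands in for the U-Net (a real U-Net only *estimates* the noise):

```python
import random

original = 0.5                   # a one-number stand-in for a latent image
noise = random.gauss(0.0, 1.0)   # known random noise, as in training
noisy = original + noise         # the "slightly noisy image" the U-Net sees

def predict_noise(noisy_latent):
    # A trained U-Net estimates this pattern; here we cheat with a perfect oracle.
    return noisy_latent - original

# Subtracting the predicted noise recovers the original.
recovered = noisy - predict_noise(noisy)
assert abs(recovered - original) < 1e-12
```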
CLIP & Conditioning
CLIP is a model that tokenizes and encodes text into conditioning. This conditioning guides the model during the denoising steps to produce a new image.
The U-Net and CLIP work together during the image generation process at each denoising step. The U-Net removes noise so that the result is similar to images in its training set, while CLIP guides the U-Net towards creating images that are most similar to your prompt.
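In Stable Diffusion, this guidance is typically applied via classifier-free guidance: the U-Net's noise prediction is computed both with and without the prompt conditioning, and the difference between the two is amplified. A minimal numeric sketch (the two "predictions" are made-up numbers; 7.5 is a common default guidance scale):

```python
def guided_noise(uncond_pred, cond_pred, guidance_scale=7.5):
    # Classifier-free guidance: push the prediction away from the
    # unconditioned result, toward what the prompt conditioning suggests.
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)

# Toy numbers standing in for the U-Net's two noise predictions:
print(guided_noise(2.0, 3.0))  # 9.5 -> the prompt's pull is amplified 7.5x
```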
When you generate an image using text-to-image, multiple steps occur in latent space:

1. Random noise is generated at the chosen output size.
2. Your prompt is tokenized and encoded by CLIP into conditioning.
3. At each denoising step, the U-Net predicts the noise remaining in the latent image, guided by the conditioning, and the scheduler subtracts a portion of it.
4. After the final step, the latent image is decoded back into image space.
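The text-to-image loop can be sketched as toy code, with a scalar in place of the latent tensor and a made-up `predict_noise` in place of the U-Net (none of these names are InvokeAI APIs):

```python
import random

def predict_noise(latent, conditioning):
    # Stand-in for the U-Net: a real model estimates the noise pattern here.
    return latent * 0.5 + conditioning * 0.1   # made-up formula

def text_to_image(conditioning, steps=20):
    latent = random.gauss(0.0, 1.0)   # start from pure random noise
    for _ in range(steps):
        # Each step: predict the noise, then subtract it (the scheduler
        # decides how much to subtract in a real pipeline).
        latent = latent - predict_noise(latent, conditioning)
    return latent   # this latent would then be decoded into a pixel image

result = text_to_image(conditioning=1.0)
```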
Image-to-image is a similar process, with only the first step being different: instead of starting from pure noise, your initial image is encoded into latent space and noise is added to it, with the amount determined by the denoising strength.
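How denoising strength maps to the work actually done can be sketched like this (the exact rounding convention varies between implementations; this is only an illustration):

```python
def steps_to_run(denoising_strength, total_steps):
    # Strength scales both how much noise is added to the encoded image
    # and how many of the scheduled denoising steps are actually performed.
    return int(denoising_strength * total_steps)

print(steps_to_run(0.0, 30))   # 0  -> no noise added, image unchanged
print(steps_to_run(0.75, 30))  # 22 -> partial noising, most steps run
print(steps_to_run(1.0, 30))   # 30 -> fully replaced with noise, all steps run
```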
Denoising Strength
A denoising strength of 0 means no noise is added and no denoising steps are performed, resulting in an unchanged image. A denoising strength of 1 results in the image being completely replaced with noise, and a full set of denoising steps is performed.

Putting it all together
The scheduler (e.g. DPM++ 2M Karras) schedules the subtraction of noise from the latent image across the chosen sampler steps. Less noise is usually subtracted at higher sampler steps.
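The Karras schedule itself can be computed directly. Below is the standard formula (rho = 7; the sigma bounds are illustrative and vary by implementation), demonstrating that each step subtracts less noise than the one before:

```python
def karras_sigmas(n, sigma_min=0.1, sigma_max=10.0, rho=7.0):
    # Noise levels interpolated linearly in sigma^(1/rho) space
    # (the schedule from Karras et al., 2022).
    max_r = sigma_max ** (1 / rho)
    min_r = sigma_min ** (1 / rho)
    return [(max_r + i / (n - 1) * (min_r - max_r)) ** rho for i in range(n)]

sigmas = karras_sigmas(10)
drops = [sigmas[i] - sigmas[i + 1] for i in range(len(sigmas) - 1)]
# Each step subtracts less noise than the previous one:
assert all(drops[i] > drops[i + 1] for i in range(len(drops) - 1))
```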