FP8 Storage

FP8 Storage cuts a model’s VRAM footprint roughly in half by keeping weights on the GPU in 8-bit floating-point format (float8_e4m3fn). During inference, each layer’s weights are upcast on the fly to the compute precision (FP16/BF16) for the forward pass, then cast back to FP8 afterward. Because the math itself runs at compute precision, quality is largely preserved.
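
As a back-of-the-envelope check of the “roughly half” figure, the sketch below compares weight storage at 2 bytes per parameter (FP16/BF16) with 1 byte per parameter (FP8). The helper name, the 12-billion-parameter count, and the 2% skipped fraction are illustrative assumptions, not measurements from InvokeAI, and activations and other runtime buffers are ignored.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "fp8": 1}


def weight_footprint_gib(num_params: int, storage: str,
                         skipped_fraction: float = 0.0,
                         compute: str = "bf16") -> float:
    """Estimate weight memory in GiB (hypothetical helper).

    `skipped_fraction` is the share of parameters left at compute
    precision (for example norm layers and embeddings).
    """
    skipped = num_params * skipped_fraction
    stored = num_params - skipped
    total_bytes = (stored * BYTES_PER_PARAM[storage]
                   + skipped * BYTES_PER_PARAM[compute])
    return total_bytes / 2**30


params = 12_000_000_000  # illustrative size, not a measured model
full = weight_footprint_gib(params, "bf16")
fp8 = weight_footprint_gib(params, "fp8", skipped_fraction=0.02)
print(f"BF16: {full:.1f} GiB, FP8 storage: {fp8:.1f} GiB ({fp8 / full:.0%} of BF16)")
```

The saving lands slightly above 50% of the original size because any layers left at compute precision still cost 2 bytes per parameter.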

It pairs well with Low-VRAM mode: low-VRAM mode streams layers between RAM and VRAM, while FP8 Storage shrinks the layers themselves.

  • NVIDIA GPU on Windows or Linux. FP8 Storage uses CUDA tensor types and is silently disabled on CPU and MPS.
  • CUDA 12.x and recent PyTorch. The float8_e4m3fn dtype was added in PyTorch 2.1, and InvokeAI’s bundled versions satisfy this.

There is no hardware requirement for FP8 compute — InvokeAI casts back to FP16/BF16 for math. This means FP8 Storage works on GPUs that do not natively support FP8 matmul (e.g. RTX 30-series), at a small per-step throughput cost.

InvokeAI’s FP8 path stores weights in FP8 and casts them back to BF16/FP16 on each forward pass via its own register_forward_pre_hook / register_forward_hook wrappers. It uses the same skip list as diffusers’ apply_layerwise_casting, but applies the hooks to every nn.Module, including diffusers ModelMixin subclasses, so it composes correctly with InvokeAI’s CustomLinear and partial loading. The practical benefit of toggling FP8 Storage depends on what your GPU can do natively. There are three tiers:

RTX 30-series and older Ampere workstation cards — VRAM win only

The toggle works as advertised: the UNet / transformer drops by roughly 50% on the GPU. Per-step latency is the same or marginally slower because every forward pass adds an FP8 → BF16 cast on entry and a BF16 → FP8 cast on exit. This is the largest target group: 3090 owners squeezing FLUX into 24 GB benefit the most.

RTX 40-series, RTX 50-series, and Hopper — VRAM win today, compute win possible later

These GPUs have native FP8 tensor cores. The toggle still buys you the same ~50% VRAM reduction today, because the forward pass still runs in BF16 — the hook casts weights back up to compute precision before each layer. If InvokeAI later wires up a true FP8 matmul path (e.g. via torchao), the same toggle will also unlock compute speedups on this hardware. Until then, treat the benefit as “VRAM only, same as Ampere”.

float8_e4m3fn is a pure storage dtype in PyTorch and works on any CUDA device, so pre-Ampere cards (GTX 16-series, RTX 20-series, etc.) get the same ~50% VRAM reduction as Ampere. There are no native FP8 tensor cores on these GPUs, so the throughput trade-off is the same as on the 30-series: cast in, compute in BF16/FP16, cast back out.

FP8 Storage is silently disabled on anything that is not CUDA. On CPU, PyTorch technically supports FP8 dtypes, but the cast operations are software-emulated and end up costing more than the memory savings buy back, so InvokeAI gates the entire path on device.type == "cuda". If you toggle it on CPU or MPS, the loader skips the cast and returns the model unchanged with no log line.
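
The hardware tiers can be summarized as a small lookup. The function name and return strings below are hypothetical. The 8.9 threshold is an assumption based on NVIDIA compute capabilities (Ada reports 8.9 and Hopper 9.0, while Ampere reports 8.0 or 8.6); it is not taken from InvokeAI’s code.

```python
from typing import Tuple

# Lowest compute capability (major, minor) with native FP8 tensor cores.
FP8_TENSOR_CORE_MIN_CC = (8, 9)


def fp8_storage_tier(device_type: str,
                     compute_capability: Tuple[int, int] = (0, 0)) -> str:
    """Hypothetical helper: summarize what the FP8 Storage toggle buys."""
    if device_type != "cuda":
        # CPU and MPS: the loader returns the model unchanged.
        return "disabled"
    if compute_capability >= FP8_TENSOR_CORE_MIN_CC:
        return "vram-only today, compute win possible later"
    # Ampere and older: cast in, compute in BF16/FP16, cast back out.
    return "vram-only"


print(fp8_storage_tier("cuda", (8, 6)))   # e.g. an RTX 30-series card
print(fp8_storage_tier("cuda", (8, 9)))   # e.g. an RTX 40-series card
print(fp8_storage_tier("mps"))
```

On a real system, torch.cuda.get_device_capability() returns the (major, minor) tuple for the current GPU.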

FP8 Storage is a per-model setting, configured from the Model Manager:

  1. Open the Model Manager.
  2. Select a model (Main, ControlNet, or T2I-Adapter).
  3. Under Default Settings, toggle FP8 Storage (Save VRAM).
  4. Click Save.

The setting takes effect on the next load. If the model is already in the cache, InvokeAI evicts the cached copy automatically so the new setting applies. If a generation is currently using the model, the eviction is deferred until that generation finishes.
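
The deferred-eviction behavior can be pictured with a toy cache. Everything in this sketch (class name, methods, lock counting) is a hypothetical illustration of the idea and does not reflect InvokeAI’s actual cache classes.

```python
class ModelCache:
    """Toy cache illustrating immediate vs. deferred eviction."""

    def __init__(self):
        self._entries = {}     # key -> loaded model object
        self._locks = {}       # key -> number of generations using it
        self._pending = set()  # keys to evict once they are unlocked

    def put(self, key, model):
        self._entries[key] = model
        self._locks[key] = 0

    def lock(self, key):
        self._locks[key] += 1
        return self._entries[key]

    def unlock(self, key):
        self._locks[key] -= 1
        if self._locks[key] == 0 and key in self._pending:
            self._drop(key)

    def evict(self, key):
        """Called when a setting such as FP8 Storage changes."""
        if key not in self._entries:
            return
        if self._locks[key] > 0:
            self._pending.add(key)  # in use: defer until unlock
        else:
            self._drop(key)         # idle: evict immediately

    def _drop(self, key):
        self._entries.pop(key, None)
        self._locks.pop(key, None)
        self._pending.discard(key)

    def __contains__(self, key):
        return key in self._entries


cache = ModelCache()
cache.put("flux", object())
cache.lock("flux")                    # a generation starts
cache.evict("flux")                   # user toggles FP8 Storage and saves
cached_during_generation = "flux" in cache
cache.unlock("flux")                  # the generation finishes
cached_after_generation = "flux" in cache
print(cached_during_generation, cached_after_generation)
```

The next load after the unlock misses the cache, so the model is rebuilt with the new setting.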

FP8 Storage is applied only to models and layers where the precision trade-off is acceptable:

| Model type | FP8 applied? |
| --- | --- |
| Main models (SD1, SD2, SDXL) | Yes |
| FLUX.1 / FLUX.2 Klein | Yes |
| ControlNet, T2I-Adapter | Yes |
| VAE | No: visible decode-quality regression |
| Text encoders, tokenizers | No: small models, no benefit |
| Z-Image (any variant) | No: dtype mismatch with skipped layers |
| LoRA, ControlLoRA | No: patched into base, not run alone |
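
The table can be read as a simple eligibility lookup. The string keys and the function below are hypothetical labels chosen for illustration; they are not InvokeAI’s internal model-type identifiers.

```python
from typing import Optional, Tuple

# None means FP8 Storage is applied; a string gives the reason it is not.
FP8_EXCLUSION_REASON = {
    "main": None,
    "flux": None,
    "controlnet": None,
    "t2i_adapter": None,
    "vae": "visible decode-quality regression",
    "text_encoder": "small models, no benefit",
    "z_image": "dtype mismatch with skipped layers",
    "lora": "patched into base, not run alone",
    "control_lora": "patched into base, not run alone",
}


def fp8_applies(model_type: str, toggle_on: bool) -> Tuple[bool, Optional[str]]:
    """Return (applied, reason_if_not) for a model type (hypothetical helper)."""
    if not toggle_on:
        return False, "FP8 Storage is switched off for this model"
    if model_type not in FP8_EXCLUSION_REASON:
        # Assumption: unknown types are treated conservatively.
        return False, "unknown model type"
    reason = FP8_EXCLUSION_REASON[model_type]
    return reason is None, reason


print(fp8_applies("flux", True))
print(fp8_applies("vae", True))
```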

Within a supported model, norm layers, position/patch embeddings, and proj_in/proj_out are skipped so precision-sensitive tiny learned scalars (e.g. FLUX RMSNorm.scale) aren’t crushed to FP8. This mirrors the diffusers default skip list.
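
A name-based skip list of this kind can be sketched with regular expressions. The patterns below are modeled on the layer kinds named in the paragraph above and are assumptions; the exact patterns used by InvokeAI and diffusers may differ. The module names are made up for illustration.

```python
import re

# Assumed patterns: substring matches for norms and embeddings,
# anchored matches for the top-level projection layers.
SKIP_PATTERNS = ("norm", "pos_embed", "patch_embed", r"^proj_in$", r"^proj_out$")


def should_skip(module_name: str) -> bool:
    """Return True if a module should stay at compute precision."""
    return any(re.search(pattern, module_name) for pattern in SKIP_PATTERNS)


names = [
    "transformer_blocks.0.norm1",
    "transformer_blocks.0.attn.to_q",
    "pos_embed",
    "proj_out",
    "transformer_blocks.0.ff.proj_out",  # nested, so the anchored pattern misses it
]
cast_to_fp8 = [name for name in names if not should_skip(name)]
print(cast_to_fp8)
```

With these assumed patterns, only the attention projection and the nested feed-forward projection are cast; the norm, the position embedding, and the top-level proj_out stay at compute precision.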

FP8 Storage is near-lossless for most workloads because:

  • Norms and embeddings (the precision-sensitive layers) are skipped.
  • The actual matmul still happens in FP16/BF16 — FP8 is only the on-GPU storage format.
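
To see why the skipped layers matter, it helps to look at how coarse FP8 is. The pure-Python sketch below rounds a value to the nearest float8_e4m3fn number (3 mantissa bits, exponent bias 7, largest finite value 448). It is a simplified model written for illustration: round_to_e4m3fn is a hypothetical helper, and overflow handling is left out.

```python
import math

E4M3FN_MAX = 448.0    # largest finite float8_e4m3fn value
MIN_NORMAL_EXP = -6   # smallest normal exponent (bias 7)
MANTISSA_BITS = 3


def round_to_e4m3fn(x: float) -> float:
    """Round x to the nearest float8_e4m3fn value, ties to even.

    Simplified model: values beyond the finite range raise instead of
    reproducing the format's overflow behavior.
    """
    if x == 0.0:
        return 0.0
    a = abs(x)
    if a > E4M3FN_MAX:
        raise ValueError("outside the finite float8_e4m3fn range")
    _, exp = math.frexp(a)             # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, MIN_NORMAL_EXP)   # clamp into the subnormal range
    step = 2.0 ** (e - MANTISSA_BITS)  # spacing between neighboring values
    q = round(a / step) * step         # Python's round() is ties-to-even
    return math.copysign(q, x)


for value in (1.03, 0.1, 0.0005):
    print(value, "->", round_to_e4m3fn(value))
```

Near 1.0 the representable values are 0.125 apart, so a learned norm scale of 1.03 is stored as exactly 1.0, and very small values flush to zero. Large weight matrices tolerate this rounding far better than a handful of scale parameters do, which is the reasoning behind the skip list.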

That said, artifacts have been reported in two areas:

  • VAEs: FP8 causes a visible decode-quality regression, so VAEs are never cast (the toggle has no effect on VAE submodels).
  • Heavy LoRA stacks: patching is unaffected, but very precision-sensitive LoRAs may show slight drift. Compare outputs side by side if your workflow depends on subtle LoRA behavior.

If you see unexpected quality regressions, disable FP8 Storage on the affected model and re-run.

FP8 + partial loading: fully supported. FP8 Storage shrinks the layers; partial loading streams them between RAM and VRAM as needed. Use both on tight VRAM budgets.

(For why FP8 Storage doesn’t stack on top of GGUF / NF4 / int8 checkpoints, see the callout at the top of this page.)

“I toggled FP8 Storage but VRAM usage didn’t change”

The cache eviction is immediate for idle models, but deferred until the next unlock if the model is mid-generation. Wait for the current generation to finish, then start a new one — the next load will use the new setting.

If VRAM still hasn’t dropped:

  • Check the InvokeAI log for the line “FP8 layerwise casting enabled for <model name>”. If the line isn’t there, the model is on the exclusion list (VAE, text encoder, Z-Image, LoRA; see the table above).
  • Confirm you are on CUDA. FP8 Storage is silently disabled on CPU and MPS.
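
The log check can be scripted. The marker string comes from the log line quoted above; the function name and the sample log lines (including their timestamp format and the model name) are made up for illustration.

```python
from typing import Iterable, Optional

FP8_LOG_MARKER = "FP8 layerwise casting enabled for"


def fp8_enabled_in_log(lines: Iterable[str],
                       model_name: Optional[str] = None) -> bool:
    """Return True if the FP8 marker appears, optionally for one model."""
    for line in lines:
        if FP8_LOG_MARKER in line:
            if model_name is None or model_name in line:
                return True
    return False


# Made-up sample lines; real log formatting may differ.
sample_log = [
    "[2025-01-01 12:00:00] Loading model my-flux-model",
    "[2025-01-01 12:00:03] FP8 layerwise casting enabled for my-flux-model",
]
print(fp8_enabled_in_log(sample_log, "my-flux-model"))
print(fp8_enabled_in_log(sample_log, "some-vae"))
```

To scan a real log, pass an open file object instead of the list, since iterating over a file yields its lines.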

If image quality got worse after enabling FP8 Storage, disable it for that model in Model Manager and reload. If quality is restored, the model has FP8-sensitive layers that fall outside the default skip list. Please open an issue with the model name and a side-by-side comparison.

If loading fails because PyTorch does not recognize the FP8 dtype, you’re on a PyTorch version that predates FP8 support. Reinstall InvokeAI using the official launcher; the bundled torch version supports FP8.

If FP8 Storage misbehaves — crash, quality regression, OOM that shouldn’t happen — please open a GitHub issue and include:

  • What you did: the workflow / generation step that triggered the problem, and whether it reproduces every time.
  • Model: exact name and variant (e.g. “FLUX.2 Klein 9B Diffusers”, “SDXL Base 1.0 single-file”), and whether the file is a full-precision checkpoint or already quantized (GGUF / NF4 / int8).
  • LoRAs: whether any LoRAs (or ControlLoRAs) are stacked on the model, and how many.
  • Other toggles: Low-VRAM mode on/off, any cpu_only text encoder setting, configured VRAM limit.
  • GPU: model and VRAM size (e.g. “RTX 3090 24 GB”, “RTX 4070 Ti 12 GB”).
  • OS: Windows or Linux, plus driver / CUDA version if you have it.
  • Logs: the InvokeAI log around the failure — in particular the FP8 layerwise casting enabled for <model> line (or its absence) and any traceback.

A side-by-side image comparison (FP8 on vs. FP8 off, same seed) is extremely useful for quality regressions.