[ComfyUI Advanced] Speeding Up with Custom Nodes! The Tradeoff Between Speed and Quality


- ComfyUI-MultiGPU is an essential node
- TeaCache and WaveSpeed offer significant effects but reduce quality
- torch.compile can avoid degradation with proper settings
Introduction
Hello, I’m Easygoing.
In this advanced edition of ComfyUI, we’ll explore how to speed things up using custom nodes.
Custom Nodes for Speeding Up
The two custom nodes I’ll introduce this time are as follows:
- ComfyUI-MultiGPU (Optimizes VRAM management)
- ComfyUI-TeaCache (Uses caching for speed improvements)
1. ComfyUI-MultiGPU
First up is ComfyUI-MultiGPU.
ComfyUI-MultiGPU is a custom node that optimizes VRAM management.
With ComfyUI-MultiGPU, you can utilize the VRAM of multiple GPUs, improving memory usage efficiency.
Additionally, you can set the model loading destination to system RAM instead of VRAM, making it an essential node even if you only have one GPU.
Installing ComfyUI-MultiGPU
Now, let’s look at how to use ComfyUI-MultiGPU.
First, install ComfyUI-MultiGPU using ComfyUI-Manager.

When using GGUF format files, you’ll need to pre-install the ComfyUI-GGUF custom node. For the Florence-2 model, install ComfyUI-Florence2 in advance.
Configuring ComfyUI-MultiGPU!
Here’s how to set up ComfyUI-MultiGPU:

| Component | Size | Processing Load | Load Destination |
| --- | --- | --- | --- |
| Text Encoder | Large | Light | System RAM |
| UNET / Transformer | Large | Heavy | VRAM |
| VAE | Small | Heavy | VRAM |
| Florence-2 | Small | Medium | VRAM |
Model Load Destinations
- cuda: VRAM
- cpu: System RAM
Among the model components, the text encoder has a light processing load, so loading it into system RAM doesn’t significantly affect processing time.
On the other hand, UNET/Transformer and VAE require heavy processing, so it’s faster to use the main VRAM, even if it involves swapping models.
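To see why this split works, here is a minimal sketch using a stand-in encoder (not the actual ComfyUI loaders): a text encoder's weights can stay in system RAM because the only thing the GPU ever needs from it is a small embedding tensor, while the UNet touches its full weights on every sampling step.

```python
import torch

# Stand-in, CLIP-sized text encoder kept on the CPU (system RAM).
text_encoder = torch.nn.Embedding(49408, 768)
tokens = torch.randint(0, 49408, (1, 77))            # one 77-token prompt

with torch.no_grad():
    cond = text_encoder(tokens)                       # computed once, on the CPU

weights_mb = sum(p.numel() * p.element_size() for p in text_encoder.parameters()) / 1e6
print(f"encoder weights kept in system RAM: {weights_mb:.0f} MB")
print(f"embedding actually sent to the GPU: {cond.numel() * cond.element_size() / 1e6:.2f} MB")

if torch.cuda.is_available():
    cond = cond.to("cuda")                            # the only CPU->GPU transfer needed per prompt
```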
When Using Multiple GPUs
When equipped with multiple GPUs, computations are still handled by the main GPU. However, with ComfyUI-MultiGPU, you can freely specify the model load destination.
```mermaid
flowchart TB
  subgraph "Main GPU"
    A1(GPU)
    A2(VRAM)
  end
  subgraph "Sub GPU"
    B1(GPU)
    B2(VRAM)
  end
  subgraph "Motherboard"
    C1(CPU)
    C2(System RAM)
  end
  A1-->A2
  A2-->A1
  A1--->B2
  B2--->A1
  A1---->C2
  C2---->A1
```
The VRAM of the sub-GPU generally operates faster than system RAM.
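If you want to check this on your own hardware, the following minimal sketch (assuming two CUDA GPUs; the tensor size is arbitrary) times a copy from system RAM to the main GPU against a copy from the sub-GPU's VRAM:

```python
import time
import torch

def copy_time(src: torch.Tensor, repeats: int = 5) -> float:
    """Average time to copy a tensor onto the main GPU (cuda:0)."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        _ = src.to("cuda:0")
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

if torch.cuda.device_count() >= 2:
    blob_ram = torch.randn(256, 1024, 1024)           # ~1 GB held in system RAM
    blob_sub = blob_ram.to("cuda:1")                   # the same data in the sub-GPU's VRAM
    print(f"system RAM -> cuda:0 : {copy_time(blob_ram):.3f} s")
    print(f"cuda:1     -> cuda:0 : {copy_time(blob_sub):.3f} s")
```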
Usage Example
Here’s a practical example from my setup:
Configuration
- Ryzen APU (cpu)
- RTX 4060 Ti 16GB (cuda:0)
- GTX 1070 8GB (cuda:1)
Workflow
```mermaid
flowchart LR
  subgraph "Main Workflow"
    A1(SDXL)
    B1(AuraFlow)
    C1(SDXL)
    E1(Flux1)
    F1(Finished)
  end
  subgraph "Florence-2"
    D1(Florence-2)
  end
  A1-->B1
  B1-->C1
  D1-.->C1
  C1-->E1
  D1-.->E1
  E1-->F1
```
Settings
- UNET / Transformer, VAE → cuda:0
- Florence-2, Text Encoder (AuraFlow) → cuda:1
- Text Encoder (Flux.1 / SDXL) → cpu
For UNET/Transformer and VAE, using the main VRAM is faster, even with model swapping.
Meanwhile, Florence-2 runs reasonably fast when loaded into the sub-GPU’s VRAM.

Since the sub-GPU’s VRAM still has capacity after loading Florence-2, I’ve also assigned AuraFlow’s text encoder to fit within the remaining VRAM space.
This minimizes model movement, saving processing time.
Separating Model Components!
Some SDXL and Flux.1 models are distributed as combined checkpoints (UNet/Transformer, text encoder, and VAE in a single file), but you can easily separate them using the following method:

Separating components allows you to set individual load destinations with ComfyUI-MultiGPU. Plus, when several models share a component, you only need to keep one copy of it, saving memory and storage.
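As a rough illustration of the underlying idea (not necessarily the exact method linked above): a combined checkpoint is just one state dict whose keys are prefixed per component, so it can be split by prefix. The prefixes and the file name below are assumptions for a typical SDXL-style checkpoint; inspect your own file's keys before relying on them.

```python
from safetensors.torch import load_file, save_file

# Hypothetical key prefixes for an SDXL-style combined checkpoint.
# They differ between model families, so verify against your own file.
PREFIXES = {
    "unet": "model.diffusion_model.",
    "vae": "first_stage_model.",
    "text_encoders": "conditioner.",
}

state = load_file("sdxl_checkpoint.safetensors")      # placeholder path
for component, prefix in PREFIXES.items():
    tensors = {k: v for k, v in state.items() if k.startswith(prefix)}
    if tensors:
        save_file(tensors, f"{component}.safetensors")
        print(f"{component}: {len(tensors)} tensors written")
```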

2. ComfyUI-TeaCache
In the second half, I’ll introduce ComfyUI-TeaCache.

ComfyUI-TeaCache speeds up processing with two approaches:
- Using caching to skip computations (TeaCache)
- Compiling models into an optimized form for computation (torch.compile)
Speeding Up with Caching
First, let’s look at speeding up with caching.
The caching mechanism in image generation skips computations by reusing previous results when a step is expected to produce output similar to the one before.
How to Use TeaCache
The TeaCache node supports Flux.1 and video generation AI.

- rel_l1_thresh (0–1): How much caching to apply
- max_skip_steps (1–3): Upper limit for consecutive cache applications
Both rel_l1_thresh and max_skip_steps increase speed as their values rise, but image quality decreases accordingly.
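Conceptually, the node compares how much the model input has changed since the last fully computed step; if the relative change stays below rel_l1_thresh, the cached result is reused, but at most max_skip_steps times in a row. Here is a minimal sketch of that decision logic (a toy stand-in, not TeaCache's actual implementation):

```python
import torch

def should_skip(prev_x, cur_x, rel_l1_thresh, skipped, max_skip_steps):
    """Toy cache decision: skip when the input barely changed and we haven't skipped too often."""
    if prev_x is None or skipped >= max_skip_steps:
        return False
    rel_change = (cur_x - prev_x).abs().mean() / prev_x.abs().mean()
    return rel_change.item() < rel_l1_thresh

base = torch.randn(4, 16)                              # stand-in for the latent input
prev_x, cached_out, skipped = None, None, 0
for step in range(10):
    x = base + torch.randn_like(base) * (0.5 ** step)  # the input drifts less and less each step
    if should_skip(prev_x, x, rel_l1_thresh=0.4, skipped=skipped, max_skip_steps=3):
        out, skipped = cached_out, skipped + 1          # reuse the previous (expensive) result
        print(f"step {step}: skipped (cache hit)")
    else:
        out, skipped = x * 2.0, 0                       # stand-in for the real model call
        cached_out = out
        print(f"step {step}: computed")
    prev_x = x
```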
Actual Measurements!
Now, let’s examine the balance between speed and quality when varying rel_l1_thresh and max_skip_steps with Flux.1.
Conditions
- blue_pencil-flux1-v1.0.0-BF16
- Euler sampler
- normal scheduler, 30 steps
max_skip_steps = 1
- X-axis: Generation time
- Y-axis: Image quality (similarity to original)
- Small numbers indicate rel_l1_thresh settings
max_skip_steps = 2
max_skip_steps = 3
The graphs show that higher positions indicate better quality, while moving left indicates faster generation.
In all graphs, image quality decreases proportionally to speed increases.
Comparing the max_skip_steps settings, the balance doesn't change significantly, so when using TeaCache, the maximum setting of max_skip_steps = 3 seems ideal.
Comparing Actual Images
From the graphs, TeaCache’s maximum settings reduce processing time to about one-third.
Now, let’s compare the actual illustrations to assess quality:

Left: original image / Right: image generated in one-third the time
Comparing the illustrations, the cached image shows inaccuracies in fingers and text, and the background is simplified.
While TeaCache offers significant speed improvements, whether the right image is acceptable depends on the use case.
Personally, I think maximizing TeaCache for rough sketches and avoiding caching for final outputs is a good approach.
Comparison with WaveSpeed’s First Block Cache
As a similar feature to TeaCache, I previously introduced WaveSpeed’s First Block Cache.
First Block Cache is an extended development of TeaCache and can be used with SDXL and SD 3.5, not just Flux.1.
First Block Cache allows finer adjustments than TeaCache, and with proper settings, it can reduce quality degradation more effectively.
However, at its maximum settings, First Block Cache doesn’t produce satisfactory images, so it requires more tuning time than TeaCache.
For ease of use, go with TeaCache; for pursuing both speed and quality, First Block Cache is a better choice.
Introducing torch.compile
ComfyUI-TeaCache also allows further speed improvements with the torch.compile function.
Since implementing torch.compile is somewhat complex, refer to Akkyoss’s article for details:
torch.compile converts the model into an optimized, compiled form before computation.
While the first image generation is slower with torch.compile, subsequent generations become faster.
torch.compile Settings
Nodes that support torch.compile include ComfyUI’s default TorchCompileModel node and ComfyUI-TeaCache’s Compile Model node.
In my environment, the TorchCompileModel node showed no noticeable effect, so here I’ll introduce the Compile Model node.

Setting Values
- mode
  - default: balanced, default optimization
  - max_autotune: maximum performance
- backend
  - inductor: PyTorch's standard acceleration
  - cudagraph: NVIDIA CUDA-specific acceleration
- fullgraph: compiles the entire model as a single graph; increases speed but also compile time
- dynamic: allows variable-size inputs; supported only by inductor
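As a minimal sketch of what these options correspond to in PyTorch itself (using a tiny stand-in module rather than a real UNet; note that PyTorch spells the backend "cudagraphs", which I assume the node's cudagraph option maps to):

```python
import torch

# Tiny stand-in module; in ComfyUI the node wraps the actual UNet / Transformer.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.GELU(),
    torch.nn.Linear(64, 64),
)

compiled = torch.compile(
    model,
    mode="default",       # "max-autotune" trades much longer compile time for peak speed
    backend="inductor",   # or "cudagraphs"
    fullgraph=False,      # True compiles the whole model as a single graph
    dynamic=False,        # True allows variable input shapes (inductor only)
)

x = torch.randn(8, 64)
_ = compiled(x)           # first call is slow: compilation happens here
_ = compiled(x)           # later calls run the optimized code
```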
Speed Improvements with torch.compile
Here are the generation times for the second and subsequent images in my environment with the CompileModel node:
| mode | backend | fullgraph | dynamic | Time (%) | MAE Quality | SSIM Quality |
| --- | --- | --- | --- | --- | --- | --- |
| default | inductor | false | false | -0.1 % | -7.3 % | -12.3 % |
| default | inductor | true | false | -14.1 % | -7.4 % | -12.4 % |
| default | inductor | false | true | -14.1 % | -7.4 % | -12.4 % |
| default | inductor | true | true | -13.9 % | -7.4 % | -12.4 % |
| default | cudagraph | false | false | -14.0 % | -0.0 % | -0.0 % |
| default | cudagraph | true | false | -14.0 % | -0.0 % | -0.0 % |
| default | cudagraph | false | true | -14.9 % | -0.0 % | -0.0 % |
| default | cudagraph | true | true | -14.0 % | -0.0 % | -0.0 % |
| max_autotune | inductor | false | false | -0.6 % | -7.3 % | -12.3 % |
| max_autotune | inductor | true | false | -0.1 % | -7.3 % | -12.3 % |
| max_autotune | inductor | false | true | -0.0 % | -7.4 % | -12.3 % |
| max_autotune | inductor | true | true | -0.0 % | -7.3 % | -12.3 % |
| max_autotune | cudagraph | false | false | -0.0 % | -0.0 % | -0.0 % |
| max_autotune | cudagraph | true | false | -0.1 % | -0.0 % | -0.0 % |
| max_autotune | cudagraph | false | true | -0.0 % | -0.0 % | -0.0 % |
| max_autotune | cudagraph | true | true | -0.1 % | -0.0 % | -0.0 % |
The CompileModel node achieved about 14% speedup when mode was set to default, but there was no change with max_autotune.
When the backend was set to cudagraph, there was no quality degradation.
Using Sage Attention and FP8 Format
Lastly, I’ll introduce two additional speedup methods outside of ComfyUI-TeaCache.
Sage Attention
Sage Attention is a speedup technique using a new Python library that has recently gained popularity.
For installation, refer to Akkyoss’s article mentioned earlier.
When running, specify --use-sage-attention at ComfyUI startup.
| Method | Time (%) | MAE Quality | SSIM Quality |
| --- | --- | --- | --- |
| SageAttention | -16.6 % | -2.0 % | -2.2 % |
SageAttention achieves about 16% speedup, with slight quality degradation.
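Under the hood, the flag swaps ComfyUI's attention function for the sageattention kernel. The sketch below calls the library directly, following its README; the sageattn signature and tensor layout are assumptions on my part, so treat it as illustration only (it requires a CUDA GPU and the sageattention package):

```python
import torch
import torch.nn.functional as F
from sageattention import sageattn   # assumed entry point, per the library's README

# (batch, heads, tokens, head_dim) attention inputs in FP16 on the GPU.
q = torch.randn(1, 16, 1024, 64, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out_sage = sageattn(q, k, v, is_causal=False)          # SageAttention kernel
out_ref = F.scaled_dot_product_attention(q, k, v)      # PyTorch reference attention

# Rough look at the precision gap between the two implementations.
print((out_sage.float() - out_ref.float()).abs().mean())
```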
Using FP8 Format
For users with Nvidia GPUs from the RTX 4000 series onward, using FP8 format instead of BF16 can speed things up.
- FP8e4m3 format: Current mainstream
- FP8e5m2 format: Lower precision, rarely used now
To use FP8e4m3 format, specify --fp8_e4m3fn-unet at ComfyUI startup.
| Format | Time (%) | MAE Quality | SSIM Quality |
| --- | --- | --- | --- |
| BF16 | 0 % | -0.0 % | -0.0 % |
| FP8e4m3 | -17.1 % | -7.1 % | -11.7 % |
| FP8e5m2 | -17.1 % | -7.5 % | -11.6 % |
Using FP8 format significantly speeds up processing, but quality degrades accordingly.
Note that specifying FP8 format on an unsupported GPU may slow down processing instead.
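If you'd like a feel for where the quality gap comes from, here is a minimal sketch that round-trips BF16 values through both FP8 formats and measures the rounding error (it needs PyTorch 2.1 or later for the float8 dtypes, and it only illustrates storage precision, not speed):

```python
import torch

# Cast BF16 values to FP8 and back, then measure the relative rounding error.
weights = torch.randn(1_000_000, dtype=torch.bfloat16)

for fp8 in (torch.float8_e4m3fn, torch.float8_e5m2):
    roundtrip = weights.to(fp8).to(torch.bfloat16)
    err = (roundtrip - weights).abs().mean() / weights.abs().mean()
    print(f"{fp8}: mean relative error {err.item():.3%}")
```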
Reference: Quality Comparison of FP8 Formats
Summary: The Tradeoff Between Speed and Quality
| Method | Speedup | Degradation |
| --- | --- | --- |
| ComfyUI-MultiGPU | 🔵 | None |
| ComfyUI-TeaCache | 🔵 | +++ |
| ComfyUI-WaveSpeed | 🔵 | +++ |
| torch.compile | 🔺 | None to ++ |
| SageAttention | 🔺 | + |
| FP8 Format | 🔺 | +++ |
- ComfyUI-MultiGPU is an essential node
- TeaCache and WaveSpeed offer significant effects but reduce quality
- torch.compile can avoid degradation with proper settings
Among the custom nodes introduced, ComfyUI-MultiGPU has no downsides, making it a must-have.
TeaCache and WaveSpeed, which use caching, offer substantial benefits but degrade quality, so they’re great for rough sketches.
For torch.compile, I recommend trying it in your environment and adopting it if there’s no quality loss.

Since the methods introduced work through different mechanisms, combining them can yield synergistic effects.
Many speedup features are described as “speeding up at the cost of slight quality degradation,” but it’s important to understand their actual impact in your environment.
I’ve also made available a page for numerically comparing image differences used in this analysis, so feel free to try it out:
Thank you for reading to the end!