[ComfyUI Advanced] Speeding Up with Custom Nodes! The Tradeoff Between Speed and Quality

a close - up of a female animated character with silver hair and purple eyes wearing a black outfit with intricate lace patterns surrounded by purple flowers and set against a dark background with a h.png
  • ComfyUI-MultiGPU is an essential node
  • TeaCache and WaveSpeed offer significant speedups but reduce quality
  • torch.compile can avoid degradation with proper settings

Introduction

Hello, I’m Easygoing.

In this advanced edition of ComfyUI, we’ll explore how to speed things up using custom nodes.

Custom Nodes for Speeding Up

The two custom nodes I’ll introduce this time are as follows:

  1. ComfyUI-MultiGPU (Optimizes VRAM management)
  2. ComfyUI-TeaCache (Uses caching for speed improvements)

1. ComfyUI-MultiGPU

First up is ComfyUI-MultiGPU.

ComfyUI-MultiGPU is a custom node that optimizes VRAM management.

With ComfyUI-MultiGPU, you can utilize the VRAM of multiple GPUs, improving memory usage efficiency.

Additionally, you can set the model loading destination to system RAM instead of VRAM, making it an essential node even if you only have one GPU.

Installing ComfyUI-MultiGPU

Now, let’s look at how to use ComfyUI-MultiGPU.

First, install ComfyUI-MultiGPU using ComfyUI-Manager.

Search screen for multigpu of custom nodes in comfyui manager with comment.png (1600×1007)

When using GGUF format files, you’ll need to pre-install the ComfyUI-GGUF custom node. For the Florence-2 model, install ComfyUI-Florence2 in advance.

Configuring ComfyUI-MultiGPU!

Here’s how to set up ComfyUI-MultiGPU:

Screenshot of example usage of comfyui multigpu model load node.png (1600×747)
Component            Size   Processing Load  Load Destination
Text Encoder         Large  Light            System RAM
UNET / Transformer   Large  Heavy            VRAM
VAE                  Small  Heavy            VRAM
Florence-2           Small  Medium           VRAM

Model Load Destinations

  • cuda: VRAM
  • cpu: System RAM

Among the model components, the text encoder has a light processing load, so loading it into system RAM doesn’t significantly affect processing time.

On the other hand, UNET/Transformer and VAE require heavy processing, so it’s faster to use the main VRAM, even if it involves swapping models.
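
As a rough illustration of why this split works, here is a minimal PyTorch sketch. This is not ComfyUI-MultiGPU's actual code; the tiny Linear layers are only stand-ins for real model components.

```python
import torch

# Stand-in modules; real text encoders and UNets are much larger,
# but the device-placement logic is the same.
text_encoder = torch.nn.Linear(768, 768)
unet = torch.nn.Linear(768, 768)

text_encoder.to("cpu")                              # "cpu"  -> system RAM
device = "cuda:0" if torch.cuda.is_available() else "cpu"
unet.to(device)                                     # "cuda" -> VRAM

# The text encoder runs once per prompt, so a CPU pass costs little.
cond = text_encoder(torch.randn(1, 768))

# The UNet/Transformer runs on every sampling step, so it should stay in VRAM.
x = cond.to(device)
for _ in range(30):                                 # 30 sampling steps
    x = unet(x)
```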

When Using Multiple GPUs

When equipped with multiple GPUs, computations are still handled by the main GPU. However, with ComfyUI-MultiGPU, you can freely specify the model load destination.

flowchart TB
subgraph Main GPU
A1(GPU)
A2(VRAM)
end
subgraph Sub GPU
B1(GPU)
B2(VRAM)
end
subgraph Motherboard
C1(CPU)
C2(System RAM)
end
A1-->A2
A2-->A1
A1--->B2
B2--->A1
A1---->C2
C2---->A1

The VRAM of the sub-GPU generally operates faster than system RAM.
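
If you want to check this in your own environment, a rough benchmark along the following lines compares the two transfer paths. It assumes two CUDA devices (cuda:0 and cuda:1) are visible; actual numbers depend on your PCIe lanes and RAM speed.

```python
import time
import torch

def avg_copy_seconds(src: torch.Tensor, dst: str, repeats: int = 10) -> float:
    """Rough average time to copy `src` onto device `dst`."""
    for d in range(torch.cuda.device_count()):
        torch.cuda.synchronize(d)
    start = time.perf_counter()
    for _ in range(repeats):
        src.to(dst)
    for d in range(torch.cuda.device_count()):
        torch.cuda.synchronize(d)
    return (time.perf_counter() - start) / repeats

data_shape = (1024, 1024, 64)                        # ~256 MB of fp32 data
ram_tensor = torch.randn(data_shape, pin_memory=True)       # pinned system RAM
sub_vram_tensor = torch.randn(data_shape, device="cuda:1")  # sub-GPU VRAM

print("System RAM -> cuda:0:", avg_copy_seconds(ram_tensor, "cuda:0"))
print("cuda:1     -> cuda:0:", avg_copy_seconds(sub_vram_tensor, "cuda:0"))
```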

Usage Example

Here’s a practical example from my setup:

Configuration

  • Ryzen APU (cpu)
  • RTX 4060 Ti 16GB (cuda:0)
  • GTX 1070 8GB (cuda:1)

Workflow

flowchart LR
subgraph Main Workflow
A1(SDXL)
B1(AuraFlow)
C1(SDXL)
E1(Flux1)
F1(Finished)
end
subgraph Florence-2
D1(Florence-2)
end
A1-->B1
B1-->C1
D1-.->C1
C1-->E1
D1-.->E1
E1-->F1

Settings

  • UNET / Transformer, VAE → cuda:0
  • Florence-2, Text Encoder (AuraFlow) → cuda:1
  • Text Encoder (Flux.1 / SDXL) → cpu

For UNET/Transformer and VAE, using the main VRAM is faster, even with model swapping.

Meanwhile, Florence-2 runs reasonably fast when loaded into the sub-GPU’s VRAM.

an animated female character with silver hair purple eyes and a black outfit is depicted against a dark background with swirling pink and purple patterns giving off a dreamy and ethereal ambiance. the.png

Since the sub-GPU’s VRAM still has spare capacity after loading Florence-2, I’ve also assigned AuraFlow’s text encoder there.

This minimizes model movement, saving processing time.

Separating Model Components!

SDXL and Flux.1 models are often distributed as a single file with all components combined, but you can easily separate them using the following method:

Workflow to separate each component of the model.png (1600×810)

Separating the components allows you to set an individual load destination for each one with ComfyUI-MultiGPU. Plus, when multiple models share a component, you only need to keep one copy, saving memory and storage.
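
The image above shows the workflow-based method. If you prefer to script the split outside ComfyUI, something like the following sketch with the safetensors library also works; the file name and key prefixes are only examples (SDXL-style) and must be checked against your own checkpoint's keys.

```python
from safetensors.torch import load_file, save_file

# Hypothetical file name; inspect checkpoint.keys() to find the real prefixes.
checkpoint = load_file("model_checkpoint.safetensors")

groups = {
    "model_unet.safetensors": "model.diffusion_model.",  # UNet / Transformer
    "model_clip.safetensors": "conditioner.",            # text encoder(s)
    "model_vae.safetensors":  "first_stage_model.",      # VAE
}

for out_name, prefix in groups.items():
    tensors = {k: v for k, v in checkpoint.items() if k.startswith(prefix)}
    if tensors:
        save_file(tensors, out_name)
        print(f"{out_name}: {len(tensors)} tensors")
```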

Two-frame anime illustration depicting a silver-haired wizard 2.png (1600×1131)

2. ComfyUI-TeaCache

In the second half, I’ll introduce ComfyUI-TeaCache.

comfyui-manager's comfyui-teacache custom node search screen with comment.png (1600×825)

ComfyUI-TeaCache speeds up processing with two approaches:

  • Using caching to skip computations (TeaCache)
  • Compressing models into an optimized form for computation (torch.compile)

Speeding Up with Caching

First, let’s look at speeding up with caching.

The caching mechanism in image generation skips computations by reusing previous results whenever a step is expected to produce an output similar to the one before it.

How to Use TeaCache

The TeaCache node supports Flux.1 and video generation AI.

Screenshot of teacache custom node in comfyui.png (1600×821)
  • rel_l1_thresh (0–1): How much caching to apply
  • max_skip_steps (1–3): Upper limit for consecutive cache applications

Both rel_l1_thresh and max_skip_steps increase speed as their values rise, but image quality decreases accordingly.
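
To make the two parameters concrete, here is a highly simplified sketch of the kind of caching logic involved. This is not TeaCache's actual implementation (the real node uses a more sophisticated indicator of how much the step changed), but it shows how rel_l1_thresh and max_skip_steps interact.

```python
import torch

class NaiveStepCache:
    """Simplified illustration of TeaCache-style step skipping (not the real algorithm)."""

    def __init__(self, rel_l1_thresh: float = 0.4, max_skip_steps: int = 3):
        self.rel_l1_thresh = rel_l1_thresh
        self.max_skip_steps = max_skip_steps
        self.prev_input = None
        self.cached_output = None
        self.consecutive_skips = 0

    def __call__(self, model, x: torch.Tensor) -> torch.Tensor:
        if self.prev_input is not None and self.cached_output is not None:
            # Relative L1 change of the input compared with the previous step.
            rel_l1 = (x - self.prev_input).abs().mean() / self.prev_input.abs().mean()
            if rel_l1 < self.rel_l1_thresh and self.consecutive_skips < self.max_skip_steps:
                self.consecutive_skips += 1
                return self.cached_output        # skip the expensive model call
        self.prev_input = x.detach()
        self.cached_output = model(x).detach()   # recompute and refresh the cache
        self.consecutive_skips = 0
        return self.cached_output
```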

Actual Measurements!

Now, let’s examine the balance between speed and quality when varying rel_l1_thresh and max_skip_steps with Flux.1.

Conditions

  • blue_pencil-flux1-v1.0.0-BF16
  • euler
  • normal scheduler, 30 steps

max_skip_steps = 1

TeaCache (max_skip_1).png (1200×848)
  • X-axis: Generation time
  • Y-axis: Image quality (similarity to the original; see the sketch below for one way to compute such a score)
  • Small numbers indicate rel_l1_thresh settings
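
For reference, here is one way such an image-quality score can be computed with NumPy and scikit-image, using MAE and SSIM as in the later tables. The file names are placeholders, and this is not necessarily the exact method behind these measurements.

```python
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

def compare(path_a: str, path_b: str) -> tuple[float, float]:
    """Return (MAE, SSIM) between two same-sized images."""
    a = np.asarray(Image.open(path_a).convert("RGB"), dtype=np.float32) / 255.0
    b = np.asarray(Image.open(path_b).convert("RGB"), dtype=np.float32) / 255.0
    mae = float(np.abs(a - b).mean())
    ssim = float(structural_similarity(a, b, channel_axis=2, data_range=1.0))
    return mae, ssim

mae, ssim = compare("original.png", "teacache.png")  # hypothetical file names
print(f"MAE: {mae:.4f}  SSIM: {ssim:.4f}")
```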

max_skip_steps = 2

TeaCache (max_skip_2).png (1200×848)

max_skip_steps = 3

TeaCache (max_skip_3).png (1200×848)

The graphs show that higher positions indicate better quality, while moving left indicates faster generation.

In all graphs, image quality decreases proportionally to speed increases.

Comparing the max_skip_steps settings, the speed/quality balance doesn’t change significantly, so when using TeaCache, the maximum setting of max_skip_steps = 3 seems ideal.

Comparing Actual Images

From the graphs, TeaCache’s maximum settings reduce processing time to about one-third.

Now, let’s compare the actual illustrations to assess quality:

skip_3_rel_l1_0.png_vs_skip_3_rel_l1_1.png (1258×1600)

Left: Original image / Right: Image generated in one-third the time

Comparing the illustrations, the cached image shows inaccuracies in fingers and text, and the background is simplified.

While TeaCache offers significant speed improvements, whether the right image is acceptable depends on the use case.

Personally, I think maximizing TeaCache for rough sketches and avoiding caching for final outputs is a good approach.

Comparison with WaveSpeed’s First Block Cache

As a similar feature to TeaCache, I previously introduced WaveSpeed’s First Block Cache.

First Block Cache is an extended development of TeaCache and can be used with SDXL and SD 3.5, not just Flux.1.

First Block Cache allows finer adjustments than TeaCache, and with proper settings, it can reduce quality degradation more effectively.

Graph showing speed and similarity between MAE and SSIM under Dynamic Cashing RDT (start=0.2, end=0.8, max hits=5).png (1200×848)

However, at its maximum settings, First Block Cache doesn’t produce satisfactory images, so it requires more tuning time than TeaCache.

For ease of use, go with TeaCache; for pursuing both speed and quality, First Block Cache is a better choice.

Introducing torch.compile

ComfyUI-TeaCache also allows further speed improvements with the torch.compile function.

Since implementing torch.compile is somewhat complex, refer to Akkyoss’s article for details:

torch.compile compresses models into an optimized form before computation.

While the first image generation is slower with torch.compile, subsequent generations become faster.

torch.compile Settings

Nodes that support torch.compile include ComfyUI’s default TorchCompileModel node and ComfyUI-TeaCache’s Compile Model node.

In my environment, the TorchCompileModel node showed no noticeable effect, so here I’ll introduce the Compile Model node.

TorchCompileModel Screenshot of the filtered compile model node.png (1600×624)
Left: ComfyUI default node / Right: ComfyUI-TeaCache custom node

Setting Values

  • mode
    • default: Default optimization, balanced settings
    • max_autotune: Maximum performance
  • backend
    • inductor: PyTorch standard acceleration
    • cudagraph: NVIDIA CUDA-specific acceleration
  • fullgraph: Compiles the entire model as a single graph; faster at run time but takes longer to compile
  • dynamic: Allows variable-size inputs; supported only by the inductor backend
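
Assuming the Compile Model node essentially forwards these options to torch.compile, a minimal sketch of the underlying call looks like this (note that raw torch.compile spells the values "max-autotune" and "cudagraphs"):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU())  # stand-in model

# The four options exposed by the node map onto torch.compile's arguments.
compiled = torch.compile(
    model,
    mode="default",        # or "max-autotune"
    backend="inductor",    # or "cudagraphs"
    fullgraph=False,
    dynamic=False,
)

x = torch.randn(1, 64)
compiled(x)   # the first call triggers compilation, so it is slow
compiled(x)   # subsequent calls reuse the compiled kernels
```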

Speed Improvements with torch.compile

Here are the generation times for the second and subsequent images in my environment with the Compile Model node:

mode          backend    fullgraph  dynamic  Time (%)  MAE Quality  SSIM Quality
default       inductor   false      false    -0.1 %    -7.3 %       -12.3 %
default       inductor   true       false    -14.1 %   -7.4 %       -12.4 %
default       inductor   false      true     -14.1 %   -7.4 %       -12.4 %
default       inductor   true       true     -13.9 %   -7.4 %       -12.4 %
default       cudagraph  false      false    -14.0 %   -0.0 %       -0.0 %
default       cudagraph  true       false    -14.0 %   -0.0 %       -0.0 %
default       cudagraph  false      true     -14.9 %   -0.0 %       -0.0 %
default       cudagraph  true       true     -14.0 %   -0.0 %       -0.0 %
max_autotune  inductor   false      false    -0.6 %    -7.3 %       -12.3 %
max_autotune  inductor   true       false    -0.1 %    -7.3 %       -12.3 %
max_autotune  inductor   false      true     -0.0 %    -7.4 %       -12.3 %
max_autotune  inductor   true       true     -0.0 %    -7.3 %       -12.3 %
max_autotune  cudagraph  false      false    -0.0 %    -0.0 %       -0.0 %
max_autotune  cudagraph  true       false    -0.1 %    -0.0 %       -0.0 %
max_autotune  cudagraph  false      true     -0.0 %    -0.0 %       -0.0 %
max_autotune  cudagraph  true       true     -0.1 %    -0.0 %       -0.0 %

The Compile Model node achieved roughly a 14% speedup when mode was set to default, but showed almost no change with max_autotune.

When the backend was set to cudagraph, there was no quality degradation.

Using Sage Attention and FP8 Format

Lastly, I’ll introduce two additional speedup methods outside of ComfyUI-TeaCache.

Sage Attention

Sage Attention is a recently popular speedup technique implemented as a Python library.

For installation, refer to Akkyoss’s article mentioned earlier.

When running, specify --use-sage-attention at ComfyUI startup.

                 Time (%)  MAE Quality  SSIM Quality
Sage Attention   -16.6 %   -2.0 %       -2.2 %

Sage Attention achieves about a 16% speedup with only slight quality degradation.

Using FP8 Format

For users with NVIDIA GPUs from the RTX 4000 series onward, using the FP8 format instead of BF16 can speed things up.

  • FP8e4m3 format: Current mainstream
  • FP8e5m2 format: Lower precision, rarely used now

To use FP8e4m3 format, specify --fp8_e4m3fn-unet at ComfyUI startup.
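
As a tiny illustration of what FP8 storage means (this is not how ComfyUI implements the flag), the weights are simply kept in an 8-bit floating-point format, and the rounding in that cast is where the quality loss comes from.

```python
import torch

weight_bf16 = torch.randn(4096, 4096, dtype=torch.bfloat16)

# Store the weights in FP8 (e4m3); on RTX 4000-series and newer GPUs this
# halves memory traffic compared with BF16.
weight_fp8 = weight_bf16.to(torch.float8_e4m3fn)

# At compute time the FP8 weights are cast back up (or used with scaled matmuls);
# the difference below is the rounding error introduced by the FP8 cast.
restored = weight_fp8.to(torch.bfloat16)
print((weight_bf16 - restored).abs().mean())
```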

          Time (%)  MAE Quality  SSIM Quality
BF16      0 %       -0.0 %       -0.0 %
FP8e4m3   -17.1 %   -7.1 %       -11.7 %
FP8e5m2   -17.1 %   -7.5 %       -11.6 %

Using FP8 format significantly speeds up processing, but quality degrades accordingly.

Note that specifying FP8 format on an unsupported GPU may slow down processing instead.

Reference: Quality Comparison of FP8 Formats

Summary: The Tradeoff Between Speed and Quality

Method              Speedup  Degradation
ComfyUI-MultiGPU    🔵       None
ComfyUI-TeaCache    🔵       +++
ComfyUI-WaveSpeed   🔵       +++
torch.compile       🔺       None to ++
Sage Attention      🔺       +
FP8 Format          🔺       +++
  • ComfyUI-MultiGPU is an essential node
  • TeaCache and WaveSpeed offer significant speedups but reduce quality
  • torch.compile can avoid degradation with proper settings

Among the custom nodes introduced, ComfyUI-MultiGPU has no downsides, making it a must-have.

TeaCache and WaveSpeed, which use caching, offer substantial benefits but degrade quality, so they’re great for rough sketches.

For torch.compile, I recommend trying it in your environment and adopting it if there’s no quality loss.

an animated female character with silver hair purple eyes and a shimmering blue and purple outfit stands against a dark background with swirling purple and blue patterns exuding a dreamy and ethereal .png

Since the methods introduced work through different mechanisms, combining them can yield synergistic effects.

Many speedup features are described as “speeding up at the cost of slight quality degradation,” but it’s important to understand their actual impact in your environment.

I’ve also made available a page for numerically comparing image differences used in this analysis, so feel free to try it out:

Thank you for reading to the end!


Reference: Measurement Data