Is Your Illustration High-Quality? Comparing T5xxl and CLIP-L with Real Data!

Anime illustration of a girl with white hair and blue eyes wearing an orange ski outfit with a blackboard with “Flan T5xxl” written on her chest looking at us with a smile on a ski slope
  • A new image comparison tool is available.
  • Improved CLIP-L in FP32 format is highly recommended.
  • Choose Flan-T5xxl based on your system RAM capacity.

Introduction

Hello, I’m Easygoing.

Today, let’s explore how to compare illustrations.

Theme: Winter Sports

This time, the theme is winter sports.

Anime illustration of a girl with brown hair and blue eyes looking at you with a smile on a ski slope at sunset.png (2568×2568)

I’ll depict a scene of enjoying time with friends at a ski resort in an illustration.

Image Difference Checker

To objectively evaluate image differences, I’ve created a new web page.

Here’s the page I made:

This page takes two images as input and outputs the following:

T5xxl_FP16_vs_Flan-T5xxl-FP32.png (2065×2710)
  • Difference map (color and grayscale)
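  • Mean Absolute Error (MAE): the average absolute difference between corresponding pixel values (0 means identical)
  • Structural Similarity Index (SSIM): a score comparing brightness, contrast, and structure (1 means identical)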

By comparing these maps and numbers, you can objectively assess differences between two images.
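To make these outputs concrete, here is a minimal Python sketch of a difference map and MAE. It is only an illustration, not the code behind the page: it assumes both images are already loaded as rows of 8-bit RGB tuples, that the grayscale map is the average of the three channel differences, and all function names are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # one RGB pixel, each channel in 0..255
Image = List[List[Pixel]]      # rows of pixels


def color_diff(a: Image, b: Image) -> Image:
    """Per-channel absolute difference; identical pixels come out black."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    return [
        [tuple(abs(ca - cb) for ca, cb in zip(pa, pb)) for pa, pb in zip(ra, rb)]
        for ra, rb in zip(a, b)
    ]


def gray_diff(a: Image, b: Image) -> List[List[float]]:
    """Grayscale map: channel-averaged absolute difference (an assumption)."""
    return [[sum(p) / 3 for p in row] for row in color_diff(a, b)]


def mean_absolute_error(a: Image, b: Image) -> float:
    """Average of the grayscale difference map over all pixels."""
    values = [v for row in gray_diff(a, b) for v in row]
    return sum(values) / len(values)


# Tiny 2x2 example images (made-up values).
base = [[(10, 20, 30), (0, 0, 0)], [(255, 255, 255), (100, 100, 100)]]
other = [[(10, 20, 30), (3, 0, 0)], [(255, 255, 252), (100, 106, 100)]]

print(color_diff(base, other))
print(mean_absolute_error(base, other))
```

Comparing an image with itself gives an all-black map and an MAE of 0, which is why a baseline compared against itself shows nothing.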

Note that this tool detects "differences," so determining which illustration is better requires human judgment.

Let’s Compare!

Now, let’s use this tool.

Since I’ve been writing about text encoders lately, I’ll compare text encoders again this time.


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Flux.1
D1(CLIP-L)
D2(T5xxl)
D3(Transformer)
end
X1-->D1
X1-->D2
D2-->D3
D1-->D3

Flan-T5xxl: The Latest Version of T5xxl

  • An improved version of T5xxl_v1.1
  • Enhanced performance through additional training with instructions and answers
  • Expected to improve prompt comprehension

First, let’s compare Flan-T5xxl.

T5xxl is the text encoder that interprets the context of a prompt. Flan-T5xxl is an enhanced version of it, improved through additional training with instructions and answers.

Since image generation prompts are treated as instructions, this training should improve prompt fidelity.

How does Flan-T5xxl’s precision vary with compression formats?

Flan-T5xxl-FP32 (32-bit)

Flan-T5xxl-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

Flan-T5xxl-FP32 is the highest-precision version available online, capable of producing incredibly detailed illustrations.

This serves as the baseline for our comparisons. The black image on the right is the color difference map; it is completely black here because the baseline is being compared with itself.

Flan-T5xxl-FP16 (16-bit)

Flan-T5xxl-FP16_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color-Diff_Flan-T5xxl-FP16_LongCLIP-SAE-ViT-L-14-FP32_Flan-T5xxl-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

There are minimal differences compared to FP32.

Flan-T5xxl-Q8_0.gguf (8-bit)

Flan-T5xxl-Q8_0.gguf_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_Diff_Flan-T5xxl-Q8_0.gguf_LongCLIP-SAE-ViT-L-14-FP32_Flan-T5xxl-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

The illustration is quite similar, but there’s a subtle difference in the depiction of the right hand's fingers.

Flan-T5xxl-Q5_K_M (5-bit)

Q5_K_M shows noticeable overall differences, with visible degradation.

Flan-T5xxl-Q5_K_M.gguf_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_Diff_Flan-T5xxl-Q5_K_M.gguf_LongCLIP-SAE-ViT-L-14-FP32_Flan-T5xxl-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

Flan-T5xxl-Q3_K_L (3-bit)

Flan-T5xxl-Q3_K_L.gguf_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_Diff_Flan-T5xxl-Q3_K_L.gguf_LongCLIP-SAE-ViT-L-14-FP32_Flan-T5xxl-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

At Q3_K_L compression, the text and other details change significantly.

Results: Graph Representation

Flan-T5xxl MAE and SSIM Similarity.png (1200×848)
Flan-T5xxl   Size (GB)   MAE     MAE Similarity   SSIM   SSIM Similarity
FP32         45.2        0.00    100.0 %          1.00   100.0 %
FP16         22.6        0.96    99.6 %           1.00   99.9 %
Q8_0         11.8        1.26    99.5 %           1.00   99.8 %
Q6_K         9.2         1.57    99.4 %           1.00   99.7 %
Q5_K_M       8.0         4.62    98.2 %           0.98   98.4 %
Q4_K_M       6.9         9.08    96.5 %           0.95   95.2 %
Q3_K_L       5.7         17.11   93.3 %           0.85   84.9 %
Q2_K         4.1         11.93   95.3 %           0.94   93.6 %
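
The MAE Similarity column appears to follow 100 × (1 − MAE / 255), where 255 is the largest possible error for 8-bit pixel values. This formula is an assumption inferred from the numbers, not something the tool states. It reproduces most rows after rounding, but Q4_K_M comes out at 96.4 rather than 96.5, so treat it as an approximation. A short sketch:

```python
def mae_similarity(mae: float, max_value: float = 255.0) -> float:
    """Convert an MAE on a 0..max_value scale into a similarity percentage.

    Assumed formula: 100 * (1 - MAE / max_value).
    """
    return 100.0 * (1.0 - mae / max_value)


# MAE values copied from the Flan-T5xxl table above.
for name, mae in [("FP16", 0.96), ("Q8_0", 1.26), ("Q5_K_M", 4.62), ("Q3_K_L", 17.11)]:
    print(f"{name}: {mae_similarity(mae):.1f} %")
```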

With Flan-T5xxl, precision generally drops as compression shrinks the file. The one exception in this test is Q2_K, which scored closer to FP32 than Q3_K_L did.

Degradation becomes noticeable at Q5_K_M and below, so it’s best to use Q6_K or higher if possible.
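
As a rough way to match a format to your machine, here is a small sketch that picks the largest Flan-T5xxl file fitting within a RAM budget. The file sizes come from the table above; the `reserved_gb` headroom for the operating system and the rest of the pipeline is a made-up placeholder, and `pick_format` is a hypothetical helper, so adjust both to your own setup.

```python
from typing import Optional

# File sizes in GB, copied from the Flan-T5xxl table above.
FLAN_T5XXL_SIZES_GB = {
    "FP32": 45.2, "FP16": 22.6, "Q8_0": 11.8, "Q6_K": 9.2,
    "Q5_K_M": 8.0, "Q4_K_M": 6.9, "Q3_K_L": 5.7, "Q2_K": 4.1,
}


def pick_format(ram_gb: float, reserved_gb: float = 8.0) -> Optional[str]:
    """Return the largest format whose file fits in ram_gb - reserved_gb.

    reserved_gb is an assumed allowance for everything else in memory.
    Returns None when nothing fits.
    """
    budget = ram_gb - reserved_gb
    fitting = {name: size for name, size in FLAN_T5XXL_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


for ram in (64, 32, 20, 8):
    print(ram, "GB RAM ->", pick_format(ram))
```

Choosing by size is only a heuristic: in the table, Q2_K actually scored better than the larger Q3_K_L.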

T5xxl_v1.1: The Default T5xxl

  • A version of T5xxl from 2021
  • Standard in Flux.1 and SD 3.5
  • The most widely used in image generation

Next, let’s compare T5xxl_v1.1.

T5xxl_v1.1 is an older version than Flan-T5xxl but is likely the most widely used since it’s officially distributed.

Both FP8 and GGUF formats are available for T5xxl_v1.1, so I’ll compare them.

T5xxl_v1.1-FP32

T5xxl-v1_1_original-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

The original FP32 version is publicly available on Google’s Hugging Face page.

T5xxl_v1.1-FP16

T5xxl-v1_1_original_FP16_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_Diff_T5xxl-v1_1_FP16_LongCLIP-SAE-ViT-L-14-FP32_T5xxl-v1_1_original-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

The FP16 version, distributed via ComfyUI’s Hugging Face page, is likely the most commonly used in Flux.1.

Compared to FP32, there are minor differences.

T5xxl_v1.1-FP8e4m3fn_scaled (8-bit)

T5xxl-v1_1_FP8e4mefn_scaled_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_Diff_T5xxl-v1_1_FP8e4mefn_scaled_LongCLIP-SAE-ViT-L-14-FP32_T5xxl-v1_1_original-FP32_LongCLIP-SAE-ViT-L-14-FP32.png.png (1440×1440)

This version, also from ComfyUI’s page, adds scaling to the FP8e4m3fn format to improve precision.

T5xxl_v1.1-FP8e4m3fn (8-bit)

T5xxl-v1_1_FP8e4mefn_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_DiffT5xxl-v1_1_FP8e4mefn_LongCLIP-SAE-ViT-L-14-FP32_T5xxl-v1_1_original-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

The plain FP8e4m3fn format, without scaling, shows noticeable degradation.

T5xxl_v1.1-Q8_0.gguf (8-bit)

T5xxl-v1_1_Q8_0_gguf_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_Diff_T5xxl-v1_1_Q8_0_gguf_LongCLIP-SAE-ViT-L-14-FP32_T5xxl-v1_1_original-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

This lightweight GGUF version, shared by city96, is also widely used.

T5xxl_v1.1-Q5_K_M.gguf (5-bit)

T5xxl-v1_1_Q5_K_M_gguf_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_Diff_T5xxl-v1_1_Q5_K_M_gguf_LongCLIP-SAE-ViT-L-14-FP32_T5xxl-v1_1_original-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

The 5-bit GGUF format, recommended by city96, represents the lowest acceptable quality for many users.

T5xxl_v1.1-Q3_K_L.gguf (3-bit)

T5xxl-v1_1_Q3_K_L_gguf_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_Diff_T5xxl-v1_1_Q3_K_L_gguf_LongCLIP-SAE-ViT-L-14-FP32_T5xxl-v1_1_original-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

A 3-bit GGUF format.

At this level of compression, performance drops are clear in image generation, though it’s still considered usable for large language models.

Results Graph!

T5xxl_v1.1 MAE and SSIM Similarity.png (1200×848)
T5xxl_v1.1          Size (GB)   MAE    MAE Similarity   SSIM   SSIM Similarity
FP32                44.6        0.00   100.0 %          1.00   100.0 %
FP16                9.8         1.87   99.3 %           1.00   99.7 %
FP8_e4m3fn_scaled   5.2         3.82   98.5 %           0.99   99.1 %
FP8_e4m3fn          4.9         6.31   97.5 %           0.98   97.6 %
Q8_0                5.1         4.88   98.1 %           0.99   98.8 %
Q5_K_M              3.4         5.12   98.0 %           0.98   98.1 %
Q3_K_L              2.5         5.25   97.9 %           0.98   98.5 %

Actual Precision Ranking

  1. FP32
  2. FP16
  3. FP8e4m3fn_scaled
  4. Q8_0
  5. Q5_K_M
  6. Q3_K_L
  7. FP8e4m3fn

T5xxl_v1.1 yielded some surprising results.

While GGUF formats are generally said to outperform FP8 at 8-bit, FP8e4m3fn_scaled’s recalibration significantly boosts its performance, surpassing Q8_0.gguf.

Meanwhile, the unadjusted FP8e4m3fn format performs worse than Q3_K_L, reinforcing GGUF’s precision advantage.

This shows that recalibrating a model after quantizing it can drastically affect precision.
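
The precision ranking above matches what you get by sorting the T5xxl_v1.1 table by MAE. Here is a small sketch using the table values. That the ranking was made by MAE is my reading of the numbers, and sorting by SSIM similarity instead swaps Q5_K_M and Q3_K_L.

```python
# (MAE, SSIM similarity in %) copied from the T5xxl_v1.1 table.
T5XXL_V11 = {
    "FP32": (0.00, 100.0),
    "FP16": (1.87, 99.7),
    "FP8e4m3fn_scaled": (3.82, 99.1),
    "FP8e4m3fn": (6.31, 97.6),
    "Q8_0": (4.88, 98.8),
    "Q5_K_M": (5.12, 98.1),
    "Q3_K_L": (5.25, 98.5),
}

# Lower MAE means closer to the FP32 baseline.
by_mae = sorted(T5XXL_V11, key=lambda name: T5XXL_V11[name][0])

# Higher SSIM similarity means closer to the FP32 baseline.
by_ssim = sorted(T5XXL_V11, key=lambda name: T5XXL_V11[name][1], reverse=True)

print("By MAE: ", by_mae)
print("By SSIM:", by_ssim)
```

Either way, FP8e4m3fn_scaled lands ahead of Q8_0 and the plain FP8e4m3fn comes last.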

CLIP-L Comparison

Next, let’s compare CLIP-L. Though smaller in size than T5xxl, CLIP-L directly links text and images, so it impacts illustrations more significantly.

First, I’ll compare the FP32 and FP16 formats of each CLIP-L variant.

Long-CLIP-ViT-L-14-GmP-SAE (Released December 19, 2024!)

Long-CLIP-ViT-L-14-GmP-SAE_Flan-T5xxl-FP32.png (2065×2710)
Left: FP32 format, Right: FP16 format

The Long-CLIP-ViT-L-14-GmP-SAE model is the latest Long-CLIP-L, released on December 19, 2024.

The SAE (Sparse Autoencoder) model is a bold departure from previous versions. While benchmark scores are slightly lower, it’s designed to enhance creativity.

Both formats produce clear, beautiful illustrations, but there’s a noticeable precision difference between FP32 and FP16.

Note that Long-CLIP-L models are currently only usable in ComfyUI.

Also, using FP32 text encoders requires the --fp32-text-enc setting I’ve covered before.

How to Use Enhanced CLIP-L and Flan-T5xxl

CLIP-SAE-GmP-ViT-L-14 (Released December 8, 2024!)

CLIP-ViT-L-14-GmP-SAE_Flan-T5xxl-FP32.png (2065×2710)

CLIP-SAE-GmP-ViT-L-14 is an SAE model of CLIP-L, released on December 8, 2024.

This one works in both ComfyUI and Stable Diffusion webUI Forge.

Again, there’s a difference between FP32 and FP16 formats.

Standard CLIP-L

CLIP-L_Flan-T5xxl-FP32.png (2065×2645)

The officially distributed standard CLIP-L.

It also shows differences between FP32 and FP16 formats.

Enhanced CLIP-L vs. Standard CLIP-L

Finally, let’s directly compare enhanced CLIP-L and standard CLIP-L.

CLIP-L-FP32_Flan-T5xxl-FP32_CLIP-ViT-L-14-GmP-SAE-FP32_Flan-T5xxl-FP32.png (1219×1600)

Comparing the illustrations, the enhanced CLIP-L on the left is clearer overall with finer details, especially around the ski board area.

Of the two metrics calculated, SSIM is said to correlate well with human perception, and it shows a significant difference between the two images.
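
For reference, here is a simplified sketch of the SSIM formula in Python. Standard SSIM is computed over small sliding windows and then averaged. This version treats the whole grayscale image as a single window, so its numbers will not match a real SSIM implementation. The constants follow the commonly used defaults (K1 = 0.01, K2 = 0.03), and `global_ssim` is a hypothetical name.

```python
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], dynamic_range: float = 255.0) -> float:
    """Single-window SSIM over two flattened grayscale images.

    Simplification: one global window instead of local sliding windows.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((v - mean_x) ** 2 for v in x) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilizes the brightness term
    c2 = (0.03 * dynamic_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


ramp = [0, 50, 100, 150, 200, 250]
print(global_ssim(ramp, ramp))        # identical images score 1 (up to rounding)
print(global_ssim(ramp, ramp[::-1]))  # reversed structure scores far lower
```

Unlike MAE, the score depends on how pixel values vary together, which is why two images with a similar average error can still get different SSIM values.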

Though much smaller than T5xxl, CLIP-L directly affects image generation, greatly influencing quality.

Text Encoders Are Worth Using in FP32 Format

This experiment showed precision differences between FP32 and FP16 formats across all text encoders.
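
To give a feel for how much precision each format keeps, here is a small sketch that rounds a single number to FP16 and FP32 using Python's standard `struct` module. It only illustrates the number formats themselves. The sample value is arbitrary, and nothing here reproduces how the text encoders actually compute.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Store a Python float in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def to_fp16(value: float) -> float:
    return round_trip(value, "<e")  # IEEE 754 half precision (16-bit)


def to_fp32(value: float) -> float:
    return round_trip(value, "<f")  # IEEE 754 single precision (32-bit)


weight = 0.1234567  # arbitrary example value
err16 = abs(to_fp16(weight) - weight)
err32 = abs(to_fp32(weight) - weight)

print(f"FP16 stores {to_fp16(weight)!r}, error {err16:.2e}")
print(f"FP32 stores {to_fp32(weight)!r}, error {err32:.2e}")
```

The FP16 rounding error is several orders of magnitude larger than the FP32 one. Across billions of weights and many layers, small errors like this can add up to the visible differences seen in the comparisons.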

Anime illustration of a girl with brown hair and blue eyes holding a ski board with the word “Flan” written on it, looking surprised at a ski slope at sunset.png (2568×2568)

Upgrading to enhanced CLIP-L and Flan-T5xxl has significant benefits with no downsides, so I highly recommend trying them!

Conclusion: Text Encoders Make a Huge Difference!

  • Released a page to compare images numerically
  • Recommend FP32 format for enhanced CLIP-L
  • Choose Flan-T5xxl based on system RAM capacity

By quantifying image differences, we can now compare model precision more accurately.

This experiment revealed that text encoders significantly affect image quality.

Since prompt encoding is the earliest, most upstream step in image generation, errors here can be amplified downstream.

Anime illustration of a girl with orange hair and blue eyes smiling and looking at you with her friends in front of a snowy cabin.png (2568×2568)

Even on lower-spec PCs, targeted upgrades like enhanced CLIP-L or FP32 formats can greatly improve quality.

I’m grateful that Flan-T5xxl and enhanced CLIP-L are open-source and look forward to enjoying image generation with them.

Thank you for reading to the end!