Is Your Illustration High-Quality? Comparing T5xxl and CLIP-L with Real Data!


- A new image comparison tool is available.
- Improved CLIP-L in FP32 format is highly recommended.
- Choose Flan-T5xxl based on your system RAM capacity.
Introduction
Hello, I’m Easygoing.
Today, let’s explore how to compare illustrations.
Theme: Winter Sports
This time, the theme is winter sports.

The illustration depicts a scene of friends enjoying time together at a ski resort.
Image Difference Checker
To objectively evaluate image differences, I’ve created a new web page.
This page takes two images as input and outputs the following:

- Difference map (color and grayscale)
- Mean Absolute Error (MAE): the average per-pixel absolute difference
- Structural Similarity Index (SSIM): a measure of how structurally similar the two images are
By comparing these maps and numbers, you can objectively assess differences between two images.
Note that this tool detects "differences," so determining which illustration is better requires human judgment.
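The actual page's implementation may differ, but as a rough sketch of what these metrics involve, here is how one could compute them in Python (assuming Pillow, NumPy, and scikit-image are installed):

```python
# Minimal sketch of the two metrics the page reports.
# Assumes Pillow, NumPy, and scikit-image; the actual tool may differ.
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

def compare(path_a: str, path_b: str):
    a = np.asarray(Image.open(path_a).convert("RGB"), dtype=np.float64)
    b = np.asarray(Image.open(path_b).convert("RGB"), dtype=np.float64)
    # Mean Absolute Error: average per-pixel absolute difference (0-255 scale)
    mae = np.mean(np.abs(a - b))
    # SSIM over the color image (1.0 = structurally identical)
    ssim = structural_similarity(a, b, channel_axis=2, data_range=255)
    # Difference map: per-pixel absolute difference, viewable as an image
    diff = np.abs(a - b).astype(np.uint8)
    return mae, ssim, diff

mae, ssim, _ = compare("image_a.png", "image_b.png")
print(f"MAE:  {mae:.2f}")
print(f"SSIM: {ssim:.3f}")
```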
Let’s Compare!
Now, let’s use this tool.
Since I’ve been writing about text encoders lately, I’ll compare text encoders again this time.
```mermaid
flowchart LR
  subgraph Input
    X1(Prompt)
  end
  subgraph "Flux.1"
    D1(CLIP-L)
    D2(T5xxl)
    D3(Transformer)
  end
  X1 --> D1
  X1 --> D2
  D2 --> D1
  D1 --> D3
```
Flan-T5xxl: The Latest Version of T5xxl
- An improved version of T5xxl_v1.1
- Enhanced performance through additional training with instructions and answers
- Expected to improve prompt comprehension
First, let’s compare Flan-T5xxl.
Flan-T5xxl is an enhanced version of T5xxl, the encoder that interprets prompt context, improved through additional training on instruction-and-answer pairs.
Since image generation prompts are treated as instructions, this training should improve prompt fidelity.
How does Flan-T5xxl’s precision vary with compression formats?
Flan-T5xxl-FP32 (32-bit)

Flan-T5xxl-FP32 is the highest-precision version available online, capable of producing incredibly detailed illustrations.
This serves as the baseline for our comparisons. The black image on the right represents the colored difference map.
Flan-T5xxl-FP16 (16-bit)


There are minimal differences compared to FP32.
Flan-T5xxl-Q8_0.gguf (8-bit)


The illustration is quite similar, but there’s a subtle difference in the depiction of the right hand's fingers.
Flan-T5xxl-Q5_K_M (5-bit)
Q5_K_M shows noticeable overall differences, with visible degradation.


Flan-T5xxl-Q3_K_L (3-bit)


At Q3_K_L compression, the text and other details change significantly.
Results: Graph Representation

| Flan-T5xxl | Size (GB) | MAE | Similarity (MAE) | SSIM | Similarity (SSIM) |
|---|---|---|---|---|---|
| FP32 | 45.2 | 0.00 | 100.0 % | 1.00 | 100.0 % |
| FP16 | 22.6 | 0.96 | 99.6 % | 1.00 | 99.9 % |
| Q8_0 | 11.8 | 1.26 | 99.5 % | 1.00 | 99.8 % |
| Q6_K | 9.2 | 1.57 | 99.4 % | 1.00 | 99.7 % |
| Q5_K_M | 8.0 | 4.62 | 98.2 % | 0.98 | 98.4 % |
| Q4_K_M | 6.9 | 9.08 | 96.5 % | 0.95 | 95.2 % |
| Q3_K_L | 5.7 | 17.11 | 93.3 % | 0.85 | 84.9 % |
| Q2_K | 4.1 | 11.93 | 95.3 % | 0.94 | 93.6 % |
With Flan-T5xxl, precision falls in step with the file-size reduction from compression.
Degradation becomes noticeable below Q5_K_M, so it’s best to use Q6_K or higher if possible.
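If you are unsure which quantization a downloaded GGUF file actually uses, the gguf Python package from the llama.cpp project can read the file's metadata. Here is a minimal sketch (the file name is hypothetical):

```python
# Minimal sketch: inspect a GGUF file's quantization types.
# Requires `pip install gguf`; the file name below is hypothetical.
from gguf import GGUFReader

reader = GGUFReader("flan-t5xxl-Q6_K.gguf")

# Each tensor records its own quantization type; K-quants mix types,
# so a "Q6_K" file typically contains more than one.
types = {t.tensor_type.name for t in reader.tensors}
print("Quantization types in file:", sorted(types))
```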
T5xxl_v1.1: The Default T5xxl
- A version of T5xxl from 2021
- Standard in Flux.1 and SD 3.5
- The most widely used in image generation
Next, let’s compare T5xxl_v1.1.
T5xxl_v1.1 is an older version than Flan-T5xxl but is likely the most widely used since it’s officially distributed.
Both FP8 and GGUF formats are available for T5xxl_v1.1, so I’ll compare them.
T5xxl_v1.1-FP32

The original FP32 version is publicly available on Google’s Hugging Face page.
T5xxl_v1.1-FP16


The FP16 version, distributed via ComfyUI’s Hugging Face page, is likely the most commonly used in Flux.1.
Compared to FP32, there are minor differences.
T5xxl_v1.1-FP8e4m3fn_scaled (8-bit)


This version, also from ComfyUI’s page, adds scaling on top of the plain FP8e4m3 format to improve accuracy.
T5xxl_v1.1-FP8e4m3fn (8-bit)


The simpler FP8e4m3 format shows noticeable degradation.
T5xxl_v1.1-Q8_0.gguf (8-bit)


This lightweight GGUF version, shared by city96, is also widely used.
T5xxl_v1.1-Q5_K_M.gguf (5-bit)


The 5-bit GGUF format, recommended by city96, represents the lowest acceptable quality for many users.
T5xxl_v1.1-Q3_K_L.gguf (3-bit)


A 3-bit GGUF format.
At this level of compression, performance drops are clear in image generation, though it’s still considered usable for large language models.
Results Graph!

| T5xxl_v1.1 | Size (GB) | MAE | Similarity (MAE) | SSIM | Similarity (SSIM) |
|---|---|---|---|---|---|
| FP32 | 44.6 | 0.00 | 100.0 % | 1.00 | 100.0 % |
| FP16 | 9.8 | 1.87 | 99.3 % | 1.00 | 99.7 % |
| FP8_e4m3fn_scaled | 5.2 | 3.82 | 98.5 % | 0.99 | 99.1 % |
| FP8_e4m3fn | 4.9 | 6.31 | 97.5 % | 0.98 | 97.6 % |
| Q8_0 | 5.1 | 4.88 | 98.1 % | 0.99 | 98.8 % |
| Q5_K_M | 3.4 | 5.12 | 98.0 % | 0.98 | 98.1 % |
| Q3_K_L | 2.5 | 5.25 | 97.9 % | 0.98 | 98.5 % |
Actual Precision Ranking
1. FP32
2. FP16
3. FP8e4m3fn_scaled
4. Q8_0
5. Q5_K_M
6. Q3_K_L
7. FP8e4m3fn
T5xxl_v1.1 yielded some surprising results.
While GGUF formats are generally said to outperform FP8 at 8-bit, FP8e4m3fn_scaled’s recalibration significantly boosts its performance, surpassing Q8_0.gguf.
Meanwhile, the unadjusted FP8e4m3fn format performs worse than Q3_K_L, reinforcing GGUF’s precision advantage.
This shows that recalibration after compressing a model can drastically affect performance.
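As a toy illustration of why scaling matters, here is a PyTorch sketch comparing a naive FP8 cast against a per-tensor scaled cast. The actual recipe behind the _scaled checkpoint isn't documented here, so treat this as the general idea rather than the exact method:

```python
# Toy sketch: naive FP8 (e4m3fn) cast vs. a per-tensor scaled cast.
# Assumes PyTorch 2.1+; the weight tensor here is synthetic.
import torch

w = torch.randn(4096, 4096) * 0.02  # stand-in for a small-magnitude weight tensor

# Naive cast: small values fall into FP8's subnormal range and lose precision
naive = w.to(torch.float8_e4m3fn).to(torch.float32)

# Scaled cast: rescale into FP8's well-represented range, store the scale,
# then divide it back out when the weight is used
scale = w.abs().max() / 448.0  # 448 is the largest normal value of e4m3fn
scaled = (w / scale).to(torch.float8_e4m3fn).to(torch.float32) * scale

print("naive  MAE:", (w - naive).abs().mean().item())
print("scaled MAE:", (w - scaled).abs().mean().item())
```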
CLIP-L Comparison
Next, let’s compare CLIP-L. Though smaller in size than T5xxl, CLIP-L directly links text and images, so it impacts illustrations more significantly.
First, I’ll compare the FP32 and FP16 formats of each CLIP-L variant.
Long-CLIP-ViT-L-14-GmP-SAE (Released December 19, 2024!)

The Long-CLIP-ViT-L-14-GmP-SAE model is the latest Long-CLIP-L, released on December 19, 2024.
The SAE (Sparse Autoencoder) model is a bold departure from previous versions: while its benchmark scores are slightly lower, it’s designed to enhance creativity.
Both formats produce clear, beautiful illustrations, but there’s a noticeable precision difference between FP32 and FP16.
Note that Long-CLIP-L models are currently only usable in ComfyUI.
Also, using FP32 text encoders requires the --fp32-text-enc setting I’ve covered before.
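In practice, that just means adding the flag when launching ComfyUI, e.g. `python main.py --fp32-text-enc` for a standard installation.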
How to Use Enhanced CLIP-L and Flan-T5xxl
CLIP-SAE-GmP-ViT-L-14 (Released December 8, 2024!)

CLIP-SAE-GmP-ViT-L-14 is an SAE model of CLIP-L, released on December 8, 2024.
This one works in both ComfyUI and Stable Diffusion webUI Forge.
Again, there’s a difference between FP32 and FP16 formats.
Standard CLIP-L

The officially distributed standard CLIP-L.
It also shows differences between FP32 and FP16 formats.
Enhanced CLIP-L vs. Standard CLIP-L
Finally, let’s directly compare enhanced CLIP-L and standard CLIP-L.

Comparing the illustrations, the enhanced CLIP-L on the left is clearer overall, with finer details, especially around the skis.
Of the two metrics, SSIM is said to correlate well with human perception, and it differs significantly between these two images.
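For reference, SSIM compares the two images patch by patch in terms of luminance, contrast, and structure; the standard definition is

$$
\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
$$

where $\mu$ and $\sigma^2$ are local means and variances, $\sigma_{xy}$ is the covariance, and $C_1, C_2$ are small stabilizing constants; a score of 1.0 means the two images are structurally identical.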
Though much smaller than T5xxl, CLIP-L directly affects image generation, greatly influencing quality.
Text Encoders Are Worth Using in FP32 Format
This experiment showed precision differences between FP32 and FP16 formats across all text encoders.

Upgrading to CLIP-L and Flan-T5xxl has significant benefits with no downsides, so I highly recommend trying them!
Conclusion: Text Encoders Make a Huge Difference!
- Released a page to compare images numerically
- Recommend FP32 format for enhanced CLIP-L
- Choose Flan-T5xxl based on system RAM capacity
By quantifying image differences, we can now compare model precision more accurately.
This experiment revealed that text encoders significantly affect image quality.
Since prompt encoding is the first step in the image-generation pipeline, errors introduced there may be amplified downstream.

Even on lower-spec PCs, targeted upgrades like enhanced CLIP-L or FP32 formats can greatly improve quality.
I’m grateful that Flan-T5xxl and enhanced CLIP-L are open-source and look forward to enjoying image generation with them.
Thank you for reading to the end!