What Are CLIP and T5xxl? How Text Encoders Can Make Illustrations Stunning!

Anime illustration of a sparrow eating bread on a girl's hand 5.png (1600×1600)
  • CLIP is the foundational technology for image generation.
  • T5xxl enhances prompt comprehension.
  • Enhanced text encoders are publicly available and ready to use.

Introduction

Hello, I’m Easygoing.

Today, let’s dive into the topic of text encoders in image generation AI.

Anime illustration of a sparrow eating bread on a girl's hand 7.png (2576×2576)

Text Encoders Are Like Dictionaries

AI takes the text we input and converts it into a format that machines can understand.


flowchart LR
subgraph Prompt
A1(Text)
end
subgraph Text Encoder
B1(Words)
B2(Tokens)
B3(Vectors)
end
subgraph Transformer / UNET
C1(Generate Image)
end
A1-->B1
B1-->B2
B2-->B3
B3-->C1

This conversion is handled by the text encoder, which acts like a dictionary translating human language into machine language.
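As a concrete illustration, here is a minimal sketch of that pipeline using the Hugging Face transformers library with the standard CLIP-L checkpoint (the model ID and output shapes are assumptions based on the public openai/clip-vit-large-patch14 weights; image generation UIs perform the same steps internally):

from transformers import CLIPTokenizer, CLIPTextModel

# Minimal sketch of the text encoder stage: text -> words -> tokens -> vectors
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "anime illustration of a sparrow eating bread on a girl's hand"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")
print(tokens.input_ids.shape)   # (1, 77) token IDs

vectors = text_encoder(**tokens).last_hidden_state
print(vectors.shape)            # (1, 77, 768) per-token vectors handed to the UNET / Transformer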

How does changing the text encoder in image generation AI affect image quality?

Comparing Real Images!

Let’s take a look at how images change with Flux.1, a new image generation AI.

Flux.1 comes equipped with two types of text encoders.


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Flux.1
D1(CLIP-L)
D2(T5xxl)
D3(Transformer)
end
X1-->D1
X1-->D2
D2-->D1
D1-->D3

  • T5xxl: Understands the context of prompts
  • CLIP-L: Converts words into vectors

This time, we’ll replace T5xxl and CLIP-L with higher-precision versions.

T5xxl-FP16 + CLIP-L-FP16

T5xxl-FP16.png (1440×1440)
Original

Flan-T5xxl-FP16 + CLIP-L-FP16

Flan-FP16.png (1440×1440)
Improved prompt effectiveness

Flan-T5xxl-FP32 + CLIP-L-FP32

Flan_FP32_CLIP-L_FP32.png (1440×1440)
Enhanced image quality

Flan-T5xxl-FP32 + CLIP-GmP-ViT-L-14-FP32

Flan_FP32_CLIP-GmP_L_FP32.png (1440×1440)
Better background detail

Flan-T5xxl-FP32 + Long-CLIP-GmP-ViT-L-14-FP32

Flan_FP32_LongCLIP-GmP_L_FP32.png (1440×1440)
Even more detailed background

The text encoders used here improve in performance as you move down the list.

Changing the text encoder notably enhances the details of the buildings on the right, improving overall image quality.

Note that the Long-CLIP-L model at the bottom can be used in ComfyUI but not in Stable Diffusion WebUI Forge.

Also, using FP32 text encoders requires the --fp32-text-enc setting, which we’ll cover later.

A Closer Look at Text Encoders

Let’s explore text encoders in more detail.

First, here are the text encoders typically found in major image generation AIs.


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Image Generation
Y1(UNET)
Y2(Transformer)
end
subgraph Flux.1
D1(CLIP-L)
D2(T5xxl)
end
subgraph Stable Diffusion 3
C1(CLIP-L)
C2(CLIP-G)
C3(T5xxl)
end
subgraph Stable Diffusion XL
B1(CLIP-L)
B2(CLIP-G)
end
subgraph Stable Diffusion 1
A1(CLIP-L)
end
X1-->A1
X1-->B1
X1-->B2
X1-->C1
X1-->C2
X1-->C3
X1-->D1
X1-->D2
C3-->C1
C3-->C2
D2-->D1
A1-->Y1
B1-->Y1
B2-->Y1
C1-->Y2
C2-->Y2
D1-->Y2

T5xxl and CLIP are text encoders, while UNET and Transformer generate images based on the analyzed information.

CLIP: The Foundation of Everything

CLIP is a fundamental technique developed by OpenAI to connect images and text.

CLIP comes in several variants based on performance.

Model Name | Release | Parameters | Max Tokens | Comprehensible Text
CLIP-B (Base) | November 2021 | 149 million | 77 | Words & Short Sentences
CLIP-L (Large) | January 2022 | 355 million | 77 | Words & Short Sentences
Long-CLIP-L | April 2024 | 355 million | 248 | Long Sentences
CLIP-G (Giant) | January 2023 | 750 million | 77 | Long Sentences

The base training dataset for image generation AI is LAION-5B, a collection of 5 billion image–text pairs that were filtered using CLIP-B.

CLIP-L is an improved version of CLIP-B, and most image generation AIs use CLIP-L.

Long-CLIP-L is an enhanced model of CLIP-L, modified to handle longer text.

CLIP-G is a scaled-up improvement over CLIP-L: although its token limit is still 77, it is better at picking out the key elements of a prompt, which in practice lets it understand prompts of more than 200 words.

T5xxl: Understanding Context

T5xxl is a text-to-text generation model developed by Google, serving as a foundational technology for today’s AI services like chatbots and translation AIs.

While T5xxl can theoretically handle very long text, its accuracy decreases as the text lengthens.

Model Name | Release | Parameters | Max Tokens | Comprehensible Text
T5xxl | October 2020 | 11 billion | 32,000 | Long Sentences & Context
T5xxl v1.1 | June 2021 | 11 billion | 32,000 | Long Sentences & Context
Flan-T5xxl | October 2022 | 11 billion | 32,000 | Long Sentences & Context

T5xxl v1.1 and Flan-T5xxl have the same number of parameters, but Flan-T5xxl's efficient additional (instruction-tuning) training improved its overall accuracy.
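As with CLIP, only the encoder half of T5 is needed for image generation. Here is a minimal sketch with the transformers library, assuming the public google/flan-t5-xxl checkpoint (note that the full download is tens of gigabytes):

from transformers import T5Tokenizer, T5EncoderModel

# Sketch: encode a long prompt into context-aware vectors with Flan-T5xxl.
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
encoder = T5EncoderModel.from_pretrained("google/flan-t5-xxl")

prompt = ("A sparrow perches on a girl's outstretched hand and pecks at a piece of bread "
          "while the evening sun lights up the suburban street behind them.")
tokens = tokenizer(prompt, return_tensors="pt")
embeddings = encoder(**tokens).last_hidden_state
print(embeddings.shape)   # (1, sequence_length, 4096) context-aware vectors

Unlike CLIP, the sequence length here follows the prompt instead of being fixed at 77 tokens, which is what allows T5xxl to carry longer context into the image model.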

The Rise of Multiple Text Encoders

Newer image generation AIs are equipped with multiple text encoders to enhance prompt comprehension accuracy.

Stable Diffusion 1: Word-Based Understanding


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Stable Diffusion 1
A1(CLIP-L)
Y1(UNET)
end
X1-->A1
A1-->Y1

Released in July 2022, Stable Diffusion 1 used CLIP-L as its sole text encoder.
Due to CLIP-L’s limited token capacity, users had to structure prompts as short keywords and place important ones at the start.
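To see this limit concretely, you can tokenize a long keyword prompt and decode what actually fits into CLIP-L's 77-token window. A small sketch (the prompt below is just an example):

from transformers import CLIPTokenizer

# Sketch: CLIP-L itself accepts at most 77 tokens per pass;
# anything beyond that has to be cut or handled separately by the UI.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
long_prompt = ", ".join(["masterpiece", "1girl", "sparrow on hand", "bread",
                         "suburban street at dusk", "soft lighting"] * 10)
kept = tokenizer(long_prompt, truncation=True, max_length=77)
print(tokenizer.decode(kept.input_ids))   # keywords past the cutoff are simply dropped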

Stable Diffusion XL: Understanding Longer Prompts


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Stable Diffusion XL
B1(CLIP-L)
B2(CLIP-G)
Y1(UNET)
end
X1-->B1
X1-->B2
B1-->Y1
B2-->Y1

Stable Diffusion XL, launched in July 2023, added CLIP-G alongside CLIP-L.
CLIP-G improved prompt comprehension, enabling users to write longer natural-language prompts.

Of SDXL's roughly 7 GB total model size, about 1.8 GB is devoted to the text encoders, which highlights how significant they are.

Stable Diffusion 3: Contextual Understanding


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Stable Diffusion 3
C1(CLIP-L)
C2(CLIP-G)
C3(T5xxl)
Y2(Transformer)
end
X1-->C1
X1-->C2
X1-->C3
C3-->C1
C3-->C2
C1-->Y2
C2-->Y2

In June 2024, Stable Diffusion 3 introduced three text encoders: CLIP-L, CLIP-G, and T5xxl.
This setup improved its ability to understand context.

T5xxl is powerful but large, requiring about 9 GB even in the half-precision FP16 format.

Anime illustration of a sparrow flying over the suburbs at dusk

The growing size of text encoders is what led, starting with Stable Diffusion 3, to the practice of distributing them separately from the main model.

Flux.1: Lacking CLIP-G


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Flux.1
D1(CLIP-L)
D2(T5xxl)
D3(Transformer)
end
X1-->D1
X1-->D2
D2-->D1
D1-->D3

Released in August 2024, Flux.1 uses CLIP-L and T5xxl but does not include CLIP-G.
This may be because T5xxl covers much of CLIP-G’s functionality.

SD3.5 prompt_adherence graph.png (2500×1473)
https://stability.ai/news/introducing-stable-diffusion-3-5

Pointing to this missing CLIP-G, Stability AI claims that Stable Diffusion 3.5 outperforms Flux.1 in language understanding, but in practice Flux.1 rarely feels lacking in prompt comprehension.

Even the previous-generation CLIP-G offers sufficient practical understanding for long prompts.

Enhanced Text Encoders!

Here are the links to the enhanced text encoders used this time.

Enhanced CLIP-L

CLIP-GmP-ViT-L-14

CLIP-GmP-ViT-L-14, developed by Zer0int, is a refined version of CLIP-L released for free.

The developer states that they made this because they simply love CLIP. They used an RTX 4090 on their home PC for training.

CLIP-GmP-ViT-L-14 improves the accuracy of CLIP-L using Geometric Parametrization (GmP), achieving about 90% accuracy on the ImageNet/ObjectNet benchmark compared with roughly 85% for the original CLIP-L.

Improvements to CLIP-GmP-ViT-L-14.png (1820×1731)
zer0int/CLIP-fine-tune: Fine-tuning code for CLIP models

According to Zer0int, CLIP-GmP-ViT-L-14 mitigates CLIP-L's tendency to focus too narrowly on parts of an image when interpreting it.

The Hugging Face page for CLIP-GmP-ViT-L-14 offers both the original FP32 version and an improved FP16 version, ViT-L-14-BEST-smooth-GmP-TE-only-HF-format.safetensors.

Screenshot of CLIP-GmP-ViT-L-14 download page with comment.png (3780×2260)

If you’re unsure which to choose, the FP16 version is a good starting point.

Long-CLIP-GmP-ViT-L-14 (ComfyUI Only)

This model extends the standard CLIP-L’s token limit from 77 to 248, allowing it to handle longer prompts.

Currently, it is only compatible with ComfyUI and cannot be used in Stable Diffusion WebUI Forge.

Screenshot of Long-CLIP-GmP-ViT-L-14 download page with comment.png (3780×2232)

The download page offers the original FP32 version, as well as the performance-enhanced FP16 version, Long-ViT-L-14-BEST-GmP-smooth-ft.safetensors.

Flan-T5xxl (Enhanced T5xxl)

Next up is the enhanced version of T5xxl. Flan-T5xxl is a model with improved accuracy through additional training on the standard T5xxl.

Flan-T5xxl Original (Split Version)

The original Flan-T5xxl, released by Google, is distributed in split files due to its large size (44GB in FP32 format).

Flan-T5xxl Merged Version

A merged file based on the original, made usable for image generation AI.

In addition to the basic merged version, a TE-only version extracting only the text encoder portion for use in Flux.1 / SD 3.5 is also available.

How to Use the Flan-T5xxl Model

Place the downloaded files in one of the following folders:

  • Installation Folder/models/text_encoder
  • Installation Folder/models/clip
  • Installation Folder/Models/CLIP
Anime illustration of a sparrow flying over the suburbs at dusk 6.png (2576×2576)

When using these models, select them in place of the default T5xxl and CLIP-L in your UI's model-loading settings.
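For reference, the same swap can also be expressed in Python with the diffusers library instead of ComfyUI or Forge. This is only a hypothetical sketch: the "path/to/..." folders are placeholders for the upgraded encoders converted to standard Hugging Face format.

import torch
from diffusers import FluxPipeline
from transformers import CLIPTextModel, T5EncoderModel

# Hypothetical sketch: load upgraded text encoders and hand them to Flux.1.
# "path/to/..." are placeholder folders containing HF-format weights.
clip_l = CLIPTextModel.from_pretrained("path/to/CLIP-GmP-ViT-L-14", torch_dtype=torch.bfloat16)
t5xxl = T5EncoderModel.from_pretrained("path/to/Flan-T5xxl", torch_dtype=torch.bfloat16)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=clip_l,      # replaces the default CLIP-L
    text_encoder_2=t5xxl,     # replaces the default T5xxl
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("anime illustration of a sparrow eating bread on a girl's hand").images[0]
image.save("sparrow.png")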

Using FP32 Text Encoders

Text encoders are typically processed in FP16 format.

To process in FP32, add the --fp32-text-enc option to ComfyUI's launch arguments:

Screenshot of enabling --fp32-text-enc in Stability Matrix with comment.png (1144×1866)
Example setting in Stability Matrix

Note that with this setting, any FP16 text encoders you load are also processed in FP32. However, since text encoding typically completes in a few seconds, this isn't a significant issue.

Upgrading Text Encoders in SDXL Is Challenging

This time, I upgraded the text encoder in Flux.1.

Anime illustration of a sparrow flying over the suburbs at dusk 1.png (2579×2579)

Flux.1 and SD 3.5 allow easy upgrades because their text encoders are distributed as separate files, but SDXL and SD 1.5 bundle them into the main model, making upgrades more difficult.

I’ll cover this in detail in a future article!

Conclusion: Try Changing Your Text Encoder!

  • CLIP converts text into vectors
  • T5xxl understands context
  • Enhanced text encoders are publicly available
Sparrows on a wet road

When thinking about image generation quality, we often focus on the transformer part that generates images, sidelining the text encoder.

This experiment showed that text encoders significantly impact image quality too.

Anime illustration of a sparrow flying over the suburbs at dusk 3.png (2579×2579)

The enhanced CLIP-L introduced here noticeably improves image quality despite its small size, so I recommend giving it a try.

Thank you for reading to the end!