What Are CLIP and T5xxl? How Text Encoders Can Make Illustrations Stunning!

Anime illustration of a sparrow eating bread on a girl's hand 5.png (1600×1600)
  • CLIP is the foundational technology for image generation.
  • T5xxl enhances prompt comprehension.
  • Enhanced text encoders are publicly available and ready to use.

Introduction

Hello, I’m Easygoing.

Today, let’s dive into the topic of text encoders in image generation AI.

Anime illustration of a sparrow eating bread on a girl's hand 7.png (2576×2576)

Text Encoders Are Like Dictionaries

AI takes the text we input and converts it into a format that machines can understand.


flowchart LR
subgraph Prompt
A1(Text)
end
subgraph Text Encoder
B1(Words)
B2(Tokens)
B3(Vectors)
end
subgraph Transformer / UNET
C1(Generate Image)
end
A1-->B1
B1-->B2
B2-->B3
B3-->C1

This conversion is handled by the text encoder, which acts like a dictionary translating human language into machine language.
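As a concrete illustration, here is a minimal sketch of that pipeline using the Hugging Face transformers library with the standard CLIP-L checkpoint (the model ID and output shapes are assumptions based on the public openai/clip-vit-large-patch14 weights; image generation UIs perform the same steps internally):

from transformers import CLIPTokenizer, CLIPTextModel

# Minimal sketch of the text encoder stage: text -> words -> tokens -> vectors
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "anime illustration of a sparrow eating bread on a girl's hand"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")
print(tokens.input_ids.shape)   # (1, 77) token IDs

vectors = text_encoder(**tokens).last_hidden_state
print(vectors.shape)            # (1, 77, 768) per-token vectors handed to the UNET / Transformer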

How does changing the text encoder in image generation AI affect image quality?

Comparing Real Images!

Let’s take a look at how images change with Flux.1, a new image generation AI.

Flux.1 comes equipped with two types of text encoders.


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Flux.1
D1(CLIP-L)
D2(T5xxl)
D3(Transformer)
end
X1-->D1
X1-->D2
D2-->D1
D1-->D3

  • T5xxl: Understands the context of prompts
  • CLIP-L: Converts words into vectors

This time, we’ll replace T5xxl and CLIP-L with higher-precision versions.

T5xxl-FP16 + CLIP-L-FP16

T5xxl-FP16.png (1440×1440)
Original

Flan-T5xxl-FP16 + CLIP-L-FP16

Flan-FP16.png (1440×1440)
Improved prompt effectiveness

Flan-T5xxl-FP32 + CLIP-L-FP32

Flan_FP32_CLIP-L_FP32.png (1440×1440)
Enhanced image quality

Flan-T5xxl-FP32 + CLIP-GmP-ViT-L-14-FP32

Flan_FP32_CLIP-GmP_L_FP32.png (1440×1440)
Better background detail

Flan-T5xxl-FP32 + Long-CLIP-GmP-ViT-L-14-FP32

Flan_FP32_LongCLIP-GmP_L_FP32.png (1440×1440)
Even more detailed background

The text encoders used here improve in performance as you move down the list.

Changing the text encoder notably enhances the details of the buildings on the right, improving overall image quality.

Note that the Long-CLIP-L model at the bottom can be used in ComfyUI but not in Stable Diffusion WebUI Forge.

Also, using FP32 text encoders requires the --fp32-text-enc setting, which we’ll cover later.

A Closer Look at Text Encoders

Let’s explore text encoders in more detail.

First, here are the text encoders typically found in major image generation AIs.


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Image Generation
Y1(UNET)
Y2(Transformer)
end
subgraph Flux.1
D1(CLIP-L)
D2(T5xxl)
end
subgraph Stable Diffusion 3
C1(CLIP-L)
C2(CLIP-G)
C3(T5xxl)
end
subgraph Stable Diffusion XL
B1(CLIP-L)
B2(CLIP-G)
end
subgraph Stable Diffusion 1
A1(CLIP-L)
end
X1-->A1
X1-->B1
X1-->B2
X1-->C1
X1-->C2
X1-->C3
X1-->D1
X1-->D2
C3-->C1
C3-->C2
D2-->D1
A1-->Y1
B1-->Y1
B2-->Y1
C1-->Y2
C2-->Y2
D1-->Y2

T5xxl and CLIP are text encoders, while UNET and Transformer generate images based on the analyzed information.

CLIP: The Foundation of Everything

CLIP is a fundamental technique developed by OpenAI to connect images and text.

CLIP comes in several variants based on performance.

Model Name | Release | Parameters | Max Tokens | Comprehensible Text
CLIP-B (Base) | November 2021 | 149 million | 77 | Words & Short Sentences
CLIP-L (Large) | January 2022 | 355 million | 77 | Words & Short Sentences
Long-CLIP-L | April 2024 | 355 million | 248 | Long Sentences
CLIP-G (Giant) | January 2023 | 750 million | 77 | Long Sentences

The base training dataset for image generation AI is LAION-5B, a collection of 5 billion image–text pairs that were filtered using CLIP-B.

CLIP-L is an improved version of CLIP-B, and most image generation AIs use CLIP-L.

Long-CLIP-L is an enhanced model of CLIP-L, modified to handle longer text.

CLIP-G is a scaled-up improvement over CLIP-L: although its token limit is still 77, it is better at picking out the key elements of a prompt, which in practice lets it understand prompts of more than 200 words.

T5xxl: Understanding Context

T5xxl is a text-to-text generation model developed by Google, serving as a foundational technology for today’s AI services like chatbots and translation AIs.

While T5xxl can theoretically handle very long text, its accuracy decreases as the text lengthens.

Model Name | Release | Parameters | Max Tokens | Comprehensible Text
T5xxl | October 2020 | 11 billion | 32,000 | Long Sentences & Context
T5xxl v1.1 | June 2021 | 11 billion | 32,000 | Long Sentences & Context
Flan-T5xxl | October 2022 | 11 billion | 32,000 | Long Sentences & Context

T5xxl v1.1 and Flan-T5xxl have the same number of parameters, but Flan-T5xxl's efficient additional (instruction-tuning) training improved its overall accuracy.
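As with CLIP, only the encoder half of T5 is needed for image generation. Here is a minimal sketch with the transformers library, assuming the public google/flan-t5-xxl checkpoint (note that the full download is tens of gigabytes):

from transformers import T5Tokenizer, T5EncoderModel

# Sketch: encode a long prompt into context-aware vectors with Flan-T5xxl.
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
encoder = T5EncoderModel.from_pretrained("google/flan-t5-xxl")

prompt = ("A sparrow perches on a girl's outstretched hand and pecks at a piece of bread "
          "while the evening sun lights up the suburban street behind them.")
tokens = tokenizer(prompt, return_tensors="pt")
embeddings = encoder(**tokens).last_hidden_state
print(embeddings.shape)   # (1, sequence_length, 4096) context-aware vectors

Unlike CLIP, the sequence length here follows the prompt instead of being fixed at 77 tokens, which is what allows T5xxl to carry longer context into the image model.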

The Rise of Multiple Text Encoders

Newer image generation AIs are equipped with multiple text encoders to enhance prompt comprehension accuracy.

Stable Diffusion 1: Word-Based Understanding


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Stable Diffusion 1
A1(CLIP-L)
Y1(UNET)
end
X1-->A1
A1-->Y1

Released in July 2022, Stable Diffusion 1 used CLIP-L as its sole text encoder.
Due to CLIP-L’s limited token capacity, users had to structure prompts as short keywords and place important ones at the start.
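To see this limit concretely, you can tokenize a long keyword prompt and decode what actually fits into CLIP-L's 77-token window. A small sketch (the prompt below is just an example):

from transformers import CLIPTokenizer

# Sketch: CLIP-L itself accepts at most 77 tokens per pass;
# anything beyond that has to be cut or handled separately by the UI.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
long_prompt = ", ".join(["masterpiece", "1girl", "sparrow on hand", "bread",
                         "suburban street at dusk", "soft lighting"] * 10)
kept = tokenizer(long_prompt, truncation=True, max_length=77)
print(tokenizer.decode(kept.input_ids))   # keywords past the cutoff are simply dropped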

Stable Diffusion XL: Understanding Longer Prompts


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Stable Diffusion XL
B1(CLIP-L)
B2(CLIP-G)
Y1(UNET)
end
X1-->B1
X1-->B2
B1-->Y1
B2-->Y1

Stable Diffusion XL, launched in July 2023, added CLIP-G alongside CLIP-L.
CLIP-G improved prompt comprehension, enabling users to write longer natural-language prompts.

Of SDXL's roughly 7 GB total model size, about 1.8 GB is devoted to the text encoders, which highlights how significant they are.

Stable Diffusion 3: Contextual Understanding


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Stable Diffusion 3
C1(CLIP-L)
C2(CLIP-G)
C3(T5xxl)
Y2(Transformer)
end
X1-->C1
X1-->C2
X1-->C3
C3-->C1
C3-->C2
C1-->Y2
C2-->Y2

In June 2024, Stable Diffusion 3 introduced three text encoders: CLIP-L, CLIP-G, and T5xxl.
This setup improved its ability to understand context.

T5xxl is powerful but large, requiring about 9 GB even in the half-precision FP16 format.

Anime illustration of a sparrow flying over the suburbs at dusk

The growing size of text encoders is what led, starting with Stable Diffusion 3, to the practice of distributing them separately from the main model.

Flux.1: Lacking CLIP-G


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Flux.1
D1(CLIP-L)
D2(T5xxl)
D3(Transformer)
end
X1-->D1
X1-->D2
D2-->D1
D1-->D3

Released in August 2024, Flux.1 uses CLIP-L and T5xxl but does not include CLIP-G.
This may be because T5xxl covers much of CLIP-G’s functionality.

SD3.5 prompt_adherence graph.png (2500×1473)
https://stability.ai/news/introducing-stable-diffusion-3-5

Pointing to this missing CLIP-G, Stability AI claims that Stable Diffusion 3.5 outperforms Flux.1 in language understanding, but in practice Flux.1 rarely feels lacking in prompt comprehension.

Even the previous-generation CLIP-G offers sufficient practical understanding for long prompts.

Enhanced Text Encoders!

Here are the links to the enhanced text encoders used this time.

Enhanced CLIP-L

CLIP-GmP-ViT-L-14

CLIP-GmP-ViT-L-14, developed by Zer0int, is a refined version of CLIP-L released for free.

The developer states that they made this because they simply love CLIP. They used an RTX 4090 on their home PC for training.

CLIP-GmP-ViT-L-14 improves the accuracy of CLIP-L using Geometric Parametrization (GmP), achieving about 90% accuracy on the ImageNet/ObjectNet benchmark compared with roughly 85% for the original CLIP-L.

Improvements to CLIP-GmP-ViT-L-14.png (1820×1731)
zer0int/CLIP-fine-tune: Fine-tuning code for CLIP models

According to Zer0int, CLIP-GmP-ViT-L-14 mitigates CLIP-L's tendency to focus too narrowly on parts of an image when interpreting it.

The Hugging Face page for CLIP-GmP-ViT-L-14 offers both the original FP32 version and an improved FP16 version, ViT-L-14-BEST-smooth-GmP-TE-only-HF-format.safetensors.

Screenshot of CLIP-GmP-ViT-L-14 download page with comment.png (3780×2260)

If you’re unsure which to choose, the FP16 version is a good starting point.

Long-CLIP-GmP-ViT-L-14 (ComfyUI Only)

This model extends the standard CLIP-L’s token limit from 77 to 248, allowing it to handle longer prompts.

Currently, it is only compatible with ComfyUI and cannot be used in Stable Diffusion WebUI Forge.

Screenshot of Long-CLIP-GmP-ViT-L-14 download page with comment.png (3780×2232)

The download page offers the original FP32 version, as well as the performance-enhanced FP16 version, Long-ViT-L-14-BEST-GmP-smooth-ft.safetensors.

Flan-T5xxl (Enhanced T5xxl)

Next up is the enhanced version of T5xxl. Flan-T5xxl is a model with improved accuracy through additional training on the standard T5xxl.

Flan-T5xxl Original (Split Version)

The original Flan-T5xxl, released by Google, is distributed in split files due to its large size (44GB in FP32 format).

Flan-T5xxl Merged Version

A merged file based on the original, made usable for image generation AI.

In addition to the basic merged version, a TE-only version extracting only the text encoder portion for use in Flux.1 / SD 3.5 is also available.

How to Use the Flan-T5xxl Model

Place the downloaded files in one of the following folders:

  • Installation Folder/models/text_encoder
  • Installation Folder/models/clip
  • Installation Folder/Models/CLIP
Anime illustration of a sparrow flying over the suburbs at dusk 6.png (2576×2576)

When using these models, select them in place of the default T5xxl and CLIP-L in your UI's model-loading settings.
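For reference, the same swap can also be expressed in Python with the diffusers library instead of ComfyUI or Forge. This is only a hypothetical sketch: the "path/to/..." folders are placeholders for the upgraded encoders converted to standard Hugging Face format.

import torch
from diffusers import FluxPipeline
from transformers import CLIPTextModel, T5EncoderModel

# Hypothetical sketch: load upgraded text encoders and hand them to Flux.1.
# "path/to/..." are placeholder folders containing HF-format weights.
clip_l = CLIPTextModel.from_pretrained("path/to/CLIP-GmP-ViT-L-14", torch_dtype=torch.bfloat16)
t5xxl = T5EncoderModel.from_pretrained("path/to/Flan-T5xxl", torch_dtype=torch.bfloat16)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=clip_l,      # replaces the default CLIP-L
    text_encoder_2=t5xxl,     # replaces the default T5xxl
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("anime illustration of a sparrow eating bread on a girl's hand").images[0]
image.save("sparrow.png")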

Using FP32 Text Encoders

Text encoders are typically processed in FP16 format.

To process in FP32, add the --fp32-text-enc option to ComfyUI's launch arguments:

Screenshot of enabling --fp32-text-enc in Stability Matrix with comment.png (1144×1866)
Example setting in Stability Matrix

Note that with this setting, any FP16 text encoders you load are also processed in FP32. However, since text encoding typically completes in a few seconds, this isn't a significant issue.

Upgrading Text Encoders in SDXL Is Challenging

This time, I upgraded the text encoder in Flux.1.

Anime illustration of a sparrow flying over the suburbs at dusk 1.png (2579×2579)

Flux.1 and SD 3.5 allow easy upgrades because their text encoders are distributed as separate files, but SDXL and SD 1.5 bundle them into the main model, making upgrades more difficult.

I’ll cover this in detail in a future article!

Conclusion: Try Changing Your Text Encoder!

  • CLIP converts text into vectors
  • T5xxl understands context
  • Enhanced text encoders are publicly available
Sparrows on a wet road

When thinking about image generation quality, we often focus on the transformer part that generates images, sidelining the text encoder.

This experiment showed that text encoders significantly impact image quality too.

Anime illustration of a sparrow flying over the suburbs at dusk 3.png (2579×2579)

The enhanced CLIP-L introduced here noticeably improves image quality despite its small size, so I recommend giving it a try.

Thank you for reading to the end!