Which Is the Best Auto-Prompt Model? A Thorough Comparison of TIPO, Cliption, and Florence-2!

Header image (Flux.1, Euler sampler, 1600×1600): a young female character with short brown hair and blue eyes holds a red rose, with a green parrot on her shoulder, against a backdrop of a field of green and red roses.
  • TIPO is strong in anime illustrations.
  • CLIP-L offers broad creative flexibility.
  • Florence-2 provides precise, detailed analysis.

Introduction

Hello, this is Kimama / Easygoing.

Continuing from the previous article, I will explore methods for generating image prompts automatically.

Keywords vs. Images

In the previous article, I introduced two methods for automatically generating prompts: using keywords and using images.


flowchart LR
subgraph Input
A1(Keywords)
A2(Images)
end
subgraph AI
B1("Large Language Model (LLM)<br>Text -> Text")
B2("Vision Language Model (VLM)<br>Image -> Text")
end
C1(Prompt)
A1-->B1
A2-->B2
B1-->C1
B2-->C1

Last time, I used the highly versatile ChatGPT, but this time, I will look for LLMs and VLMs that can be used locally.

Generating Prompts from Keywords

To generate prompts from keywords, we use LLMs (Large Language Models).

Large Language Models (LLM: Text → Text)

| Model | Year | Released by | Parameters | Open Source |
| --- | --- | --- | --- | --- |
| T5 | 2019 | Google | ~11B | Yes |
| GPT-3 | 2020 | OpenAI | 175B | No |
| PaLM | 2022 | Google | 540B | No |
| LLaMA | 2023 | Meta | 7B ~ 65B | Yes |

One of the most widely used open-source (or more precisely, open-weight) LLMs available for local use is LLaMA, which comes in various parameter sizes.

A LLaMA Model Specially Designed for Prompt Generation!

Within the LLaMA family, there are models fine-tuned from the lightweight LLaMA variants specifically for image prompt generation.

The most famous among them is TIPO.

TIPO: Text to Image with Text Presampling for Prompt Optimization

TIPO is available as a custom node in ComfyUI and as an extension for Stable Diffusion webUI (A1111, Forge, Reforge).

TIPO Excels in Anime Illustrations!

TIPO is a lightweight model with 500M or 200M parameters, trained on both image and text datasets to understand the textual characteristics of images.

TIPO has been trained on the following datasets:

  • Danbooru2023 (5M+ images, tagged, anime illustrations)
  • GBC10M (~10M images, natural language, photorealistic)
  • CoyoHD11M (~11M images, natural language, photorealistic)

Although TIPO has also been trained on photorealistic images, its strongest feature is its extensive training on anime illustrations.

Testing TIPO!

Now, let’s generate images using TIPO.

I will use the same keywords as last time: parrot and friendly, and generate eight images in sequence using a natural language prompt.

Settings

  • Temperature: 0.3
  • Top-p: 0.3
  • Top-k: 5
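These values map directly onto standard text-generation sampling parameters. As a rough illustration of what they control, here is a minimal sketch that loads a TIPO checkpoint with Hugging Face transformers and samples with the same settings. The repository name and the bare-keyword input are assumptions for illustration only; in practice the TIPO custom node builds TIPO's structured prompt format for you.

```python
# Minimal sketch (assumed repo id and simplified input; the real TIPO node
# formats a structured tag/natural-language prompt before generation).
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "KBlueLeaf/TIPO-500M-ft"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tokenizer("parrot, friendly", return_tensors="pt")
output = model.generate(
    **inputs,
    do_sample=True,    # enable stochastic sampling
    temperature=0.3,   # low temperature -> conservative wording
    top_p=0.3,         # nucleus sampling: keep the top 30% probability mass
    top_k=5,           # consider only the 5 most likely tokens per step
    max_new_tokens=256,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```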
Four anime illustrations of friendly parrots generated by TIPO 1
Rich Variety
Four anime illustrations of friendly parrots generated by TIPO 2
Tends to Generate Anime Illustrations

TIPO’s output varies significantly, but the parrot and friendly elements are only partially reflected.

Although I didn’t include any anime-related keywords, the generated prompts frequently lean toward anime illustrations, confirming that TIPO specializes in anime-style outputs.

Generating Prompts from Images!

Next, let’s generate prompts from images.

To generate prompts from images, we use Vision-Language Models (VLMs), which understand both text and images.

Vision-Language Models (VLM: Image + Text → Text)

| Model | Year | Released by | Parameters | Training Images | Open Source |
| --- | --- | --- | --- | --- | --- |
| Open-CLIP-L | 2022 | LAION + Hugging Face | ~430M | ~2B | Yes |
| GPT-4 | 2023 | OpenAI | Not disclosed | Not disclosed | No |
| LLaVA | 2023 | Meta | 7B ~ 65B | ~110M | Yes |
| Florence-2 | 2023 | Microsoft | 230M ~ 771M | ~126M | Partially open |

Among VLMs, the most lightweight and locally usable options are CLIP-L and Florence-2.

Previously, I introduced Cliption, which uses CLIP-L to generate captions from images.

CLIP-L is trained on a large dataset, but its language comprehension is limited.
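The limitation is architectural: CLIP-L by itself only scores how well a piece of text matches an image, it does not generate text, which is why Cliption adds a lightweight caption decoder on top of the CLIP embeddings. A minimal sketch with the open_clip library illustrates this matching behaviour (the image filename and candidate texts are placeholders):

```python
# Minimal sketch: CLIP-L scores image-text similarity; it does not
# produce captions on its own.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k")  # OpenCLIP ViT-L/14 weights
tokenizer = open_clip.get_tokenizer("ViT-L-14")

image = preprocess(Image.open("girl_with_parrot.png")).unsqueeze(0)
texts = tokenizer(["an anime girl holding a parrot",
                   "a photo of a woman with a teddy bear"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # relative match scores for the two candidate descriptions
```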

Meanwhile, Florence-2, released by Microsoft in June 2024, is as lightweight as CLIP-L but trained on highly detailed captioned datasets, giving it superior analytical and comprehension abilities.

Testing Captions!

Let’s compare captions generated by CLIP-L and Florence-2 using an anime-style illustration of a black-haired girl and a parrot.

Test Image

Anime-style illustration of a girl with a parrot

CLIP-L (LongCLIP-SAE-ViT-L-14)

A young girl with blue hair holds a colorful parrot in a lush, sunlit garden with pink and purple flowers, surrounded by a wooden fence, trees, flowers, and a bench, and a small house in the background, with warm sunlight filtering through the leaves and a serene and peaceful atmosphere. The woman wears a light-colored blouse and blue overalls.

CLIP-L’s caption is concise (about 60 words) and captures the key details.

When used in iterative prompt generation, the generated images gradually change over multiple iterations.
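Here, "iterative prompt generation" means feeding each generated image back into the captioner and generating again from the new caption. Below is a minimal sketch of that loop, with the captioner and the image generator passed in as placeholder callables; the actual pipeline runs as nodes inside ComfyUI.

```python
# Sketch of the caption -> generate -> caption loop used in these tests.
from typing import Callable
from PIL import Image

def iterate_prompts(image: Image.Image,
                    caption: Callable[[Image.Image], str],
                    generate: Callable[[str], Image.Image],
                    steps: int = 8) -> Image.Image:
    """Repeatedly caption the current image and re-generate from that caption."""
    for step in range(steps):
        prompt = caption(image)      # e.g. Cliption or Florence-2
        image = generate(prompt)     # e.g. the diffusion model
        image.save(f"iteration_{step:02d}.png")
    return image
```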

Cliption-generated images shifting over iterations
Gradual Changes in Illustrations
Four anime illustrations of friendly parrots generated by Cliption 2
Converging toward a Photorealistic Woman

As the images continued to be generated, the parrot gradually transformed into a teddy bear, while the anime girl turned into a photorealistic woman.

When using Cliption for continuous generation, the result often converges on a photorealistic image of a woman. This suggests that the training dataset likely contained a large number of such photos.

Florence-2 (Florence-2-large-PromptGen v2.0)

Now, let's take a look at the captions generated by Florence-2.

The Florence-2-large-PromptGen v2.0 model is a fine-tuned version of the original Florence-2, trained on additional illustrations uploaded to Civitai. This enhancement improves the model’s ability to recognize images accurately.

Florence-2 can generate three types of captions, each suited to different use cases:

  1. Natural language descriptions
  2. Tags
  3. Analytical breakdowns

Anime-style illustration of a girl with a parrot 2

1. Natural Language Description

A vibrant and detailed anime-style digital illustration from a front camera angle about a girl holding a parrot in a lush garden. The image features a young girl with long, black hair and golden eyes, standing in the middle of the image. She is wearing blue overalls and a beige hoodie, with a warm and inviting expression on her face. The girl is holding a blue and green parrot close to her chest, with its beak slightly open and its eyes looking directly at the viewer. The background is filled with lush greenery, pink flowers, and a wooden fence, creating a serene and peaceful atmosphere. The lighting is soft and warm, casting gentle shadows on the girl's face and the parrot's feathers. The overall style is reminiscent of Japanese anime, with detailed shading and vibrant colors that bring the scene to life.

2. Tags

1girl, solo, long hair, looking at viewer, blush, smile, open mouth, black hair, long sleeves, holding, closed mouth, yellow eyes, upper body, flower, outdoors, open clothes, day, hair between eyes, tree, hood, animal, hoodie, pink flower, holding animal, bird, nature, forest, blue overalls, holding bird, parrot

3. Analytical Breakdown

camera_angle: from front,
art_style: digital illustration,
location: outdoor garden,
background: lush greenery with flowers and trees,
text: NA,
image_composition: middle,
clothing: blue overalls, beige long-sleeved shirt,
distance_to_camera: upper body,
hair_color: black hair,
facial_expression: smile,
action: holding a parrot,
accessory: NA,
pants: overalls

Florence-2 provides highly detailed and analytical captions.

These captions are precise and descriptive, making Florence-2 particularly effective for image-to-image processing and various applications requiring accurate image recognition.
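For reference, here is a minimal sketch of generating such a caption with Hugging Face transformers, using the base microsoft/Florence-2-large checkpoint and its <MORE_DETAILED_CAPTION> task prompt. The PromptGen v2.0 fine-tune used above adds further task prompts for tags and analysis, and the ComfyUI-Florence2 node handles all of this for you; the image filename here is a placeholder.

```python
# Minimal sketch with the base Florence-2 checkpoint (the PromptGen fine-tune
# used in this article adds extra task prompts for tags and analysis).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "microsoft/Florence-2-large"
device, dtype = "cuda", torch.float16

model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=dtype, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)

image = Image.open("girl_with_parrot.png")   # placeholder filename
task = "<MORE_DETAILED_CAPTION>"             # natural-language caption task

inputs = processor(text=task, images=image, return_tensors="pt").to(device, dtype)
generated = model.generate(input_ids=inputs["input_ids"],
                           pixel_values=inputs["pixel_values"],
                           max_new_tokens=512, num_beams=3)
raw = processor.batch_decode(generated, skip_special_tokens=False)[0]
print(processor.post_process_generation(raw, task=task, image_size=image.size))
```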

Now, let's generate images using Florence-2 captions in the same iterative way as before.

Four anime illustrations of friendly parrots generated by Florence2 1
Maintains both the parrot and the clothing
Four anime illustrations of friendly parrots generated by Florence2 2
Preserves the clothing's distinctive features as well

Unlike CLIP-L, which gradually shifts the image composition over iterations, Florence-2 maintains the original features with high accuracy.

This precision makes Florence-2 ideal for industrial applications and fields requiring strict image recognition.

Comparing Three Auto-Prompt Methods

Now, let's compare the three auto-prompt generation methods introduced in this article.

| | TIPO | Cliption | ComfyUI-Florence2 |
| --- | --- | --- | --- |
| Base Model | LLaMA | CLIP-L | Florence-2 |
| Checkpoint Used | TIPO-500M-ft-FP16 | LongCLIP-SAE-ViT-L-14-FP32 | Florence-2-large-PromptGen-v2.0-FP32 |
| File Size | 0.9 GB | 1.8 GB | 3.2 GB |
| Training Images | ~25.7M | ~2B | ~126M |
| Datasets | Danbooru2023, GBC10M, CoyoHD11M | LAION-2B | FLD-5B, Civitai |
| Features | Anime illustrations | Wide range of categories | Highly accurate captions |

Although all three methods generate prompts automatically, TIPO, Cliption, and Florence-2 each have distinct characteristics.

For this test, I used the generated prompts as they were. However, in practical use, adding or removing specific keywords allows for more flexible control over the final output.
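As a trivial illustration of that kind of post-processing, here is a sketch of a helper that strips unwanted tags from a generated prompt and appends your own; the function name and tag handling are just an example, not part of any of the tools above.

```python
def adjust_prompt(prompt: str,
                  add: tuple[str, ...] = (),
                  remove: tuple[str, ...] = ()) -> str:
    """Drop unwanted comma-separated tags and append extra ones."""
    tags = [t.strip() for t in prompt.split(",") if t.strip()]
    tags = [t for t in tags if t not in remove]
    return ", ".join(tags + list(add))

# Example: force a photorealistic look and drop an unwanted tag.
print(adjust_prompt("1girl, solo, parrot, anime style",
                    add=("photorealistic",), remove=("anime style",)))
```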

Installing TIPO and Florence-2

Finally, here’s a brief guide on how to install TIPO and Florence-2 using ComfyUI Manager.

TIPO Installation

Screenshot: searching for the TIPO custom node in ComfyUI Manager (annotated)

For more details, refer to the official documentation.

Installing ComfyUI-Florence2

Screenshot: searching for the Florence2Run custom node in ComfyUI Manager (annotated)
Screenshot: searching for the comfyui-tensorop custom node in ComfyUI Manager (annotated)

To use ComfyUI-Florence2, you also need to install the comfyui-tensorop custom node.

I will cover the usage of Florence-2 in more detail in a future article.

For Cliption installation instructions, please refer to my previous article:

Conclusion: AI Has Its Own Personality!

  • TIPO specializes in anime illustrations.
  • CLIP-L provides broad creative flexibility.
  • Florence-2 excels in accurate analysis.

This article introduced three AI models for automatic prompt generation.

Although these models generate prompts automatically, the output reflects the influence of the training datasets used by each AI.

I primarily generate semi-realistic anime illustrations, and I found that CLIP-L offers the broadest range of artistic expression for my needs.

Closing image (2520×2520): a young male animated character with brown hair and blue eyes holds a blue bird on his gloved hand, wearing a dark blue shirt and white suspenders, surrounded by a lush garden with red roses.

In image generation, the highly expressive IP-Adapter uses CLIP image encoders (CLIP-L, CLIP-H, CLIP-G) to encode its reference images. I believe CLIP still has plenty of potential for further applications.

I look forward to further exploring the unique strengths of different AI models.

Thank you for reading!


Model Introduction

blue_pencil-flux1-v0.0.1