Which Is the Best Auto-Prompt Model? A Thorough Comparison of TIPO, Cliption, and Florence-2!


- TIPO is strong in anime illustrations.
- CLIP-L offers broad creative flexibility.
- Florence-2 provides precise, detailed analysis.
Introduction
Hello, this is Kimama / Easygoing.
Continuing from the previous article, I will explore methods for generating image prompts automatically.
Keywords vs. Images
In the previous article, I introduced two methods for automatically generating prompts: using keywords and using images.
flowchart LR
subgraph Input
A1(Keywords)
A2(Images)
end
subgraph AI
B1("Large Language Model (LLM)<br>Text -> Text")
B2("Vision Language Model (VLM)<br>Image -> Text")
end
C1(Prompt)
A1-->B1
A2-->B2
B1-->C1
B2-->C1
Last time, I used the highly versatile ChatGPT, but this time, I will look for LLMs and VLMs that can be used locally.
Generating Prompts from Keywords
To generate prompts from keywords, we use LLMs (Large Language Models).
Large Language Models (LLM: Text → Text)
| Model | Year | Released by | Parameters | Open Source |
|---|---|---|---|---|
| T5 | 2019 | Google | ~11B | Yes |
| GPT-3 | 2020 | OpenAI | 175B | No |
| PaLM | 2022 | Google | 540B | No |
| LLaMA | 2023 | Meta | 7B ~ 65B | Yes |
One of the most widely used open-source (or more precisely, open-weight) LLMs available for local use is LLaMA, which comes in various parameter sizes.
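Before turning to TIPO itself, here is a minimal sketch of the basic idea: hand a locally running instruct-tuned LLM some keywords and ask it to expand them into an image prompt. The model ID below is only an example (any small chat model you have downloaded will do), and the exact wording of the instruction is up to you.

```python
# Minimal sketch: expanding keywords into an image prompt with a local instruct LLM.
# The model ID is only an example -- swap in any small chat model you have locally.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",  # example model ID (assumption)
    device_map="auto",
)

keywords = ["parrot", "friendly"]
messages = [{
    "role": "user",
    "content": "Expand these keywords into one detailed image-generation prompt: "
               + ", ".join(keywords),
}]

result = generator(messages, max_new_tokens=120, do_sample=True, temperature=0.7)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply
```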
A LLaMA Model Specially Designed for Prompt Generation!
Within the LLaMA family, there are models fine-tuned from lightweight LLaMA variants specifically for image prompt generation.
The most famous of these is TIPO.
TIPO: Text to Image with Text Presampling for Prompt Optimization
TIPO is available as a custom node in ComfyUI and as an extension for Stable Diffusion webUI (A1111, Forge, Reforge).
TIPO Excels in Anime Illustrations!
TIPO is a lightweight model with 500M or 200M parameters, trained on both image and text datasets to understand the textual characteristics of images.
TIPO has been trained on the following datasets:
- Danbooru2023 (5M+ images, tagged, anime illustrations)
- GBC10M (~10M images, natural language, photorealistic)
- CoyoHD11M (~11M images, natural language, photorealistic)
Although TIPO has also been trained on photorealistic images, its strongest feature is its extensive training on anime illustrations.
Testing TIPO!
Now, let’s generate images using TIPO.
I will use the same keywords as last time: parrot and friendly, and generate eight images in sequence using a natural language prompt.
Settings (a sketch of how these map onto a standard generation call follows the list):
- Temperature: 0.3
- Top-p: 0.3
- Top-k: 5
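As a rough illustration, here is a minimal sketch of how these sampling settings map onto a standard Hugging Face generate() call. The ComfyUI custom node and webUI extension handle this internally; the Hugging Face repo name and the plain comma-separated input are my assumptions, since TIPO normally expects its own structured tag format.

```python
# Minimal sketch: running TIPO with the sampling settings above via Hugging Face
# transformers. The repo name and the plain comma-separated input are assumptions;
# the ComfyUI node / webUI extension normally handle TIPO's structured input for you.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KBlueLeaf/TIPO-500M"  # assumed repo for the 500M weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("parrot, friendly", return_tensors="pt").to("cuda")
output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.3,  # same values as the settings listed above
    top_p=0.3,
    top_k=5,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```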


TIPO’s output varies significantly, but the parrot and friendly elements are only partially reflected.
Although I didn’t include any anime-related keywords, the generated prompts frequently lean toward anime illustrations, confirming that TIPO specializes in anime-style outputs.
Generating Prompts from Images!
Next, let’s generate prompts from images.
To generate prompts from images, we use Vision-Language Models (VLMs), which understand both text and images.
Vision-Language Models (VLM: Image + Text → Text)
| Model | Year | Released by | Parameters | Training Images | Open Source |
|---|---|---|---|---|---|
| Open-CLIP-L | 2022 | LAION + Hugging Face | ~430M | ~2B | Yes |
| GPT-4 | 2023 | OpenAI | Not Disclosed | Not Disclosed | No |
| LLaVA | 2023 | Meta | 7B ~ 65B | ~110M | Yes |
| Florence-2 | 2023 | Microsoft | 230M ~ 771M | ~126M | Partially Open |
Among VLMs, the most lightweight and locally usable options are CLIP-L and Florence-2.
Previously, I introduced Cliption, which uses CLIP-L to generate captions from images.
CLIP-L is trained on a large dataset, but its language comprehension is limited.
Meanwhile, Florence-2, released by Microsoft in June 2024, is as lightweight as CLIP-L but trained on highly detailed captioned datasets, giving it superior analytical and comprehension abilities.
Testing Captions!
Let’s compare captions generated by CLIP-L and Florence-2 using an anime-style illustration of a black-haired girl and a parrot.
Test Image

CLIP-L (LongCLIP-SAE-ViT-L-14)
A young girl with blue hair holds a colorful parrot in a lush, sunlit garden with pink and purple flowers, surrounded by a wooden fence, trees, flowers, and a bench, and a small house in the background, with warm sunlight filtering through the leaves and a serene and peaceful atmosphere. The woman wears a light-colored blouse and blue overalls.
CLIP-L’s caption is concise (about 60 words) and captures the key elements, though some details are off, such as describing the black hair as blue.
When this caption is fed back into image generation and the result is re-captioned repeatedly, the images gradually drift.


As the images continued to be generated, the parrot gradually transformed into a teddy bear, while the anime girl turned into a photorealistic woman.
When Cliption drives this kind of continuous generation, the result often converges on a photorealistic woman, which suggests that such photos made up a significant share of the training dataset.
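For reference, the iterative loop itself is very simple. Below is a minimal sketch in which caption_image() and generate_image() are hypothetical stand-ins for the Cliption captioning step and the Stable Diffusion sampling step:

```python
# Minimal sketch of the image -> caption -> image loop described above.
# caption_image() and generate_image() are hypothetical stand-ins for the actual
# Cliption captioning call and the Stable Diffusion sampling call.
from PIL import Image


def caption_image(image: Image.Image) -> str:
    """Hypothetical: return a caption for the image (e.g. via Cliption / CLIP-L)."""
    raise NotImplementedError


def generate_image(prompt: str) -> Image.Image:
    """Hypothetical: render a new image from the prompt (e.g. via Stable Diffusion)."""
    raise NotImplementedError


image = Image.open("test_image.png").convert("RGB")
for step in range(8):  # number of iterations to run
    prompt = caption_image(image)   # image -> text
    image = generate_image(prompt)  # text -> image
    image.save(f"iteration_{step:02d}.png")
```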
Florence-2 (Florence-2-large-PromptGen v2.0)
Now, let's take a look at the captions generated by Florence-2.
The Florence-2-large-PromptGen v2.0 model is a fine-tuned version of the original Florence-2, trained on additional illustrations uploaded to Civitai. This enhancement improves the model’s ability to recognize images accurately.
Florence-2 can generate three types of captions, each suited to different use cases:
- Natural language descriptions
- Tags
- Analytical breakdowns

1. Natural Language Description
A vibrant and detailed anime-style digital illustration from a front camera angle about a girl holding a parrot in a lush garden. The image features a young girl with long, black hair and golden eyes, standing in the middle of the image. She is wearing blue overalls and a beige hoodie, with a warm and inviting expression on her face. The girl is holding a blue and green parrot close to her chest, with its beak slightly open and its eyes looking directly at the viewer. The background is filled with lush greenery, pink flowers, and a wooden fence, creating a serene and peaceful atmosphere. The lighting is soft and warm, casting gentle shadows on the girl's face and the parrot's feathers. The overall style is reminiscent of Japanese anime, with detailed shading and vibrant colors that bring the scene to life.
2. Tags
1girl, solo, long hair, looking at viewer, blush, smile, open mouth, black hair, long sleeves, holding, closed mouth, yellow eyes, upper body, flower, outdoors, open clothes, day, hair between eyes, tree, hood, animal, hoodie, pink flower, holding animal, bird, nature, forest, blue overalls, holding bird, parrot
3. Analytical Breakdown
camera_angle: from front,
art_style: digital illustration,
location: outdoor garden,
background: lush greenery with flowers and trees,
text: NA,
image_composition: middle,
clothing: blue overalls, beige long-sleeved shirt,
distance_to_camera: upper body,
hair_color: black hair,
facial_expression: smile,
action: holding a parrot,
accessory: NA,
pants: overalls
Florence-2 provides highly detailed and analytical captions.
These captions are precise and descriptive, making Florence-2 particularly effective for image-to-image processing and various applications requiring accurate image recognition.
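If you want to reproduce this kind of caption outside ComfyUI, here is a minimal sketch using Hugging Face transformers with the base microsoft/Florence-2-large weights. The PromptGen v2.0 fine-tune loads the same way, but its custom task prompts are an assumption on my part, so the sketch sticks to the standard <MORE_DETAILED_CAPTION> task.

```python
# Minimal sketch: generating a Florence-2 caption with Hugging Face transformers.
# Uses the base microsoft/Florence-2-large weights; the PromptGen v2.0 fine-tune
# loads the same way, but its custom task prompts are not shown here.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("test_image.png").convert("RGB")
task = "<MORE_DETAILED_CAPTION>"  # standard Florence-2 task token for long captions

inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
caption = processor.post_process_generation(raw, task=task, image_size=(image.width, image.height))
print(caption[task])
```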
Now, let's generate images using Florence-2 captions in the same iterative way as before.


Unlike CLIP-L, which gradually shifts the image composition over iterations, Florence-2 maintains the original features with high accuracy.
This precision makes Florence-2 ideal for industrial applications and fields requiring strict image recognition.
Comparing Three Auto-Prompt Methods
Now, let's compare the three auto-prompt generation methods introduced in this article.
| | TIPO | Cliption | ComfyUI-Florence2 |
|---|---|---|---|
| Base Model | LLaMA | CLIP-L | Florence-2 |
| Advanced Model | TIPO-500M-ft-FP16 | LongCLIP-SAE-ViT-L-14-FP32 | Florence-2-large-PromptGen-v2.0-FP32 |
| File Size | 0.9 GB | 1.8 GB | 3.2 GB |
| Training Images | ~25.7M | ~2B | ~126M |
| Dataset | Danbooru2023, GBC10M, CoyoHD11M | LAION-2B | FLD-5B, Civitai |
| Features | Anime illustrations | Wide range of categories | Highly accurate captions |
Although all three methods generate prompts automatically, TIPO, Cliption, and Florence-2 each have distinct characteristics.
For this test, I used the generated prompts as they were. However, in practical use, adding or removing specific keywords allows for more flexible control over the final output.
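As a trivial example of that kind of adjustment, assuming a comma-separated tag prompt like the one Florence-2 produced above, a few lines of Python are enough to drop unwanted tags and force in your own:

```python
# Minimal sketch: editing a generated tag prompt by removing and adding keywords.
generated = ("1girl, solo, long hair, looking at viewer, blush, smile, open mouth, "
             "black hair, blue overalls, holding bird, parrot")

remove = {"open mouth", "blush"}         # tags to drop
extra = ["masterpiece", "best quality"]  # tags to force in

tags = [t.strip() for t in generated.split(",")]
prompt = ", ".join(extra + [t for t in tags if t not in remove])
print(prompt)
```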
Installing TIPO and Florence-2
Finally, here’s a brief guide on how to install TIPO and Florence-2 using ComfyUI Manager.
TIPO Installation

For more details, refer to the official documentation.
Installing ComfyUI-Florence2


To use ComfyUI-Florence2, you also need to install the comfyui-tensorop custom node.
I will cover the usage of Florence-2 in more detail in a future article.
For Cliption installation instructions, please refer to my previous article:
Conclusion: AI Has Its Own Personality!
- TIPO specializes in anime illustrations.
- CLIP-L provides broad creative flexibility.
- Florence-2 excels in accurate analysis.
This article introduced three AI models for automatic prompt generation.
Although these models generate prompts automatically, the output reflects the influence of the training datasets used by each AI.
I primarily generate semi-realistic anime illustrations, and I found that CLIP-L offers the broadest range of artistic expression for my needs.

In image generation, the highly expressive IP-Adapter relies on CLIP image encoders (CLIP-L, CLIP-H, CLIP-G), so I believe CLIP still has plenty of potential for further applications.
I look forward to further exploring the unique strengths of different AI models.
Thank you for reading!