OneTrainer [Feat]: Caption/tags enhancement with multimodal LLMs

Describe your use-case.

There are multiple simple models used in this repository: Blip, Clip and WD-taggers. However, when it comes to detailed description, they are all dwarfed by modern multimodal LLMs such as LLaVA-like, CogVLM or InternLM-XComposer2 models. The latter has the coolest capabilities as of now, as it allows feeding in up to 4K resolution images, captioning extremely fine-details.

On top of that, unlike existing repo models, these ones can receive text input beside the images, so it is possible to enhance the preexisting captions or tags

As shown by PixArt-series models, especially PixArt-Sigma, well captioned images. However, it applies mainly to LLM-embeddings based models (using T5 or other LLMs, with context > 300) as the models such as CLIP have very limited context length, resolution, embedding layer size and pretrain data to make good captions. (so not much good impact for SD1.5 or SDXL)

What would you like to see as a solution?

Add openai-api/ollama compatible calling mechanism to (batch) captioning section.
Add fully customizable prompt template with ability to insert the pre-available captions or Danbooru tags inside the prompt and where to put the image tokens
Add ability to insert jailbreaks in the start of the answer, such as Sure! Here is the description: " to game aligned models (local, but who got it from datasets such as ShareGPT-4V)
Add parsing of the generated descriptions, maybe with regex
AlignProp RL dataset generation with MLLM's preference out of multiple suggested images

Have you considered alternatives? List them here.

No response

May 22 '24 14:05 kabachuha

Most of these features, if not all, you get with TagGui, just in case you or other don't know this tool.

However, one model i would like to see in an expanded list in OT would be moondream2, small sized (4gb) powerful and reliable for natural language caption. More reliable in my experience than 16gb lliama3 competitor or LLaVA, which write way too often plain nonsense and require supervision and careful prompting.

May 25 '24 15:05 madrooky

@madrooky thanks for your response, looks like a very nice tool ❤️

May 25 '24 17:05 kabachuha

Dataset Helpers is another great tool. Either way personally I believe this is scope creep and the issue should be closed but thats just me.

Jul 12 '24 01:07 O-J1

TL;DR; [NOT AN AD] there is a taggui tool supporting the many LLMs and captioning models (from WD to COGVLM v2). Implementing a rich UI using a Tk library almost impractical...

Colleagues, I understand the idea of One Trainer for all and things like that... But I have a huge experience in desktop and web GUI development to say the following. I am sure that few separate UI's with different User Experience works slightly better (if there is a quick transition between the two), than One Overloaded with Features Trainer GUI.

Just look at current UI. There is no Ctrl+A in text boxes because the stock components don't offer this feature out of the box. There is no markup language (I believe but not sure). Implementing rich UI will take a lot of time and will produce a lot of bugs. But what we will get for that? A less convenient copy of taggui?

As for me, this request is not so profitable. PS: not an offence; just want to focus the devs on core features like Lycoris training etc.

Aug 17 '24 07:08 homoluden

I personally recommend taggui to everyone asking this question right now. It's a good tool.

Aug 17 '24 14:08 mx

For now there significant work to be done in other parts of the application. You are better served by using a dedicated tool for this like Taggui or DatasetHelpers. Additionally all VLLMS currently suck because they are not capable on even slightly NSFW content which means if it misunderstands, it flips out usually. Not worth the effort currently.

Oct 13 '24 16:10 O-J1