[`Research Project`] Add AnyText: Multilingual Visual Text Generation And Editing
Thanks for the opportunity to fix #6407!
AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former uses inputs such as the text glyph, position, and masked image to produce latent features for text generation or editing. The latter employs an OCR model to encode stroke data as embeddings, which are blended with image-caption embeddings from the tokenizer to generate text that integrates seamlessly with the background. Training uses a text-control diffusion loss and a text perceptual loss to further improve writing accuracy.
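To make the auxiliary latent module's role concrete, here is a toy sketch of the idea: concatenate the glyph, position, and masked-image conditions along the channel dimension and project them into one latent feature map. All names, channel counts, and shapes below are hypothetical illustrations, not the actual AnyText implementation.

```python
import torch
import torch.nn as nn


class ToyAuxiliaryLatentModule(nn.Module):
    """Toy sketch: fuse glyph, position, and masked-image conditions
    into a single latent feature map. Channel counts are made up."""

    def __init__(self, glyph_ch=1, pos_ch=1, img_ch=4, out_ch=8):
        super().__init__()
        # A single conv stands in for the real glyph/position/fuse blocks.
        self.fuse = nn.Conv2d(glyph_ch + pos_ch + img_ch, out_ch,
                              kernel_size=3, padding=1)

    def forward(self, glyph, position, masked_latent):
        # Stack all conditions channel-wise, then project.
        cond = torch.cat([glyph, position, masked_latent], dim=1)
        return self.fuse(cond)


module = ToyAuxiliaryLatentModule()
z = module(torch.randn(1, 1, 64, 64),   # rendered glyph image
           torch.randn(1, 1, 64, 64),   # position mask
           torch.randn(1, 4, 64, 64))   # masked-image latent
print(z.shape)  # torch.Size([1, 8, 64, 64])
```

The real model fuses these conditions through dedicated blocks rather than one conv, but the input/output contract is the same: spatial conditions in, a latent feature map out.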
Paper: AnyText: Multilingual Visual Text Generation And Editing
Repository: https://github.com/tyxsspa/AnyText
Hugging Face Space: modelscope/AnyText
TODOs:
⏳ AuxiliaryLatentModule
:white_check_mark: AnyTextControlNetModel -> Inherited and adapted from ControlNetModel. The only difference is that it uses a Glyph Block, a Position Block, and a Fuse Block instead of the input_hint_block/controlnet_cond_embedding (i.e., ControlNetConditioningEmbedding) of an ordinary ControlNet. I deactivated the ControlNetConditioningEmbedding part and moved the new blocks into AuxiliaryLatentModule to match the paper's figure.
⏳ AnyTextPipeline -> Adapted from StableDiffusionControlNetPipeline.
⏳ TextEmbeddingModule -> Replaces the encode_prompt() function. I may transfer what TextEmbeddingModule does into encode_prompt().
:white_check_mark: convert_anytext_to_diffusers.py
⏳ Verify outputs with the original implementation
⏳ Finish HF integration & upload converted checkpoints to HF
⏳ README.md
:white_large_square: Make it as simple as possible, but not simpler
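The TextEmbeddingModule item above boils down to splicing OCR-derived glyph embeddings into the caption's token-embedding sequence at the placeholder positions. A toy sketch of that blending step (function name, shapes, and indices are hypothetical, not the actual diffusers or AnyText API):

```python
import torch


def blend_text_embeddings(caption_embeds, ocr_embeds, placeholder_idx):
    """Toy sketch: overwrite the embedding at each placeholder token
    position with the corresponding OCR-derived glyph embedding."""
    blended = caption_embeds.clone()
    for i, idx in enumerate(placeholder_idx):
        blended[:, idx] = ocr_embeds[:, i]
    return blended


# Shapes mimic a CLIP text encoder output: (batch, seq_len, hidden_dim).
caption = torch.zeros(1, 77, 768)   # caption token embeddings
ocr = torch.ones(1, 2, 768)         # one glyph embedding per text region
out = blend_text_embeddings(caption, ocr, placeholder_idx=[5, 9])
print(out[0, 5].sum().item(), out[0, 6].sum().item())  # 768.0 0.0
```

Folding this into encode_prompt() (as considered in the TODO) would keep the pipeline's public surface identical to StableDiffusionControlNetPipeline.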
The first results seem okayish...
prompt = 'photo of caramel macchiato coffee on the table, top-down perspective, with "Any" "Text" written on it using cream'
| Original Implementation | My Current Implementation |
|---|---|
| *(comparison image omitted)* | *(comparison image omitted)* |
I am still checking if there is something wrong.
Edit: There was indeed a mistake I made: I forgot to load the parameters for the linear layer at the top of the OCR model.
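For anyone debugging a similar conversion issue: silently dropped or missing weights are easy to surface by inspecting the result of load_state_dict instead of ignoring it. A minimal reproduction of the failure mode (toy models; not the actual OCR architecture):

```python
import torch.nn as nn


# Toy models: the "checkpoint" has a linear head that the
# target model forgets to define, mirroring the bug above.
class WithHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)
        self.head = nn.Linear(4, 2)


class WithoutHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)


state_dict = WithHead().state_dict()
# strict=False silently skips head.* -- check the returned keys
# (or use strict=True to raise an error immediately).
result = WithoutHead().load_state_dict(state_dict, strict=False)
print(result.unexpected_keys)  # ['head.weight', 'head.bias']
```

Checking missing_keys/unexpected_keys after every load in the conversion script catches this class of bug before any generation is run.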
Absolutely amazing work here @tolgacangoz :heart: Thank you for picking this up after a stream of contributors (including me) mentioned that they'd take it up but weren't able to! The PR looks mostly good to me but I'll wait for it to be unmarked from draft.
Really really cool work here! cc @sayakpaul
Thanks so much @a-r-r-o-w!
This PR is almost done. I am currently working on the Matryoshka model, which currently takes priority. Once it is complete, I will return to this PR immediately.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @tolgacangoz, are you still working on this?
Hi Álvaro, thanks for nudging me :) My priorities have had to change over the last 4-5 months. Starting tomorrow, I plan to complete this PR within 1 week.
This will be my second pipeline contribution, yay :partying_face:
Thanks a lot, it looks good to me. Really amazing project and port to diffusers, with good results.
ccing @a-r-r-o-w because of https://github.com/huggingface/diffusers/pull/8998#issuecomment-2308036940
Thanks for this opportunity to contribute!