
Added multitoken training for textual inversion. Issue 369

Open isamu-isozaki opened this issue 3 years ago • 6 comments

I used multiple tokens to represent a concept by adding num_vec_per_token tokens to the tokenizer, which can be initialized with the initial_token. The tokens are labeled placeholder_token_i, where i is the index of the token. This does make prompts more verbose, but it accomplishes the goal. An alternative is to do this in an inherited tokenizer class.
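For illustration, here is a minimal sketch of that setup, not the PR code itself; the checkpoint id is assumed and the subtoken naming just mirrors the description above:

from transformers import CLIPTokenizer, CLIPTextModel

def add_placeholder_tokens(tokenizer, text_encoder, placeholder_token, num_vec_per_token):
    # register placeholder_token_0 ... placeholder_token_{n-1} as new tokens
    placeholder_tokens = [f"{placeholder_token}_{i}" for i in range(num_vec_per_token)]
    num_added = tokenizer.add_tokens(placeholder_tokens)
    if num_added != num_vec_per_token:
        raise ValueError("Some placeholder tokens already exist in the tokenizer.")
    # make room for the new rows in the text encoder's input embedding matrix
    text_encoder.resize_token_embeddings(len(tokenizer))
    return placeholder_tokens, tokenizer.convert_tokens_to_ids(placeholder_tokens)

# usage (assumed checkpoint)
tokenizer = CLIPTokenizer.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="text_encoder")
placeholder_tokens, placeholder_token_ids = add_placeholder_tokens(tokenizer, text_encoder, "<frida>", num_vec_per_token=4)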

I also added multiple ways to initialize the tokens. For example, if initialize_rest_random is set to true and the initial token is "white dog", all the placeholder tokens before those are set to random noise. The default is to assign the initial tokens in a cyclic way: for example, if num_vec_per_token is 4 and initial_token is "white dog", the weights are initialized as "white dog white dog".
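Again just a sketch of the initialization logic as I read it, assuming placeholder_token_ids from the step above and initializer_token_ids obtained by encoding the initial token string without special tokens:

import torch

def initialize_placeholder_embeddings(text_encoder, placeholder_token_ids,
                                       initializer_token_ids, initialize_rest_random=False):
    token_embeds = text_encoder.get_input_embeddings().weight.data
    if initialize_rest_random:
        # keep the initial tokens for the last slots and fill the ones before them with noise
        offset = len(placeholder_token_ids) - len(initializer_token_ids)
        for i, placeholder_id in enumerate(placeholder_token_ids):
            if i < offset:
                token_embeds[placeholder_id] = torch.randn_like(token_embeds[placeholder_id])
            else:
                token_embeds[placeholder_id] = token_embeds[initializer_token_ids[i - offset]].clone()
    else:
        # default: cycle the initial tokens, e.g. "white dog" -> white dog white dog
        for i, placeholder_id in enumerate(placeholder_token_ids):
            init_id = initializer_token_ids[i % len(initializer_token_ids)]
            token_embeds[placeholder_id] = token_embeds[init_id].clone()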

isamu-isozaki avatar Sep 28 '22 03:09 isamu-isozaki

The documentation is not available anymore as the PR was closed or merged.

@patil-suraj Sounds good! I'll also fix the merge conflict now. And let me know if you want me to work on the tokenizer approach like the original implementation. I have a guess on how to do it!

isamu-isozaki avatar Sep 28 '22 16:09 isamu-isozaki

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Oct 28 '22 15:10 github-actions[bot]

@patil-suraj could you take a look here? :-)

patrickvonplaten avatar Oct 31 '22 18:10 patrickvonplaten

cc @patil-suraj here - let me know if you won't have time for this at the moment. I think this is still important though no?

patrickvonplaten avatar Nov 16 '22 07:11 patrickvonplaten

And then how do we use multitoken embeddings during inference? Do we have to explicitly prompt with "a holiday card with {frida0} {frida1} {frida2}", or is there a way to just prompt "with {frida}" and the tokenizer unpacks that to its three tokens?

keturn avatar Nov 30 '22 00:11 keturn

@keturn Thanks for the q! For now, I'm doing the "a holiday card with {frida0} {frida1} {frida2}" approach, but one thing I found in the original implementation is that you can concatenate the embeddings of the tokens into a single vector. So, for example, it can be used like that with automatic1111's API.

But to do this, I think the tokenizer for the text encoder in the transformers library might need some changes, or we might need to make a new tokenizer for it. Will try it out!

isamu-isozaki avatar Dec 01 '22 03:12 isamu-isozaki

So, for example,

# gather the learned embeddings for all placeholder tokens from the text encoder
learned_embeds = accelerator.unwrap_model(text_encoder).get_input_embeddings().weight[placeholder_token_ids]
# store them under the placeholder token name and save to disk
learned_embeds_dict = {}
learned_embeds_dict[args.placeholder_token] = learned_embeds.detach().cpu()
torch.save(learned_embeds_dict, os.path.join(args.output_dir, f"learned_embeds_{global_step}.bin"))

can be a file you can use for the latent-diffusion/automatic1111 repo.
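Building on the snippet above, learned_embeds already has shape [num_vec_per_token, embed_dim], i.e. the per-token vectors stacked into one tensor. Wrapping that into an automatic1111-style file might look roughly like this; the 'string_to_param' key and overall layout are an assumption on my part, not something verified here:

import os
import torch

a1111_embedding = {
    # key names are assumed; check against the webui's textual inversion loader
    "string_to_param": {"*": learned_embeds.detach().cpu()},
    "name": args.placeholder_token,
}
torch.save(a1111_embedding, os.path.join(args.output_dir, f"{args.placeholder_token}.pt"))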

isamu-isozaki avatar Dec 01 '22 03:12 isamu-isozaki

Hello, while working on a personal project I got an idea for how to do the tokenizer by inheritance, so I'll try pushing that today or tomorrow.

isamu-isozaki avatar Dec 21 '22 21:12 isamu-isozaki

Hi. I added a new multitokenCLIPTokenizer, inspired by @keturn's suggestion of switching between the single placeholder and its _0 _1 ... subtokens. It's still a work in progress, so I'll test it first. I also refactored the code a bit.
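Not the actual class from this PR, just a sketch of the inheritance idea: expand the single placeholder into its subtokens right before tokenization (the class name, helper names, and single-string assumption are mine):

from transformers import CLIPTokenizer

class MultiTokenCLIPTokenizer(CLIPTokenizer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.token_map = {}  # placeholder string -> list of subtoken strings

    def add_placeholder_tokens(self, placeholder_token, num_vec_per_token=1):
        subtokens = [f"{placeholder_token}_{i}" for i in range(num_vec_per_token)]
        self.add_tokens(subtokens)
        self.token_map[placeholder_token] = subtokens

    def replace_placeholder_tokens_in_text(self, text):
        # assumes text is a single string for simplicity
        for placeholder, subtokens in self.token_map.items():
            text = text.replace(placeholder, " ".join(subtokens))
        return text

    # the pipeline tokenizes via __call__ and the training script may use encode,
    # so both entry points go through the replacement
    def __call__(self, text, *args, **kwargs):
        return super().__call__(self.replace_placeholder_tokens_in_text(text), *args, **kwargs)

    def encode(self, text, *args, **kwargs):
        return super().encode(self.replace_placeholder_tokens_in_text(text), *args, **kwargs)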

isamu-isozaki avatar Dec 24 '22 04:12 isamu-isozaki

ok. Currently a bit buggy but fixing.

isamu-isozaki avatar Dec 25 '22 05:12 isamu-isozaki

Ok! Fixed the bug. I'll post some results here once I'm done training some examples.

isamu-isozaki avatar Dec 25 '22 21:12 isamu-isozaki

Hi, could this be used to load automatic1111 embeddings into a diffusers SD model? I was trying to achieve this with an extra tokenizer class similar to multitokenCLIPTokenizer and by adding the loaded embeddings to the text_encoder. However, I get black images on stable-diffusion-2-1, and images on stable-diffusion-2-1-base are nothing like the previews from the textual inversion embedding I loaded. Am I missing something?

pkurz3nd avatar Dec 30 '22 11:12 pkurz3nd

@pkurz3nd Interesting. Can you show some examples? I heard one issue is that in automatic1111 you might be using a different VAE, like anything-v3-vae. I'll try adding a load script today if you want to test it out!

isamu-isozaki avatar Dec 30 '22 14:12 isamu-isozaki

Hi, thanks for the quick reply. Here is my notebook: https://colab.research.google.com/drive/1faGxmnfs3kARqa7GTmGiWKsqd-cPAE1v?usp=sharing

This is the embedding I was testing with: https://huggingface.co/spaablauw/FloralMarble

When I load it with torch, it says: 'sd_checkpoint': '4bdfc29c', 'sd_checkpoint_name': 'SD2.1-768'

It would be awesome if you could check out the script; I'll also keep trying.

pkurz3nd avatar Dec 30 '22 15:12 pkurz3nd

sounds good lemme check

isamu-isozaki avatar Dec 30 '22 15:12 isamu-isozaki

@pkurz3nd Ok, so I checked your code. This might not be the sole issue, but I don't think you assign the loaded embedding matrix to the text encoder's embeddings, so the new tokens would probably just have random embeddings. Will make a load method and test it out on my end!
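A rough sketch of what such a load method could look like, assuming the file is one of the learned_embeds_*.bin dicts saved earlier (placeholder string -> [num_vec, dim] tensor); the function name and subtoken naming are mine:

import torch

def load_multitoken_embedding(path, tokenizer, text_encoder):
    learned_embeds_dict = torch.load(path, map_location="cpu")
    for placeholder_token, embeds in learned_embeds_dict.items():
        if embeds.dim() == 1:
            embeds = embeds.unsqueeze(0)  # single-vector embedding
        subtokens = [f"{placeholder_token}_{i}" for i in range(embeds.shape[0])]
        tokenizer.add_tokens(subtokens)
        text_encoder.resize_token_embeddings(len(tokenizer))
        # copy the loaded vectors into the text encoder's input embedding matrix
        token_embeds = text_encoder.get_input_embeddings().weight.data
        for token_id, embed in zip(tokenizer.convert_tokens_to_ids(subtokens), embeds):
            token_embeds[token_id] = embed.to(token_embeds.dtype)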

isamu-isozaki avatar Dec 30 '22 17:12 isamu-isozaki

Sorry, you are right, I somehow stopped writing on the very last line haha.

This is the updated notebook: https://colab.research.google.com/drive/1-U0IdnYDGI_MJV6uC89VVhtTN51F0S_L?usp=sharing

Works like a charm now with the stabilityai/stable-diffusion-2-1-base version.

With the stabilityai/stable-diffusion-2-1 version I still get black images, but that's a different problem.

Thanks for looking into it.

pkurz3nd avatar Dec 30 '22 18:12 pkurz3nd

No worries! Looks awesome btw. For a blank image, it might be either a safety checker problem, which you can disable by setting safety_checker=None, or an fp16 problem, but I don't think you are using fp16.
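For reference, disabling the safety checker at load time looks roughly like this (model id assumed):

from diffusers import StableDiffusionPipeline

# safety_checker=None skips the NSFW filter that otherwise replaces flagged images with black ones
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", safety_checker=None
)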

isamu-isozaki avatar Dec 30 '22 18:12 isamu-isozaki

I think I recreated what you did in this colab notebook tho I'm not able to get as good results as you

isamu-isozaki avatar Dec 30 '22 21:12 isamu-isozaki

Found a bit of a bug in training. Fixed it and retraining now. I'll change this PR to WIP for a bit while I confirm there aren't any more bugs.

isamu-isozaki avatar Dec 30 '22 21:12 isamu-isozaki

I was running your notebook and got an error that text in tokenizer.encode(...) is None, and I think there is a bug in the very last line of the load_multitoken_tokenizer function:

        for i, placeholder_token_id in enumerate(placeholder_token_ids):
            token_embeds[placeholder_token_id] =  token_embeds[i] 

pkurz3nd avatar Dec 30 '22 22:12 pkurz3nd

thanks for the bug check! will fix in a bit

isamu-isozaki avatar Dec 30 '22 22:12 isamu-isozaki

@pkurz3nd should be fixed!

isamu-isozaki avatar Dec 31 '22 00:12 isamu-isozaki

Currently working on a weird bug where, regardless of what initializer token I set for training, I get various variations of this lady (sample image attached). This has never happened before, but I thought it was pretty interesting, so I put it here.

isamu-isozaki avatar Dec 31 '22 16:12 isamu-isozaki

That's Frida Kahlo, which makes some sense if you're trying to train the name Frida.

keturn avatar Dec 31 '22 17:12 keturn

@keturn Ohhhh so the tokenizer isn't replacing the texts properly. Thanks a bunch

isamu-isozaki avatar Dec 31 '22 17:12 isamu-isozaki

The issue turned out to be that I was overriding the encode method but not the __call__ method. So the model was learning but logging the wrong image, since the pipeline uses __call__. Will fix it in a bit!

isamu-isozaki avatar Dec 31 '22 17:12 isamu-isozaki

Ok! Should be fixed. I'll start writing some tests in the Hugging Face test suite for cleaner code.

isamu-isozaki avatar Dec 31 '22 18:12 isamu-isozaki

While I'm doing this PR, I think I'll add progressive tokens from latent-diffusion. The basic idea, if I understand correctly, is that you first train on one token, then add another, and another, etc. I'm curious how it'll do compared to training from scratch on, say, 10 tokens.
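I haven't looked at how latent-diffusion implements this, so purely as a sketch of the idea (the linear schedule below is my assumption, not theirs): unlock one extra placeholder vector at a time and only put the active ones in the training prompt.

def num_active_vectors(global_step, max_train_steps, num_vec_per_token):
    # grow the number of trainable placeholder vectors linearly over training
    steps_per_token = max(1, max_train_steps // num_vec_per_token)
    return min(num_vec_per_token, global_step // steps_per_token + 1)

# e.g. with 10 vectors and 3000 steps: 1 vector for steps 0-299, 2 for steps 300-599, ...
placeholder_token = "<frida>"  # from the training config
active = num_active_vectors(global_step=450, max_train_steps=3000, num_vec_per_token=10)
prompt_suffix = " ".join(f"{placeholder_token}_{i}" for i in range(active))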

isamu-isozaki avatar Dec 31 '22 19:12 isamu-isozaki