Is it possible to use ViT-B/32 instead of ViT-L/14 for SD fine-tuning with DreamBooth?
I am wondering if I can change the default CLIP model used for training, and if so, how?
@yiyixuxu could you take a look here? :-)
Hi @stpg06:
If you want to experiment with a different text encoder, you could modify this part of the training script: https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py#L605
text_encoder = text_encoder_cls.from_pretrained(...)
YiYi
Just to add one more comment here, @stpg06: note that DreamBooth fine-tunes an already trained checkpoint. If that checkpoint was trained with a ViT-L/14 text encoder, you will probably get bad results when swapping it out for another one (ViT-B/32), because the UNet has never been trained on embeddings from that encoder.
Long story short, I don't think it makes much sense to swap text encoders for DreamBooth; for text-to-image training, however, it could make a lot of sense :-)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.