Is it possible to use ViT-B/32 instead of ViT-L/14 for SD fine-tuning with DreamBooth?
I am wondering if I can change the default CLIP model used for training, and if so, how?
@yiyixuxu could you take a look here? :-)
Hi @stpg06:
If you want to experiment with a different text encoder, you could modify this part of the training script: https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py#L605
text_encoder = text_encoder_cls.from_pretrained(...)
YiYi
Just to add one more comment here, @stpg06: note that DreamBooth fine-tunes an already trained checkpoint. If that checkpoint was trained with a ViT-L/14 text encoder, you will probably get bad results when swapping it out for another one (ViT-B/32), because the UNet has never been trained on embeddings from that encoder.
Long story short, I don't think it makes much sense to swap text encoders for DreamBooth; for text-to-image training, however, it could make a lot of sense :-)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.