If you want to generate something like "an image of A and B shaking hands" after fine-tuning the model on photos of A and B, then I think a revision of [pipeline_stable_diffusion_promptnet.py](https://github.com/drboog/ProFusion/blob/main/diffusers/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_promptnet.py)...
I may update the code later, but I'm not sure how well it will perform.
I wrote an implementation and tested it on my local machine. Unfortunately, the performance is not satisfactory. For example, when we ask it to generate a photo of A and...
1. You won't find an attention mask in the diffusers example because it is already handled inside the OpenCLIP text encoder. Please check the original code of the text encoder in the Hugging Face transformers repo....
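As a minimal sketch of what I mean (assuming the standard SD v2 checkpoint layout on the Hub; the prompt and model id are just for illustration): the tokenizer already returns an attention mask, and the transformers text encoder accepts it directly in its forward call, so nothing extra is needed on the diffusers side.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Illustrative checkpoint; any SD checkpoint with a transformers text encoder works the same way.
model_id = "stabilityai/stable-diffusion-2-1"
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

tokens = tokenizer(
    "a photo of a person",
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    out = text_encoder(
        input_ids=tokens.input_ids,
        attention_mask=tokens.attention_mask,  # the mask is consumed inside the encoder
    )
print(out.last_hidden_state.shape)  # (1, 77, hidden_size)
```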
Yes, given a different pre-trained text-to-image generation model, you need to retrain the encoder. If you do not want to train the encoder, you can also try some super-resolution...
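For example, one way to upscale the generated output without retraining anything is to run a separate super-resolution model on it. The sketch below uses diffusers' `StableDiffusionUpscalePipeline` with the `stabilityai/stable-diffusion-x4-upscaler` checkpoint purely as an illustration; this is my own choice of upscaler, not necessarily the one I had in mind above, and any SR model would be used the same way.

```python
import torch
from diffusers import StableDiffusionUpscalePipeline
from PIL import Image

upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

# "generated.png" stands in for whatever the base model produced.
low_res = Image.open("generated.png").convert("RGB").resize((128, 128))
high_res = upscaler(prompt="a photo of a person", image=low_res).images[0]
high_res.save("generated_x4.png")
```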
Yes, you need to change the CLIP model, and also change some dimension arguments. For example, because SD v1.5 and SD v2 are based on different text encoders, you need...
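For reference, the text-encoder hidden sizes of the two checkpoints differ, which is where the dimension arguments come from. A quick check (assuming the standard `runwayml/stable-diffusion-v1-5` and `stabilityai/stable-diffusion-2-1` layouts on the Hub):

```python
from transformers import CLIPTextModel

enc_v15 = CLIPTextModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="text_encoder")
enc_v2 = CLIPTextModel.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="text_encoder")

print(enc_v15.config.hidden_size)  # 768  (CLIP ViT-L/14)
print(enc_v2.config.hidden_size)   # 1024 (OpenCLIP ViT-H/14)
```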
I didn't save the weights of that fine-tuned model, but you can try to fine-tune the model yourself. You can try a slightly larger batch size and more iteration steps...
Hi, when you fine-tune with multiple ground-truth images, be careful about the mapping during training, i.e. the images should first be divided into different groups, X = {x_0, x_1, ..., x_n},...
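As a rough illustration of what "divided into different groups" means here (the group names and the pairing rule below are my own assumed setup, not the exact logic in train.py): keep each identity's images in its own group and always draw the reference image and the target image from the same group.

```python
import random

# Hypothetical per-identity groups; file names are placeholders.
groups = {
    "identity_A": ["a_0.png", "a_1.png", "a_2.png"],
    "identity_B": ["b_0.png", "b_1.png"],
}

def sample_pair(groups):
    # Pick one group, then a reference and a target from that same group,
    # so an image of A is never mapped to a target of B during fine-tuning.
    name = random.choice(list(groups))
    imgs = groups[name]
    ref, target = random.sample(imgs, 2) if len(imgs) > 1 else (imgs[0], imgs[0])
    return name, ref, target

print(sample_pair(groups))
```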
You can either use train.py or test.ipynb, but both need to be revised. If you want to use train.py for this experiment:
- Because UNet is not fine-tuned in train.py,...