Jorge C. Gomes
An example of fine-tuning FLAVA or any VLP multimodal model using the Trainer (for example, for classification)
The issue was automatically marked as closed, but there aren't yet any resources on how to fine-tune FLAVA. Neither of the links posted above by @NielsRogge has instructions on fine-tuning....
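For what it's worth, here is a minimal, hypothetical sketch of what this could look like with the Trainer, assuming the `facebook/flava-full` checkpoint, pooling the first token of the multimodal encoder, and a made-up linear head. `FlavaForClassification` and `DummyDataset` are illustrative names, not part of transformers, and this is not an official recipe:

```python
import torch
import torch.nn as nn
from transformers import FlavaModel, Trainer, TrainingArguments


class FlavaForClassification(nn.Module):
    """Hypothetical wrapper: FLAVA backbone + a linear classification head."""

    def __init__(self, num_labels: int):
        super().__init__()
        self.flava = FlavaModel.from_pretrained("facebook/flava-full")
        # FLAVA's multimodal encoder uses hidden size 768.
        self.classifier = nn.Linear(768, num_labels)

    def forward(self, input_ids=None, attention_mask=None, pixel_values=None, labels=None):
        outputs = self.flava(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pixel_values=pixel_values,
        )
        # Pool the first ([CLS]-like) token of the multimodal encoder output.
        pooled = outputs.multimodal_embeddings[:, 0]
        logits = self.classifier(pooled)
        loss = nn.functional.cross_entropy(logits, labels) if labels is not None else None
        # Returning a dict with "loss" is enough for the Trainer.
        return {"loss": loss, "logits": logits}


class DummyDataset(torch.utils.data.Dataset):
    """Stand-in data; in practice, map FlavaProcessor over (image, text) pairs."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return {
            "input_ids": torch.randint(0, 1000, (32,)),
            "attention_mask": torch.ones(32, dtype=torch.long),
            "pixel_values": torch.randn(3, 224, 224),
            "labels": torch.tensor(idx % 2),
        }


args = TrainingArguments(
    output_dir="flava-cls",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    report_to="none",
)
trainer = Trainer(model=FlavaForClassification(num_labels=2), args=args, train_dataset=DummyDataset())
trainer.train()
```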
It is a bit puzzling to me why T5 would be a crucial component of Imagen. It probably isn't; it's probably just the size and embedding dimension of the text encoder that matter....
> * Can we modify the space of captions and still get good results? It's hard to generate captions for images, but it's easy to (for example) generate tags for...
I had to compile it manually (using the provided script) because, for some reason, it wasn't being compiled on demand and I got an error about missing libraries. In any case,...
The negative prompt is simply the prompt that is used for the "unconditional" generation in Classifier-Free Guidance. In this implementation it is hardcoded to be an empty string (or rather,...
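For concreteness, a minimal sketch of where the negative prompt enters the denoising loop, assuming a diffusers-style UNet call (`cfg_noise_prediction` is an illustrative helper, not part of the library):

```python
import torch


def cfg_noise_prediction(unet, latents, t, cond_emb, uncond_emb, guidance_scale):
    """Combine conditional and 'unconditional' noise predictions.

    uncond_emb is the text embedding of the negative prompt
    ("" when none is given).
    """
    # Run both branches in a single batched forward pass, as diffusers does.
    latent_in = torch.cat([latents, latents])
    emb_in = torch.cat([uncond_emb, cond_emb])
    noise_uncond, noise_cond = unet(latent_in, t, encoder_hidden_states=emb_in).sample.chunk(2)
    # Push the prediction away from the negative prompt, toward the prompt.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```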
The clip-vit-large-patch14 (https://huggingface.co/openai/clip-vit-large-patch14) model used by SD can only handle sequences of up to 77 tokens. It works like that in the original PyTorch implementation as well. Anything longer than that gets...
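A quick way to see the limit with the transformers tokenizer for that checkpoint (the 200-word prompt is just an illustration):

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
print(tokenizer.model_max_length)  # 77

long_prompt = " ".join(["word"] * 200)
ids = tokenizer(
    long_prompt,
    truncation=True,
    max_length=tokenizer.model_max_length,
    return_tensors="pt",
).input_ids
print(ids.shape)  # torch.Size([1, 77]) -- everything past 77 tokens is dropped
```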
Just leaving a brief report of my findings with PAG and Diffusers (I had already integrated it into my pipelines before this PR; a usage sketch follows below):
- It generally works very, very well...
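For reference, a usage sketch assuming the integration ends up exposed through `AutoPipelineForText2Image` with `enable_pag`, as in the diffusers PAG docs; names like `pag_scale` and `pag_applied_layers` are taken from there and may differ in the final API:

```python
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    enable_pag=True,
    pag_applied_layers=["mid"],  # which attention blocks get perturbed
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a photo of an astronaut riding a horse",
    guidance_scale=7.0,  # regular CFG scale
    pag_scale=3.0,       # strength of perturbed-attention guidance
).images[0]
```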
Looks very interesting. Could this theoretically be used in the opposite way, to generate smaller images? When trying to generate small images with models that have been trained with high-res...
@Abhinay1997 FYI, some findings based on my own experiments with Tune-A-Video (the prior preservation loss is sketched below):
1. Using prior preservation loss (as implemented in https://github.com/bryandlee/Tune-A-Video/blob/main/train.py) helps a lot with the relevance of the output videos...
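A minimal sketch of the prior preservation idea, assuming a DreamBooth-style setup where instance and prior-class samples are stacked in one batch; `prior_preservation_loss` and its arguments are illustrative, not the exact code from the linked train.py:

```python
import torch
import torch.nn.functional as F


def prior_preservation_loss(model_pred, target, prior_loss_weight=1.0):
    """model_pred/target are stacked [instance; prior] along the batch dim."""
    # Split the batch back into instance and prior-class halves.
    pred_instance, pred_prior = model_pred.chunk(2, dim=0)
    target_instance, target_prior = target.chunk(2, dim=0)

    # Standard denoising MSE on the instance data, plus a weighted MSE on
    # samples generated by the original model, which anchors the class prior.
    instance_loss = F.mse_loss(pred_instance, target_instance)
    prior_loss = F.mse_loss(pred_prior, target_prior)
    return instance_loss + prior_loss_weight * prior_loss
```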
@sayakpaul Yes, definitely. I'll keep an eye on the PR that @Abhinay1997 will open 👍