CLIP-T, CLIP-I implementation
Dear author,
Thanks so much for your great contribution to the community. Recent SD benchmark papers often measure subject fidelity with CLIP-I and DINO, and prompt fidelity with either CLIP-T or CoCa. Do you have any plans to implement these metrics? Even a few insights would be valuable. Once again, thank you!
Hi,
Thank you for your appreciation of this repository!
Recently, I've received several requests from the community. Unfortunately, I'm quite busy at the moment, but I hope to make these improvements within the next month.
If you'd like to implement this on your own, please check this line. You can replace the model with your desired encoder and pay attention to the function call at this line.
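If it helps, here is a minimal sketch of both metrics, assuming the Hugging Face `transformers` CLIP implementation rather than this repo's loader; the checkpoint name and the helper functions `clip_t`/`clip_i` are illustrative, not part of this repository:

```python
# Minimal sketch of CLIP-T (prompt fidelity) and CLIP-I (subject fidelity).
# Assumes the Hugging Face `transformers` CLIP model; swap in any encoder you like.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # illustrative checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_t(image: Image.Image, prompt: str) -> float:
    """CLIP-T: cosine similarity between the image and prompt embeddings."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()

@torch.no_grad()
def clip_i(generated: Image.Image, reference: Image.Image) -> float:
    """CLIP-I: cosine similarity between the embeddings of two images."""
    inputs = processor(images=[generated, reference], return_tensors="pt")
    emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(emb[0:1], emb[1:2]).item()
```

In benchmark papers these scores are usually averaged over all generated images (and, for CLIP-I, over all pairs of generated and real subject images).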
Also, pull requests are always welcome!
Hi, I’m currently available to update this repo. I was wondering if it would be possible to provide model links for CLIP-T and CLIP-I?
Hello, author. Haven't you already implemented the code for calculating CLIP-I, as mentioned in the README? Why would a re-implementation be needed? Or am I misunderstanding the difference between the CLIP score and CLIP-I?
Hi, I misunderstood: I thought CLIP-I and CLIP-T referred to different transformer settings. The current release already supports calculating CLIP scores for text-image, text-text, and image-image pairs.
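As a side note on the DINO metric mentioned in the original question, which CLIP embeddings don't cover: here is a minimal sketch of DINO-based subject fidelity, assuming the public facebookresearch/dino torch.hub entry point (the helper `dino_similarity` is illustrative, not part of this repo):

```python
# Minimal sketch of DINO-based subject fidelity.
# Assumes the self-supervised ViT from facebookresearch/dino via torch.hub.
import torch
import torchvision.transforms as T
from PIL import Image

dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
dino.eval()

# Standard ImageNet preprocessing used with DINO checkpoints.
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def dino_similarity(generated: Image.Image, reference: Image.Image) -> float:
    """Cosine similarity between DINO [CLS] embeddings of two images."""
    batch = torch.stack([preprocess(generated.convert("RGB")),
                         preprocess(reference.convert("RGB"))])
    emb = dino(batch)  # forward pass returns the [CLS] token features
    return torch.nn.functional.cosine_similarity(emb[0:1], emb[1:2]).item()
```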