CLIP-T, CLIP-I implementation
Dear author,
Thanks so much for your great contribution to the community. Recent SD benchmark papers often measure subject fidelity with CLIP-I and DINO, and prompt fidelity with either CLIP-T or CoCa. Do you have any plans to implement these metrics? Even a few insights would be valuable. Once again, thank you!
Hi,
Thank you for your appreciation of this repository!
Recently, I've received several requests from the community. Unfortunately, I'm quite busy at the moment, but I hope to make these improvements within the next month.
If you'd like to implement this on your own, please check this line. You can replace the model with your desired encoder and pay attention to the function call at this line.
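If it helps, here is a minimal sketch of both metrics, assuming the Hugging Face `transformers` CLIP implementation rather than this repo's loader; the checkpoint name and the helper functions `clip_t`/`clip_i` are illustrative, not part of this repository:

```python
# Minimal sketch of CLIP-T (prompt fidelity) and CLIP-I (subject fidelity).
# Assumes the Hugging Face `transformers` CLIP model; swap in any encoder you like.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # illustrative checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_t(image: Image.Image, prompt: str) -> float:
    """CLIP-T: cosine similarity between the image and prompt embeddings."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()

@torch.no_grad()
def clip_i(generated: Image.Image, reference: Image.Image) -> float:
    """CLIP-I: cosine similarity between the embeddings of two images."""
    inputs = processor(images=[generated, reference], return_tensors="pt")
    emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(emb[0:1], emb[1:2]).item()
```

In benchmark papers these scores are usually averaged over all generated images (and, for CLIP-I, over all pairs of generated and real subject images).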
Also, pull requests are always welcome!
Hi, I’m currently available to update this repo. I was wondering if it would be possible to provide model links for CLIP-T and CLIP-I?
Hello, author. Haven't you already implemented the code for calculating CLIP-I, as mentioned in the README? Why would a re-implementation be needed? Or am I misunderstanding the difference between the CLIP score and CLIP-I?
Hi, I misunderstood: I thought CLIP-I and CLIP-T referred to different transformer settings. The current release already supports calculating CLIP scores for text-image, text-text, and image-image pairs.
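As a side note on the DINO metric mentioned in the original question, which CLIP embeddings don't cover: here is a minimal sketch of DINO-based subject fidelity, assuming the public facebookresearch/dino torch.hub entry point (the helper `dino_similarity` is illustrative, not part of this repo):

```python
# Minimal sketch of DINO-based subject fidelity.
# Assumes the self-supervised ViT from facebookresearch/dino via torch.hub.
import torch
import torchvision.transforms as T
from PIL import Image

dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
dino.eval()

# Standard ImageNet preprocessing used with DINO checkpoints.
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def dino_similarity(generated: Image.Image, reference: Image.Image) -> float:
    """Cosine similarity between DINO [CLS] embeddings of two images."""
    batch = torch.stack([preprocess(generated.convert("RGB")),
                         preprocess(reference.convert("RGB"))])
    emb = dino(batch)  # forward pass returns the [CLS] token features
    return torch.nn.functional.cosine_similarity(emb[0:1], emb[1:2]).item()
```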