Results of T2I-Compbench
Hello, I'm interested in your excellent work. I evaluated your method on T2I-Compbench, and the results are far from those reported in the paper. Could something have gone wrong on my end?
Here are the implementation details:
- First, I obtained the layouts from GPT-4.
- I evaluated on color_val.txt of T2I-Compbench, which contains 300 prompts (using the BLIP-VQA method and `--np_num 8` by default).
- I only got 39.84 on the attribute score, but the result is 93 in your paper.
Could you please offer the layout file you use for T2I-Compbench? Or could you please tell me if something is wrong?
Thank you for your interest. Could you please upload the generated images to cloud storage and share them with us, so that we can test them again?
Thanks very much! Please wait a moment.
Here are the generated images: https://drive.google.com/file/d/1IEi-aQ_WkpP7SQcLeQiuOgxHE68hzh1A/view?usp=drive_link Could you check whether it downloads and opens successfully? The zip file contains the following subfolders:
- `raw_image/`: images evaluated on the color split of T2I-Compbench
- `vis_layout/`: images with bounding-box layouts drawn for visualization
- `annotation_blip/`: evaluation details for the color split of T2I-Compbench
The link indicates that I need to have access permission. I have already requested permission from you using my email. If you are unable to grant access, you can send me a compressed file to my email: [email protected]
I have granted the access. Thanks!
Thank you for your question. Due to the nature of this benchmark, different random seeds have a significant impact on the evaluation results. In the original T2I-Compbench code, ten images are generated per prompt for complex prompts; we believe the same protocol should also be applied to simple prompts. Therefore, for the reported results, we ran 10 repeated experiments for each prompt. You can vary the seed for each prompt to run multiple experiments and average the results.
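The averaging protocol described above can be sketched as follows. This is only an illustration of the aggregation step: the scores would come from the BLIP-VQA evaluation, and the dummy numbers here are placeholders.

```python
from statistics import mean

def aggregate(scores_per_seed):
    """Average the benchmark score over repeated runs with different seeds.

    scores_per_seed[i][j] is the score of prompt j in the run with seed i.
    """
    run_means = [mean(run) for run in scores_per_seed]  # one score per seed
    return mean(run_means)                              # final reported number

# toy example: 3 seeds x 2 prompts (dummy scores, for illustration only)
dummy = [[0.4, 0.6], [0.5, 0.5], [0.3, 0.7]]
print(aggregate(dummy))  # 0.5
```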
Thanks for your answer! Do you mean I need to generate 10 images for each prompt using RealComp with different random seeds?
Yes, that's right.
Hi @Cominclip ,
I was testing the model on the benchmark and found the same discrepancy as @AdventureStory. I tried different seeds based on your last comment, and the results are still the same. The number quoted in the paper for the Color category is 0.774.
My results are:
Used GPT-4 to generate layouts.
| Seed | Score |
|---|---|
| 0 | 0.451 |
| 117 | 0.383 |
| 393 | 0.348 |
| 423 | 0.434 |
| 486 | 0.391 |
| 700 | 0.404 |
| 717 | 0.360 |
The average of the 7 runs is 0.395, which is far from the reported number.
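For reference, the average can be recomputed directly from the table above:

```python
from statistics import mean

# per-seed Color scores from the table above
scores = [0.451, 0.383, 0.348, 0.434, 0.391, 0.404, 0.360]
print(round(mean(scores), 4))  # 0.3959 (quoted above as ~0.395)
```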
Could the authors share their generated images, or the exact procedure they used to obtain these numbers?