
Results from the FashionIQ dataset

Open Y111555 opened this issue 9 months ago • 2 comments

I used OpenAI's CLIP ViT-B/32 to evaluate on FashionIQ's validation set. The results I obtained are very different from those reported in the paper. Are there any tricks I might be missing? Were the results in the paper actually obtained with OpenAI's CLIP ViT-B/32? They seem closer to the OpenCLIP results.

Results from OpenAI CLIP: dress_Recall@1 = 3.47, dress_Recall@5 = 9.87, dress_Recall@10 = 14.53, dress_Recall@50 = 33.22

Results from OpenCLIP: dress_Recall@1 = 7.44, dress_Recall@5 = 18.74, dress_Recall@10 = 25.33, dress_Recall@50 = 46.50

Y111555 avatar Apr 16 '25 04:04 Y111555
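The gap above most likely comes down to which pretrained weights back the same ViT-B/32 architecture. Below is a minimal sketch, not code from this repository, of loading both variants through the open_clip package; the laion2b_s34b_b79k tag is only an assumed example of an OpenCLIP checkpoint, and the exact tag behind the paper's numbers is not confirmed here.

```python
# Minimal sketch: same ViT-B/32 architecture, two different pretraining sources.
# Checkpoint tags are illustrative; check the repo's config for the actual ones.
import torch
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# OpenAI weights for ViT-B/32, served through open_clip:
model_openai, _, preprocess_openai = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)

# A LAION-trained OpenCLIP checkpoint for the same architecture
# (assumed here to be the kind of checkpoint the reply below refers to):
model_laion, _, preprocess_laion = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Same architecture, different pretraining data -> different image/text features,
# which is why the two Recall@k rows above diverge.
```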

Hello, thanks for your interest in our work! I just regenerated the results on the different benchmarks and realized that for Fashion-IQ they do indeed seem to be based on the OpenCLIP series of models (your results look very similar to what I am getting with the older gpt-3.5-turbo generated captions). I'm sorry for the confusion, and I hope this helps.

sgk98 avatar Apr 17 '25 12:04 sgk98
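For reference, the Recall@k numbers quoted in this thread are typically computed as the fraction of queries whose target image appears among the top-k retrieved gallery items. A generic sketch of that metric, not the repository's own evaluation code:

```python
# Generic Recall@k sketch: given cosine similarities between query features and
# gallery features, a query counts as a hit if its target gallery index appears
# among the top-k retrieved items.
import torch

def recall_at_k(similarities: torch.Tensor, target_indices: torch.Tensor, k: int) -> float:
    """similarities: (num_queries, num_gallery); target_indices: (num_queries,)."""
    topk = similarities.topk(k, dim=1).indices              # (num_queries, k)
    hits = (topk == target_indices.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item() * 100                 # percentage, as quoted above

# Usage, mirroring the k values reported for the dress split:
# for k in (1, 5, 10, 50):
#     print(f"dress_Recall@{k} = {recall_at_k(sims, targets, k):.2f}")
```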

OK, thank you for the clarification.

Y111555 avatar Apr 17 '25 12:04 Y111555