About evaluation scores on VQA v2.0 dataset
Hello, thanks for your nice work! I am having trouble reproducing the reported score on the VQA task. I evaluated the checkpoint downloaded from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_vqa_capfilt_large.pth and followed the default settings in ./config/vqa.yaml. However, when I evaluated the generated results on the official server, I only got 77.44 on test-dev, which is significantly lower than the 78.25 reported in the paper. Are there any possible reasons for this performance gap? Thanks!
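For reference, this is the sanity check I run on the generated result file before uploading. The path below is a placeholder for wherever my run writes the results, and I am assuming the standard VQA submission format of a JSON list of {"question_id": ..., "answer": ...} entries:

```bash
# Placeholder path -- adjust to wherever your evaluation run writes the result file.
# The official server expects a JSON list of {"question_id": ..., "answer": ...} entries.
python -c "import json; r = json.load(open('output/VQA/result/vqa_result.json')); print(len(r)); print(r[0])"
```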
Hi, you should be able to get the reported result by running python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --evaluate. It will automatically take care of model downloading.
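For convenience, the same command as a code block (adjust --nproc_per_node to the number of GPUs available; the setting below is the one used for the reported result):

```bash
# Evaluate the fine-tuned VQA model; the checkpoint is downloaded automatically.
python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --evaluate
```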
@sdc17 @LiJunnan1992 Any updates on this issue? I have tried both evaluating the released VQA checkpoints (BLIP w/ ViT-B, as well as BLIP w/ ViT-B and CapFilt-L) and fine-tuning from the pre-trained weights (129M); all three experiments give around 77.4x on test-dev. Is this expected?
@lorenmt I still haven't found what the problem is, but it can be seen from the evaluation server that BLIP did achieve the reported score.
@sdc17 Thanks for the update. At least we have confirmed that our reproduced results are consistent with each other.
@lorenmt @sdc17 It has been reported by others that a discrepancy in PyTorch versions can lead to different evaluation results. Could you let me know whether your PyTorch version is 1.10? If not, could you try running the evaluation with PyTorch 1.10?
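A quick way to check the version in the environment you used for evaluation:

```bash
python -c "import torch; print(torch.__version__)"
```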
Thanks for the reminder! I obtained the results with PyTorch 1.11; I will try 1.10 later.
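In case it helps anyone else, this is the downgrade I plan to try; the exact patch version and the matching torchvision build are assumptions that may need adjusting for your CUDA setup:

```bash
# Downgrade to a PyTorch 1.10.x build (torchvision pinned to the matching release).
pip install torch==1.10.2 torchvision==0.11.3
```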