About evaluation scores on VQA v2.0 dataset
Hello, thanks for your nice work! I am having trouble reproducing the reported score on the VQA task. I evaluated the checkpoint downloaded from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_vqa_capfilt_large.pth and followed the default settings in ./config/vqa.yaml. However, when I evaluated the generated results on the official server, I only got 77.44 on test-dev, which is significantly lower than the 78.25 reported in the paper. Are there any possible reasons for this performance gap? Thanks!
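For reference, this is the sanity check I run on the generated result file before uploading. The path below is a placeholder for wherever my run writes the results, and I am assuming the standard VQA submission format of a JSON list of {"question_id": ..., "answer": ...} entries:

```bash
# Placeholder path -- adjust to wherever your evaluation run writes the result file.
# The official server expects a JSON list of {"question_id": ..., "answer": ...} entries.
python -c "import json; r = json.load(open('output/VQA/result/vqa_result.json')); print(len(r)); print(r[0])"
```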
Hi, you should be able to get the reported result by running python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --evaluate. It will automatically take care of model downloading.
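For convenience, the same command as a code block (adjust --nproc_per_node to the number of GPUs available; the setting below is the one used for the reported result):

```bash
# Evaluate the fine-tuned VQA model; the checkpoint is downloaded automatically.
python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --evaluate
```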
@sdc17 @LiJunnan1992 Any updates on this issue? I have tried both evaluating the released VQA checkpoints (BLIP w/ ViT-B, as well as BLIP w/ ViT-B and CapFilt-L) and fine-tuning from the pre-trained weights (129M); all three experiments give around 77.4x on test-dev. Is this expected?
@lorenmt I still haven't found what the problem is, but it can be seen from the evaluation server that BLIP did achieve the reported score.
@sdc17 Thanks for the update. At least we have confirmed that our reproduced results are consistent with each other.
@lorenmt @sdc17 It has been reported by others that a discrepancy in PyTorch versions can lead to different evaluation results. Could you let me know whether your PyTorch version is 1.10? If not, could you try running the evaluation with PyTorch 1.10?
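A quick way to check the version in the environment you used for evaluation:

```bash
python -c "import torch; print(torch.__version__)"
```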
Thanks for the reminder! I obtained the results with PyTorch 1.11; I will try 1.10 later.
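In case it helps anyone else, this is the downgrade I plan to try; the exact patch version and the matching torchvision build are assumptions that may need adjusting for your CUDA setup:

```bash
# Downgrade to a PyTorch 1.10.x build (torchvision pinned to the matching release).
pip install torch==1.10.2 torchvision==0.11.3
```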