How to reproduce the results on NQ
Hi, @luyug
Thanks for your awesome work. Is it possible to give more details on reproducing the NQ results (MRR@5 = 84.3) from the paper, similar to the detailed MS MARCO tutorial demo?
Looking forward to your reply. Thanks.
Hello,
Thanks for your interest! I am currently working on building JAX interfaces for tevatron. It will take a week or so before I can get back to add more examples and do quality assurance.
Do note that we used a private version of DPR instead of Tevatron to run the QA experiments in the papers, which helps make sure our evaluation is aligned with prior work.
Thanks for your reply! Looking forward to your released pipeline.
Another question: is your SOTA model on NQ trained with only mined hard negatives, or with both BM25 hard negatives and mined hard negatives, as in the DPR GitHub repo? It seems to be only mined hard negatives in the paper, but I want to confirm, because in my experiments using both BM25 and mined hard negatives works better.
I used both BM25 and mined negatives, in line with the DPR setup. I haven't run thorough experiments on whether including BM25 negatives helps.
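For reference, combining the two negative pools can be sketched as below. This is only an illustration: the field names (`bm25_negatives`, `mined_negatives`, `passage_id`) are hypothetical, not the actual DPR/Tevatron schema, and the mined-first ordering is just one reasonable choice.

```python
import json

def merge_negatives(example, n_total=30):
    """Merge mined and BM25 negatives into one pool, de-duplicated by id.

    Illustrative only -- the input keys here are assumed, not DPR's real format.
    """
    bm25 = example.get("bm25_negatives", [])
    mined = example.get("mined_negatives", [])
    merged, seen = [], set()
    # Prefer mined negatives, then backfill with BM25 ones, skipping duplicates.
    for neg in mined + bm25:
        key = neg["passage_id"]
        if key not in seen:
            seen.add(key)
            merged.append(neg)
        if len(merged) == n_total:
            break
    return {"query": example["query"],
            "positives": example["positives"],
            "negatives": merged}

example = {
    "query": "who wrote hamlet",
    "positives": [{"passage_id": "p1", "text": "Hamlet ... Shakespeare"}],
    "bm25_negatives": [{"passage_id": "n1", "text": "..."},
                       {"passage_id": "n2", "text": "..."}],
    "mined_negatives": [{"passage_id": "n2", "text": "..."},
                        {"passage_id": "n3", "text": "..."}],
}
merged = merge_negatives(example, n_total=3)
print([n["passage_id"] for n in merged["negatives"]])  # ['n2', 'n3', 'n1']
```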
Hi, @luyug
Thank you very much. Sorry to bother you again.
I wonder if the Tevatron toolkit can run experiments on the MS MARCO document ranking set? Or could you share how to reproduce the MS MARCO Document results in the paper Condenser: a Pre-training Architecture for Dense Retrieval?
The data for document ranking is structured in a similar way as the passage ranking dataset. I think the easiest thing is probably to follow the passage ranking example and swap in the document ranking data.
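Concretely, that swap might look like the sketch below, which converts MS MARCO document rows (the official TSV has `docid`, `url`, `title`, `body` columns) into the same record shape a passage-ranking example would use. The output keys (`query`, `positives`, `negatives`) are an assumption for illustration; check the Tevatron data docs for the exact field names.

```python
def doc_tsv_to_record(query, pos_line, neg_lines):
    """Turn MS MARCO document TSV rows into a passage-style training record.

    Assumes the TSV layout docid \t url \t title \t body; the record keys
    below are hypothetical, mirroring a typical passage-ranking example.
    """
    def doc_text(line):
        docid, url, title, body = line.rstrip("\n").split("\t")
        # For documents, the text payload is typically title + body
        # instead of a short passage.
        return f"{title} {body}"

    return {"query": query,
            "positives": [doc_text(pos_line)],
            "negatives": [doc_text(l) for l in neg_lines]}

record = doc_tsv_to_record(
    "what is a dense retriever",
    "D1\thttp://a.example\tDense Retrieval\tA dense retriever encodes queries and documents.",
    ["D2\thttp://b.example\tOther Page\tUnrelated body text."],
)
print(record["positives"][0])
```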
Thanks for such a quick reply. I will try it.
Hi, @luyug I'm finding it hard to reproduce your results on NQ and TQA; mine are about 0.5-1.5 points lower than reported. Would it be possible to get your mined hard negatives for NQ and TQA? My mail is [email protected]
I'm currently a little bit busy, working on several paper/school work deadlines. Will get back to you in a few weeks.
@Dopaminezsy I am assuming you are using tevatron for DPR repro? I wonder if you have tried using the original DPR toolkit. We are triaging some issues with DPR on tevatron right now.
Sorry for the late reply. I am now following the original DPR toolkit rather than Tevatron.
So were your results entirely based on Tevatron? I noticed your shared results are already higher than the original DPR toolkit's.
This is unexpected. I intentionally used the DPR toolkit in the paper to make sure the results are comparable to previous works. Locally I have been seeing rather stable results with DPR. I mentioned tevatron because we were having some unexpected regressions there.
If you are using DPR, make sure you are running exactly the same setup as in the original DPR instructions, including 8 GPUs, a per-device batch size of 16, DDP training, etc.
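The reason the exact launch setup matters can be shown with a quick calculation: under DDP, DPR gathers passage representations across devices, so the GPU count and per-device batch size together determine the effective batch and the in-batch negative pool.

```python
# Effective (global) batch size for the setup above: each of the 8 DDP
# processes holds its own 16-example batch, and DPR's gather step makes
# the in-batch negatives global across all devices.
gpus = 8
per_device_batch = 16
global_batch = gpus * per_device_batch
print(global_batch)  # 128
```

Running on, say, 4 GPUs without doubling the per-device batch would halve the negative pool per query, which alone can account for a fraction of a point on NQ/TQA.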
Thanks, I will check those. In addition, I think sharing the code for mining hard negatives would help, since the details may differ. Also, could you share the first-round results on NQ trained with only BM25 negatives? That may help too :)
Hi, I found that the numbers of BM25 negatives and mined hard negatives provided in the nq-example differ per example. Is this expected?
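A minimal way to inspect those counts, assuming the examples are JSON lines and using hypothetical key names (`bm25_negatives`, `mined_negatives`) that should be adjusted to whatever the nq-example files actually use:

```python
import json

def count_negatives(jsonl_line):
    """Return (BM25 negative count, mined negative count) for one example.

    The key names here are assumptions -- swap in the real keys from the data.
    """
    ex = json.loads(jsonl_line)
    return len(ex.get("bm25_negatives", [])), len(ex.get("mined_negatives", []))

# Fabricated one-line example purely to demonstrate the check:
line = json.dumps({"query": "q",
                   "bm25_negatives": [{}] * 30,
                   "mined_negatives": [{}] * 50})
print(count_negatives(line))  # (30, 50)
```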