tevatron How to reproduce the results on NQ

Hi, @luyug

Thanks for your awesome work. Could you give more details on how to reproduce the results (84.3=MRR@5) on NQ in the paper, similar to the detailed MS MARCO tutorial demo?

Looking forward to your reply. Thanks.

Oct 14 '21 03:10 shunyuzh

Hello,

Thanks for your interest! I am currently working on building JAX interfaces for tevatron. It will take a week or so before I can get back to add more examples and do quality assurance.

Do note that we used a private version of DPR instead of tevatron to run the QA experiments in the papers, which helps make sure our evaluation is aligned.

Oct 18 '21 14:10 luyug

Thanks for your reply! Looking forward to your released pipeline.

Another question: was your SOTA model on NQ trained with only mined hard negatives, or with both BM25 hard negatives and mined hard negatives as in the DPR GitHub repository? The paper seems to use only mined hard negatives, but I want to double-check, because in my experiments using both BM25 and mined hard negatives works better.

Oct 19 '21 07:10 shunyuzh

I used both BM25 and mined negatives, aligning with the DPR setup. I don't have thorough experiments on whether it is better to include or not to include BM25 negatives.
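
For anyone trying the same setup, here is a minimal sketch of combining the two negative sources for one training example. It is not the code used for the paper. The function name `merge_hard_negatives` and the `passage_id` key are assumptions for illustration, so adjust the key to whatever your data files actually use.

```python
from typing import Dict, Iterable, List


def merge_hard_negatives(
    bm25_negs: Iterable[Dict],
    mined_negs: Iterable[Dict],
    positive_ids: Iterable[str],
    id_key: str = "passage_id",  # assumed field name; change to match your data
) -> List[Dict]:
    """Concatenate BM25 and mined negatives, dropping duplicates and positives."""
    seen = set(positive_ids)  # never let a positive passage slip in as a negative
    merged = []
    for ctx in list(bm25_negs) + list(mined_negs):
        pid = ctx[id_key]
        if pid in seen:
            continue  # already added, or it is a known positive
        seen.add(pid)
        merged.append(ctx)
    return merged


bm25 = [{"passage_id": "p1"}, {"passage_id": "p2"}]
mined = [{"passage_id": "p2"}, {"passage_id": "p3"}, {"passage_id": "p9"}]
pool = merge_hard_negatives(bm25, mined, positive_ids=["p9"])
print([ctx["passage_id"] for ctx in pool])
```

BM25 negatives come first in the merged list, so if your trainer only takes the first hard negative per query you may want to shuffle the pool.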

Oct 19 '21 14:10 luyug

Hi, @luyug

Thank you very much. Sorry to disturb again.

I wonder whether the Tevatron toolkit can run experiments on the MS MARCO document ranking set? Or could you share how to reproduce the MS MARCO Document results in the paper "Condenser: a Pre-training Architecture for Dense Retrieval"?

Oct 27 '21 06:10 shunyuzh

The data for document ranking is structured in a similar way as the passage ranking dataset. I think the easiest thing is probably to follow the passage ranking example and swap in the document ranking data.

Oct 27 '21 21:10 luyug

Thanks for such a quick reply. I will try it.

Oct 29 '21 04:10 shunyuzh

Hi, @luyug. I find it hard to reproduce your results for NQ and TQA: my numbers are about 0.5-1.5 points lower than those you reported. Would it be possible to get your mined hard negatives for NQ and TQA? My mail is [email protected]

Nov 02 '21 12:11 shunyuzh

I'm currently a little bit busy, working on several paper/school work deadlines. Will get back to you in a few weeks.

Nov 11 '21 20:11 luyug

@Dopaminezsy I am assuming you are using tevatron for DPR repro? I wonder if you have tried using the original DPR toolkit. We are triaging some issues with DPR on tevatron right now.

Nov 30 '21 18:11 luyug

Sorry for the late reply. I am now following the original DPR toolkit rather than Tevatron.

So were your results based entirely on Tevatron? I have noticed that your shared results are already higher than those of the original DPR toolkit.

Dec 07 '21 12:12 shunyuzh

This is unexpected. I intentionally used the DPR toolkit in the paper to make sure the results are comparable to previous works. Locally I have been seeing rather stable results with DPR. I mentioned tevatron because we were having some unexpected regressions there.

If you are using DPR, make sure you are running the exact same setup as in the original DPR instructions, including 8 GPUs, a batch size of 16 per device, DDP training, etc.
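
As a rough sanity check on that setup, the sketch below computes the global batch size and the number of candidate passages each query is scored against. It assumes one positive and one hard negative passage per query, and that passage embeddings are shared across devices for in-batch negatives. The function names are illustrative and are not part of DPR or tevatron.

```python
def global_batch_size(num_gpus: int, per_device_batch: int) -> int:
    """Number of queries processed per optimizer step across all devices."""
    return num_gpus * per_device_batch


def candidates_per_query(
    num_gpus: int,
    per_device_batch: int,
    hard_negatives_per_query: int = 1,  # assumption: one hard negative per query
    share_across_devices: bool = True,  # assumption: passages gathered from all GPUs
) -> int:
    """Passages each query is compared with (its positive plus all negatives)."""
    if share_across_devices:
        queries = global_batch_size(num_gpus, per_device_batch)
    else:
        queries = per_device_batch
    # Every query contributes one positive plus its own hard negatives.
    return queries * (1 + hard_negatives_per_query)


print(global_batch_size(8, 16))     # 128 queries per step
print(candidates_per_query(8, 16))  # 256 passages per query
print(candidates_per_query(8, 16, share_across_devices=False))  # 32
```

Under these assumptions, fewer GPUs or a smaller per-device batch shrinks the pool of in-batch negatives. Gradient accumulation alone does not restore it, which is one reason a different hardware setup can lower the final numbers.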

Dec 07 '21 20:12 luyug

Thanks, I will check them. In addition, I think the code for mining hard negatives would help, since some details may differ. Could you also share the first-round results on NQ with only BM25 negatives? That may help too :)

Dec 09 '21 05:12 shunyuzh

Hi, I found that the number of BM25 negative samples and the number of hard negative samples provided in nq-example are different. Is this expected?
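
For reference, a small sketch of how such a difference can be measured: it builds a histogram of negatives per query over a list of already-loaded examples. The key name `negative_passages` is an assumption and may not match the fields in the nq-example files, so pass the key your files actually use.

```python
from collections import Counter
from typing import Dict, Iterable


def negatives_per_query(
    examples: Iterable[Dict],
    key: str = "negative_passages",  # assumed field name; change to match your data
) -> Counter:
    """Map 'number of negatives' to 'how many queries have that many'."""
    return Counter(len(example.get(key, [])) for example in examples)


# Toy data standing in for two negative files loaded into memory.
bm25_examples = [{"negative_passages": ["a", "b"]}, {"negative_passages": ["c"]}]
mined_examples = [{"negative_passages": ["d", "e"]}, {"negative_passages": []}]

print(negatives_per_query(bm25_examples))
print(negatives_per_query(mined_examples))
```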

May 09 '22 04:05 tecmry