How to reproduce the results on NQ

Open shunyuzh opened this issue 4 years ago • 13 comments

Hi, @luyug

Thanks for your awesome work. Could you share more details on how to reproduce the NQ results in the paper (84.3=MRR@5), similar to the detailed MS MARCO tutorial?

Looking forward to your reply. Thanks.

shunyuzh avatar Oct 14 '21 03:10 shunyuzh

Hello,

Thanks for your interest! I am currently working on building JAX interfaces for tevatron. It will take a week or so before I can get back to add more examples and do quality assurance.

Do note that we used a private version of DPR instead of tevatron to run the QA experiments in the papers, which helps make sure our evaluation is aligned.

luyug avatar Oct 18 '21 14:10 luyug

Thanks for your reply! Looking forward to your released pipeline.

Another question: was your SOTA model on NQ trained with only mined hard negatives, or with both BM25 hard negatives and mined hard negatives, as in the DPR GitHub repository? The paper seems to use only mined hard negatives, but I want to double-check because I found that using both BM25 and mined hard negatives works better in my experiments.

shunyuzh avatar Oct 19 '21 07:10 shunyuzh

I used both BM25 and mined negatives, aligning with the DPR setup. I don't have thorough experiments on whether it is better to include or not to include BM25 negatives.

luyug avatar Oct 19 '21 14:10 luyug
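
For readers trying the same setup, here is a minimal sketch of combining BM25 negatives with mined hard negatives for one training example. It is not the author's pipeline. The field names `positive_ctxs` and `hard_negative_ctxs` follow the DPR retriever JSON format; the `passage_id` key and the helper `merge_hard_negatives` are assumptions made for illustration.

```python
def merge_hard_negatives(bm25_negs, mined_negs, positives, id_key="passage_id"):
    """Union BM25 and mined negatives, dropping duplicates and any positive passage.

    Hypothetical helper: `id_key` is an assumed name for the passage identifier.
    """
    positive_ids = {p[id_key] for p in positives}
    merged, seen = [], set()
    # BM25 negatives come first, so they are kept when both sources agree.
    for ctx in list(bm25_negs) + list(mined_negs):
        pid = ctx[id_key]
        if pid in positive_ids or pid in seen:
            continue
        seen.add(pid)
        merged.append(ctx)
    return merged


# Toy example in DPR-style layout (contents are made up).
example = {
    "question": "who wrote the origin of species",
    "positive_ctxs": [{"passage_id": "9", "text": "..."}],
    "hard_negative_ctxs": [{"passage_id": "1", "text": "..."},
                           {"passage_id": "2", "text": "..."}],
}
mined = [{"passage_id": "2", "text": "..."},
         {"passage_id": "3", "text": "..."},
         {"passage_id": "9", "text": "..."}]  # a mined "negative" that is actually positive

example["hard_negative_ctxs"] = merge_hard_negatives(
    example["hard_negative_ctxs"], mined, example["positive_ctxs"]
)
merged_ids = [c["passage_id"] for c in example["hard_negative_ctxs"]]
```

Filtering against the positives matters because a retriever used for mining will often rank gold passages highly, and those must not be relabelled as negatives.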

Hi, @luyug

Thank you very much. Sorry to bother you again.

Can the Tevatron toolkit run experiments on the MS MARCO document ranking set? Or could you share how to reproduce the MS MARCO Document results in the paper "Condenser: a Pre-training Architecture for Dense Retrieval"?

shunyuzh avatar Oct 27 '21 06:10 shunyuzh

The data for document ranking is structured in a similar way as the passage ranking dataset. I think the easiest thing is probably to follow the passage ranking example and swap in the document ranking data.

luyug avatar Oct 27 '21 21:10 luyug

Thanks for such a quick reply. I will try it.

shunyuzh avatar Oct 29 '21 04:10 shunyuzh

Hi, @luyug I find it hard to reproduce your results for NQ and TQA; my numbers are about 0.5-1.5 points lower than the ones you reported. Would it be possible to get your mined hard negatives for NQ and TQA? My mail is [email protected]

shunyuzh avatar Nov 02 '21 12:11 shunyuzh

I'm currently a little busy, working on several paper and school deadlines. I will get back to you in a few weeks.

luyug avatar Nov 11 '21 20:11 luyug

@Dopaminezsy I am assuming you are using tevatron for the DPR repro? I wonder if you have tried the original DPR toolkit. We are triaging some issues with DPR on tevatron right now.

luyug avatar Nov 30 '21 18:11 luyug

Sorry for the late reply. I am now following the original DPR toolkit rather than Tevatron.

So were your results based entirely on Tevatron? I noticed that your shared results are already higher than those of the original DPR toolkit.

shunyuzh avatar Dec 07 '21 12:12 shunyuzh

This is unexpected. I intentionally used the DPR toolkit in the paper to make sure the results are comparable to previous work. Locally I have been seeing rather stable results with DPR. I mentioned tevatron because we were having some unexpected regressions there.

If you are using DPR, make sure you are running the exact same setup as in the original DPR instructions, including 8 GPUs, a batch size of 16 per device, DDP training, etc.

luyug avatar Dec 07 '21 20:12 luyug
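
To see why the hardware setup matters, here is a rough arithmetic sketch of how many negatives each question is scored against with in-batch negatives. It assumes one hard negative per question and that passages are gathered across all devices before the loss is computed; both are assumptions about the configuration, and `in_batch_stats` is a hypothetical helper.

```python
def in_batch_stats(num_gpus, per_device_batch,
                   hard_negs_per_question=1, gather_across_devices=True):
    """Count the passages a single question is contrasted with in one step."""
    global_batch = num_gpus * per_device_batch
    # Without cross-device gathering, only the local batch contributes passages.
    pool = global_batch if gather_across_devices else per_device_batch
    passages = pool * (1 + hard_negs_per_question)  # one positive + hard negatives each
    return {
        "global_batch": global_batch,
        "passages_per_question": passages,
        "negatives_per_question": passages - 1,  # everything except its own positive
    }


reference = in_batch_stats(num_gpus=8, per_device_batch=16)
single_gpu = in_batch_stats(num_gpus=1, per_device_batch=16)
print(reference)
print(single_gpu)
```

Under these assumptions, 8 GPUs with 16 examples each give a global batch of 128 and 255 negatives per question, while a single GPU at the same per-device batch size gives only 31. A difference of that kind could plausibly account for a gap of a point or so, although the thread does not confirm it as the cause.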

Thanks, I will check them. In addition, I think seeing the code for mining hard negatives would help, since some details may differ. Could you also share the first-round results on NQ with only BM25 negatives? That may help as well :)

shunyuzh avatar Dec 09 '21 05:12 shunyuzh

Hi, I found that the numbers of BM25 negative samples and hard negative samples provided in nq-example are different. Is this expected?

tecmry avatar May 09 '22 04:05 tecmry
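
One way to check this kind of mismatch is to build a histogram of how many negatives each example carries in each file. The sketch below works on examples already loaded into memory; the field name `hard_negative_ctxs` follows the DPR JSON format and may differ in the nq-example files, so treat it as an assumption.

```python
from collections import Counter


def negative_count_histogram(examples, field="hard_negative_ctxs"):
    """Map 'number of negatives' -> 'number of examples with that many'."""
    return Counter(len(ex.get(field, [])) for ex in examples)


# Toy data standing in for two training files (contents are made up).
bm25_file = [
    {"question": "q1", "hard_negative_ctxs": [{"text": "a"}, {"text": "b"}]},
    {"question": "q2", "hard_negative_ctxs": [{"text": "c"}, {"text": "d"}]},
]
mined_file = [
    {"question": "q1", "hard_negative_ctxs": [{"text": "e"}]},
    {"question": "q2"},  # no negatives were mined for this question
]

bm25_hist = negative_count_histogram(bm25_file)
mined_hist = negative_count_histogram(mined_file)
print(dict(bm25_hist), dict(mined_hist))
```

Differing counts are not necessarily a problem: mined negatives are typically filtered against the gold passages, so some questions can end up with fewer of them.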