LongRAG

Could not reproduce the answer recall for the NQ dataset

Open tyu008 opened this issue 1 year ago • 4 comments

Hi, I load nq/full-00000-of-00001.parquet and compute the answer recall based on:

```python
answers, context = item["answer"], item["context"]
is_retrieval = has_correct_answer(context, answers)
```

I could only get an answer recall of 0.8532, which is below the 88.53 reported in Table 1 of the paper.
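
For completeness, my full computation is roughly the following (a minimal sketch; has_correct_answer is my own case-insensitive substring check, so everything outside the dataset fields is my naming, not necessarily the repo's):

```python
import pandas as pd

def has_correct_answer(context: str, answers: list[str]) -> bool:
    # A retrieval unit counts as a hit if any gold answer string appears in it.
    context = context.lower()
    return any(ans.lower() in context for ans in answers)

df = pd.read_parquet("nq/full-00000-of-00001.parquet")
hits = [has_correct_answer(item["context"], item["answer"]) for _, item in df.iterrows()]
print(f"answer recall: {sum(hits) / len(hits):.4f}")
```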

tyu008 avatar Aug 09 '24 23:08 tyu008

Hi @tyu008, thanks for raising this question. First, nq/full-00000-of-00001.parquet corresponds to the num_retrieval_units = 4 line, not the num_retrieval_units = 8 line. Our QA results show that the ideal context length for existing LLMs is around 30K tokens: with a longer context, even if the retrieval performance is higher, the final QA result will degrade (as shown in Figure 3). Therefore, the correct target number is 86.30 rather than 88.53. I will mark this more clearly in the repository.

Second, you are right. The current version's retrieval accuracy is 85.30. There is still a one-point gap between 85.30 and the 86.30 reported in the paper. I suspect I may have uploaded an older version of the final result. I will take a look and upload the new one.

Thanks again for pointing it out!

XMHZZ2018 avatar Aug 10 '24 02:08 XMHZZ2018

Hi, @XMHZZ2018, thanks so much for your quick reply. I am also trying to reproduce the results of the max-P method on the NQ dataset. Following the paper, I divide each group into 512-token snippets and use the maximum snippet similarity as the similarity for the group. But I could only obtain a 67.5% answer recall using the top-1 group, which is below the 71.69 reported in the paper. I am using the exact same model, BAAI/bge-large-en-v1.5, with fp16. I guess I might have some misalignment with you in chunking the groups. Could you share the cropped 512-token snippets? Thanks again!
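
For reference, my max-P scoring looks roughly like this (a sketch, assuming naive fixed-size token windows over each group; the chunking helper and all names here are mine):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def chunk_by_tokens(text: str, size: int = 512) -> list[str]:
    # Naive fixed-size windows over the group's token ids; I suspect the
    # misalignment is somewhere in this step.
    ids = model.tokenizer(text, add_special_tokens=False)["input_ids"]
    return [model.tokenizer.decode(ids[i:i + size]) for i in range(0, len(ids), size)]

def max_p_score(query: str, group_text: str) -> float:
    snippets = chunk_by_tokens(group_text)
    q = model.encode(query, normalize_embeddings=True)
    s = model.encode(snippets, normalize_embeddings=True)
    # Group similarity = max cosine similarity over its 512-token snippets.
    return float(np.max(s @ q))
```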

tyu008 avatar Aug 10 '24 03:08 tyu008

@tyu008 Sure! I think I know the main reason here. I avoid cross-document chunking; for example, if a new document starts, it goes into the next chunk. Previously, when I did cross-document chunking, I observed about a 5% to 10% degradation. (This issue becomes even more severe on HotpotQA, since the documents are even shorter.) I assume the same thing happened to you. I will upload my chunking file to Hugging Face soon so you can reproduce the results. I will ping you here after I finish.
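
To make the difference concrete, here is a sketch of document-respecting chunking (illustrative only; the file I upload will be the ground truth):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")

def chunk_group(documents: list[str], size: int = 512) -> list[str]:
    # A chunk never spans two documents: when one document's tokens run
    # out, the next document starts a fresh chunk, even if the current
    # chunk is shorter than `size`.
    chunks = []
    for doc in documents:
        ids = tokenizer(doc, add_special_tokens=False)["input_ids"]
        for i in range(0, len(ids), size):
            chunks.append(tokenizer.decode(ids[i:i + size]))
    return chunks
```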

XMHZZ2018 avatar Aug 10 '24 06:08 XMHZZ2018

@XMHZZ2018 Got it! Thanks a lot for your quick reply! Looking forward to your chunking file.

tyu008 avatar Aug 10 '24 16:08 tyu008