PIKE-RAG

HotpotQA reproduction fails: naive F1 72.35%, atom F1 48.25%

Open charliedream1 opened this issue 9 months ago • 3 comments

I'm using Qwen2.5-72B-int4 to reproduce the reported performance. However, atomic_decompose.yml performs much worse than qa_chunk.yml: naive RAG scores more than 20 points higher than atomic decomposition. What could be the problem? I hope to hear from you soon!

charliedream1 avatar Apr 24 '25 06:04 charliedream1

Hi @charliedream1 ,

This sounds strange. A few things come to mind. Have you tried Zero-Shot CoT? The F1 score you got is very close to that of Zero-Shot CoT, so I suspect that either the atomic questions were not successfully tagged onto the original chunks or the chunks were not successfully retrieved during the process. You can confirm this by: (1) looking at the atomic questions in the file "data/hotpotqa/dev_500_retrieval_contexts_as_chunks_with_atom_questions.jsonl"; (2) looking at the logs in the file "logs/hotpotqa/atomic_decompose/atomic_decompose.jsonl" to see whether retrieved documents are used in generation.
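Check (1) can be scripted. The following is a minimal sketch, not part of PIKE-RAG: it assumes each line of the file is a JSON object with a `chunk_id` field and an `atom_questions` list, and the helper name `summarize_atom_questions` is hypothetical.

```python
import json
from pathlib import Path


def summarize_atom_questions(path):
    """Count chunks and collect the ids of chunks with no atomic questions."""
    total = 0
    untagged = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            total += 1
            # A missing or empty "atom_questions" list means tagging failed for this chunk.
            if not record.get("atom_questions"):
                untagged.append(record.get("chunk_id"))
    return total, untagged


if __name__ == "__main__":
    path = "data/hotpotqa/dev_500_retrieval_contexts_as_chunks_with_atom_questions.jsonl"
    if Path(path).exists():
        total, untagged = summarize_atom_questions(path)
        print(f"{total} chunks, {len(untagged)} without atomic questions")
```

If many chunks come back untagged, the tagging step is the first place to look.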

Jinyu-W avatar May 07 '25 01:05 Jinyu-W

I haven't tried zero-shot CoT. But I did see a lot of warnings that nothing was retrieved. I lowered the atom threshold from the default 0.5 to 0.3, and the score improved by 2%. So it seems the result is not just zero-shot CoT.
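To illustrate why a threshold can leave a question with nothing retrieved, here is a minimal sketch. It is not PIKE-RAG's retrieval code: the function `filter_by_threshold` and the similarity scores are hypothetical, and it assumes a candidate is kept only when its score is at least the threshold.

```python
def filter_by_threshold(scored_candidates, threshold):
    """Keep (chunk_id, score) pairs whose score is at least the threshold."""
    return [(cid, score) for cid, score in scored_candidates if score >= threshold]


# Made-up similarity scores for two questions, for illustration only.
candidates = {
    "q1": [("chunk-a", 0.62), ("chunk-b", 0.41)],
    "q2": [("chunk-c", 0.44), ("chunk-d", 0.35)],
}

for threshold in (0.5, 0.3):
    # Questions for which every candidate falls below the threshold.
    empty = [q for q, c in candidates.items() if not filter_by_threshold(c, threshold)]
    print(f"threshold={threshold}: questions with nothing retrieved = {empty}")
```

Lowering the threshold reduces the empty cases, but it also admits weaker matches, so it is a trade-off rather than a fix.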

I will check again on what you mentioned.

charliedream1 avatar May 07 '25 02:05 charliedream1

  1. "data/hotpotqa/dev_500_retrieval_contexts_as_chunks_with_atom_questions.jsonl" is the generated file, and it looks good:
{"chunk_id": "Harbor Square-0-1", "title": "Harbor Square", "atom_questions": ["What is the original name of Harbor Square?", "What was the former name of the shopping center now known as Harbor Square?", "Where is Harbor Square located?", "What major road is Harbor Square accessible from?", "What is the exit number to reach Harbor Square from the Garden State Parkway?", "Who owns Harbor Square?", "What is the current gross leasable area of Harbor Square?", "What was the gross leasable area of Harbor Square when it was a mall?", "How much land does Harbor Square occupy?", "What are the anchor stores at Harbor Square?"], "content": "Harbor Square, formerly Shore Mall, is a shopping plaza (formerly a shopping mall) in Egg Harbor Township, New Jersey on U.S. Route 40/U.S. Route 322 originally known as \"Searstown\". The plaza is accessible from Exit 36 off the Garden State Parkway. The plaza is owned by Aetna Realty. The plaza has a gross leasable area of 337,423 ft², formerly 620,000 ft² when it was a mall, located on 73 acre of land. The plaza's anchor stores include Boscov's and Burlington Coat Factory."}
{"chunk_id": "Smith Road, Chennai-0-1", "title": "Smith Road, Chennai", "atom_questions": ["What is the starting point of Smith Road in Chennai?", "Where does Smith Road end in Chennai?", "What major road does Smith Road branch off from in Chennai?", "Near which landmark does Smith Road branch off from Anna Salai?", "What is the name of the school near which Smith Road joins Whites Road?", "In which Indian state is Smith Road located?", "What is the name of the arterial road in Chennai mentioned in the content?"], "content": "Smith Road in Chennai, Tamil Nadu, India branches off from Anna Salai, Chennai's arterial road near Spencer Plaza from the TVS Junction to join Whites Road near Hobart Muslim Girls Higher Secondary School."}
  2. In the logs at "logs/hotpotqa/atomic_decompose/atomic_decompose.jsonl", I didn't see anything abnormal.

  3. I tested several methods; the results are below:

Method            F1
zero-cot          39.43%
self-ask-chunk    54.16%
self-ask-H-R      52.87%
ATOM decomp       48.25%
naive             72.35%

It looks like the questions generated in the tagging process work less well than using the chunks directly, and the model is not good at asking follow-up questions.

The results are quite different from those in the README. What could be the reason?

charliedream1 avatar May 09 '25 09:05 charliedream1