Unable to Create a Knowledge Base Successfully Based on the Documentation.

Open JV-X opened this issue 11 months ago • 5 comments

Hello, I'm currently trying to create a knowledge base and interact with it based on the documentation. I have reached the third step. According to the documentation https://github.com/microsoft/PIKE-RAG/blob/main/docs/guides/examples.md, I successfully executed chunking.py and obtained a .pkl file. However, in the next step, when I tried to run tagging.py, I encountered an issue.

As per the documentation, I need to pass a .yml file as an argument. Since I'm not sure how to configure each field in the .yml file, I used the example examples/hotpotqa/configs/tagging.yml provided in the documentation. However, I received the following error: FileNotFoundError: [Errno 2] No such file or directory: 'data/hotpotqa/dev_500_retrieval_contexts_as_chunks.jsonl'

The file dev_500_retrieval_contexts_as_chunks.jsonl is configured in tagging.yml, but I couldn't find this file in the project. I'm also unsure about what content should be in this file or how to obtain this file.

Could you please provide some assistance or a complete executable example for reference?

Thanks for your help.

JV-X avatar Feb 21 '25 11:02 JV-X

Hi, I just ran tagging.py. The default setting just uses the LLM to extract questions from the content. Here is a brief sample of an input chunk:

{ "chunk_id": "ecdabd6e70514aa6b5ec8050ecbc125a", "content": "Edward Davis Wood Jr. (October 10, 1924 – December 10, 1978) was an American filmmaker, actor, writer, producer, and director.", "title": "Edward Davis" }

The default tagging.py prompts the LLM to extract the questions that the content can answer, so the result may look like this:

{ "chunk_id": "ecdabd6e70514aa6b5ec8050ecbc125a", "title": "Edward Davis", "atom_questions": ["What is the full name of Edward Davis?", "When was Edward Davis Wood Jr. born?", "When did Edward Davis Wood Jr. die?", "What professions was Edward Davis Wood Jr. known for?"] }

In the tagging yaml there is a config section like this:

tagger:
  tagging_protocol:
    module_path: pikerag.prompts.tagging
    attr_name: atom_question_tagging_protocol
  tag_name: atom_questions

If you check the folder pikerag/prompts/tagging, you will find semantic_tagging.py and atom_question_tagging.py. I think semantic_tagging.py can extract tagging classes, and written_phrase_mapping.py will use these tagging classes to rewrite the input query.
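If your chunking step left you with a .pkl, one way to get it into the jsonl shape above is sketched below. The paths and the chunk attribute names (page_content, metadata) are guesses on my side; adjust them to whatever your chunk objects actually contain.

import json
import pickle
import uuid

# Rough sketch: convert the chunking output (.pkl) into the jsonl shape tagging.py reads.
with open("data/my_docs/chunks.pkl", "rb") as f:  # hypothetical path to your chunking output
    chunks = pickle.load(f)

with open("data/my_docs/chunks_as_input.jsonl", "w", encoding="utf-8") as out:
    for chunk in chunks:
        record = {
            "chunk_id": uuid.uuid4().hex,
            "content": getattr(chunk, "page_content", str(chunk)),
            "title": getattr(chunk, "metadata", {}).get("title", ""),
        }
        out.write(json.dumps(record, ensure_ascii=False) + "\n")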

whatCanIsay321 avatar Feb 22 '25 16:02 whatCanIsay321

Thank you for your patient reply! Based on your hints, I have successfully executed tagging.py now. However, the output log looks like this:

(pike_rag) hygx@hygx:~/code/PIKE-RAG$ python examples/tagging.py examples/hotpotqa/configs/test_tagging.yml  
Tagging Documents: 0it [00:00, ?it/s]  
(pike_rag) hygx@hygx:~/code/PIKE-RAG$  

The generated test_tagging_with_atom_questions.jsonl.jsonl file is also empty. I suspect there might be an issue with my configuration (for example, do I need to pass the pkl file obtained from executing chunking.py when tagging?), but I'm not sure what the correct configuration should be. Could you help me take a look?
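In case it helps with debugging, a quick check of whether the tagging input actually contains any records could look like this (the path is just a placeholder for whatever jsonl the tagging yaml points to):

import json

path = "data/my_docs/chunks_as_input.jsonl"  # placeholder: the input file configured in the tagging yaml
with open(path, "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} records in {path}")
if records:
    print(records[0].keys())  # expect fields like chunk_id / content / title

If this prints 0 records, the "Tagging Documents: 0it" output would be explained by an empty or wrong input file rather than by the tagger itself.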

JV-X avatar Feb 24 '25 03:02 JV-X

I've encountered the same problem. I don't know where to get test_tagging_with_atom_questions.jsonl. I hope someone can provide an answer.

liyubo-debug avatar Feb 25 '25 02:02 liyubo-debug

The executed command and the corresponding error are as follows:

python examples/tagging.py examples/hotpotqa/configs/tagging.yml
FileNotFoundError: [Errno 2] No such file or directory: 'data/hotpotqa/dev_500_retrieval_contexts_as_chunks.jsonl'

liyubo-debug avatar Feb 25 '25 02:02 liyubo-debug

You need to run some scripts in PIKE-RAG/data_process first: run main.py to download the dataset, then split and sample it; then run retrieval_contexts_as_chunks.py to build the chunks. After that you will have dev_500_retrieval_contexts_as_chunks.jsonl. The rough sequence is below.
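Something like the following (exact script locations and arguments may differ, please check the scripts under PIKE-RAG/data_process):

python data_process/main.py
python data_process/retrieval_contexts_as_chunks.py

main.py downloads the dataset and then splits and samples it; retrieval_contexts_as_chunks.py turns the retrieval contexts into the chunk jsonl (e.g. data/hotpotqa/dev_500_retrieval_contexts_as_chunks.jsonl).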

DevRo886 avatar Feb 27 '25 16:02 DevRo886

How can I use my own data, for example in the earlier chunking step? I want to run QA over my own data after it has been processed.

SevenMpp avatar Mar 26 '25 06:03 SevenMpp

Thank you all for your interest in our work!

@JV-X @liyubo-debug We have just uploaded a document describing the whole process of running the experiment on MuSiQue, with all scripts and configs ready. Introductions to the other necessary steps are also listed in that document.

Jinyu-W avatar Apr 08 '25 03:04 Jinyu-W

@SevenMpp Generally there are two ways to handle your need:

  1. Convert your data to align with the formats used in PIKE-RAG. Sorry that it's not easy for us to release the data directly due to some business issues. You can run the MuSiQue example following this document to get reference data in each format. You can set sample_size in the data preprocessing yaml config to a small value like 2 to reduce the cost.
  2. Write the data-loading utils functions for the chunking script and the qa script, and update the module paths in the yaml config files respectively. These data-loading utils load your data and convert it into the corresponding Python objects used in PIKE-RAG; a rough sketch is given below.
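For illustration only, a custom data-loading util could look roughly like the sketch below. The module path, function name, and returned fields are all hypothetical; please check the existing data-loading utils referenced by the example yaml configs for the exact Python objects the chunking and qa scripts expect.

# my_project/data_loading.py -- hypothetical module; point the module path in your yaml config here.
import os
from typing import Dict, List


def load_my_documents(doc_dir: str) -> List[Dict[str, str]]:
    # Load raw .txt files and return one record per document.
    # The field names (title/content) are placeholders; align them with what PIKE-RAG consumes.
    documents: List[Dict[str, str]] = []
    for filename in sorted(os.listdir(doc_dir)):
        if not filename.endswith(".txt"):
            continue
        with open(os.path.join(doc_dir, filename), "r", encoding="utf-8") as f:
            documents.append({
                "title": os.path.splitext(filename)[0],
                "content": f.read(),
            })
    return documents

Then update the module path / attribute name fields in the chunking and qa yaml configs to point at this module and function, analogous to how the tagging config above points at pikerag.prompts.tagging.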

Thank you for your interest in our work!

Jinyu-W avatar Apr 08 '25 03:04 Jinyu-W