
Question about process_wiki_page.py

Open · gw16 opened this issue 1 year ago · 1 comment

Hello,

I downloaded the wiki raw dataset you previously mentioned and ran process_wiki_page.py with the following command:

python process_wiki_page.py --dir_path './bz_file' --output_path_dir './result' --corpus_title_path './psgs_w100.tsv'

The bz_file directory contains the file enwiki-20181220-pages-articles.xml.bz2 along with one other .bz2 file.

However, I encountered the following error:

Processing wiki files:   0%|                                                                               | 0/2 [00:00<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/Users/gawonlee/Desktop/06_school/LongRAG/preprocess/process_wiki_page.py", line 33, in process_wiki
    page_data = json.loads(line_decoded)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/gawonlee/Desktop/06_school/LongRAG/preprocess/process_wiki_page.py", line 117, in <module>
    processed_data = util.process_data()
  File "/Users/gawonlee/Desktop/06_school/LongRAG/utils/mp_util.py", line 25, in process_data
    result_chunks = pool.map(self.func, data_chunks)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Why is this happening? Could you explain the process of creating the dataset in more detail?

gw16 · Aug 16 '24 06:08

Sure! It seems this error occurs during the data loading stage, before dataset creation. Could you please try loading one of the .bz2 files (any one at random) to see if it works?

import bz2
import json

# _normalize is the helper defined in preprocess/process_wiki_page.py
with bz2.open(file_path, "rb") as file:
    for line in file:
        line_decoded = _normalize(line.decode('utf-8'))
        page_data = json.loads(line_decoded)

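For example, to check one of the files you mentioned, something like the snippet below should report the first line that fails to decode (the path just mirrors your command and may need adjusting; it skips _normalize for simplicity):

import bz2
import json

# Hypothetical quick check; adjust file_path to one of your .bz2 shards.
file_path = "./bz_file/enwiki-20181220-pages-articles.xml.bz2"
try:
    with bz2.open(file_path, "rb") as file:
        for i, line in enumerate(file):
            json.loads(line.decode("utf-8"))
except json.JSONDecodeError as e:
    print(f"Line {i} is not valid JSON: {e}")
else:
    print("All lines decoded as JSON.")
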
I don't keep this raw data locally. If loading the .bz2 file doesn't work for you, I can re-download it and send it to you.

Just a bit of context: each line of the .bz2 raw data contains the information for one page, referred to as page_data. The code in process_wiki_page.py parses these lines to extract information such as each page's text and hyperlinks. I saved the processed output on Hugging Face:

https://huggingface.co/datasets/TIGER-Lab/LongRAG/viewer/nq_wiki
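As a rough illustration of what each page_data line provides, the extraction boils down to something like the sketch below. The field names ("title", "text", "anchors") are placeholders for illustration only; the actual keys used by process_wiki_page.py may differ.

import bz2
import json

def extract_pages(file_path):
    """Sketch: collect the title, text, and hyperlinks from each page_data line."""
    pages = []
    with bz2.open(file_path, "rb") as file:
        for line in file:
            page_data = json.loads(line.decode("utf-8"))
            pages.append({
                "title": page_data.get("title"),             # placeholder key
                "text": page_data.get("text"),               # placeholder key
                "hyperlinks": page_data.get("anchors", []),  # placeholder key
            })
    return pages
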

XMHZZ2018 · Aug 29 '24 23:08