
Load dataset from hf failed

Open murphypei opened this issue 1 year ago • 5 comments

from datasets import load_dataset

datasets = ['hotpotqa', '2wikimqa', 'musique', 'narrativeqa', 'qasper', 'multifieldqa_en', 'gov_report', 'qmsum', 'trec', 'samsum', 'triviaqa', 'passage_count', 'passage_retrieval_en', 'multi_news']
for dataset in datasets:
    print(f"Loading dataset {dataset}")
    data = load_dataset("THUDM/LongBench", dataset, split="test")
    output_path = f"{output_dir}/pred/{dataset}.jsonl"

File "/usr/local/lib/python3.9/dist-packages/datasets/packaged_modules/cache/cache.py", line 65, in _find_hash_in_cache raise ValueError( ValueError: Couldn't find cache for THUDM/LongBench for config '2wikimqa' Available configs in the cache: ['dureader', 'hotpotqa', 'multifieldqa_en_e', 'qasper_e']

murphypei avatar Jul 16 '24 09:07 murphypei

Hi, can you try deleting the cached files and downloading everything again?
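
If hunting down the cache directory is inconvenient, a minimal sketch of an alternative is to let the library skip the cache entirely (download_mode is part of the datasets API; the config name here is just an example):

from datasets import load_dataset

# Bypass whatever is in the local cache and fetch the files again
data = load_dataset("THUDM/LongBench", "2wikimqa", split="test",
                    download_mode="force_redownload")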

bys0318 avatar Jul 17 '24 04:07 bys0318

Hi, can you try deleting the cached files and downloading everything again?

Yes, and I have tested many times, on both my local machine and in a Docker environment. I don't know if you can reproduce this error; maybe it's just a mistake on my side. Thanks for your reply.

In the end I had to download the jsonl files and load them from local disk, which works.

I can still use the dataset this way, but I think this error may lead to reduced usage.
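
A minimal sketch of that local-disk workaround (the file path and split name are assumptions about where the jsonl files were downloaded):

from datasets import load_dataset

dataset = "2wikimqa"
# Load the per-task jsonl file that was downloaded manually from the HF repo
data = load_dataset("json", data_files=f"LongBench/data/{dataset}.jsonl", split="train")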

murphypei avatar Jul 17 '24 09:07 murphypei

Glad to hear you've loaded the dataset! Perhaps this error is due to an outdated datasets version. You can try updating the package:

pip install -U datasets
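
To confirm the upgrade actually took effect in the environment that runs the script, a quick check (a sketch, nothing LongBench-specific):

import datasets
print(datasets.__version__)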

bys0318 avatar Jul 17 '24 10:07 bys0318

Glad to hear you've loaded the dataset! Perhaps this error is due to an outdated datasets version. You can try updating the package:

pip install -U datasets

I have already upgraded it to the latest version, but it didn't work. Maybe it's a Hugging Face issue?

murphypei avatar Jul 19 '24 02:07 murphypei

Hi there, downgrading datasets to 3.2.0 works for me. When using datasets==4.3.0, the following log shows up and I can't load the datasets properly. It's probably because Hugging Face no longer supports remote script execution for dataset loading. Perhaps the maintainers can consider updating the dataset to "a standard format like Parquet", as the log suggests?

[screenshot of the datasets error log]
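
A sketch of the pinning workaround described above (passing trust_remote_code=True is how script-based datasets are loaded on datasets 3.x; whether your environment needs any further flags is an assumption):

pip install "datasets==3.2.0"

from datasets import load_dataset

# Loading-script datasets must be explicitly trusted on datasets 3.x
data = load_dataset("THUDM/LongBench", "2wikimqa", split="test", trust_remote_code=True)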

SHA-4096 avatar Oct 30 '25 08:10 SHA-4096