file missing when load_dataset with openwebtext on windows

Open xi-loong opened this issue 3 years ago • 1 comments

Describe the bug

0015896-b1054262f7da52a0518521e29c8e352c.txt is missing when I run run_mlm.py with openwebtext. I check the cache_path and can not find 0015896-b1054262f7da52a0518521e29c8e352c.txt. but I can find this file in the 17ecf461bfccd469a1fbc264ccb03731f8606eea7b3e2e8b86e13d18040bf5b3/urlsf_subset00-16_data.xz with 7-zip.

Steps to reproduce the bug

python run_mlm.py --model_type roberta --tokenizer_name roberta-base --dataset_name openwebtext --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --do_train --do_eval --output_dir F:/model/roberta-base

from datasets import load_dataset
load_dataset("openwebtext", None, cache_dir=None, use_auth_token=None)

Expected results

Loading is successful

Actual results

Traceback (most recent call last): File "D:\Python\v3.8.5\lib\site-packages\datasets\builder.py", line 704, in download_and_prepare self._download_and_prepare( File "D:\Python\v3.8.5\lib\site-packages\datasets\builder.py", line 1227, in _download_and_prepare super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos) File "D:\Python\v3.8.5\lib\site-packages\datasets\builder.py", line 795, in _download_and_prepare raise OSError( OSError: Cannot find data file. Original error: [Errno 22] Invalid argument: 'F://huggingface/datasets/downloads/extracted/0901d27f43b7e9ac0577da0d0061c8c632ba0b70ecd1b4bfb21562d9b7486faa/0015896-b1054262f7da52a0518521e29c8e352c.txt'

Environment info

datasets version: 2.4.0
Platform: windows
Python version: 3.8.5
PyArrow version: 9.0.0

Aug 16 '22 04:08 xi-loong

I have tried to extract 0015896-b1054262f7da52a0518521e29c8e352c.txt from 17ecf461bfccd469a1fbc264ccb03731f8606eea7b3e2e8b86e13d18040bf5b3/urlsf_subset00-16_data.xz with 7-zip and put the file into cache_path F://huggingface/datasets/downloads/extracted/0901d27f43b7e9ac0577da0d0061c8c632ba0b70ecd1b4bfb21562d9b7486faa there is still raise the same error and I find the file was removed from cache_path after I run the run_mlm.py with python run_mlm.py --model_type roberta --tokenizer_name roberta-base --dataset_name openwebtext --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --do_train --do_eval --output_dir F:/model/roberta-base.

Aug 16 '22 08:08 xi-loong