File missing when calling load_dataset with openwebtext on Windows
Describe the bug
The file 0015896-b1054262f7da52a0518521e29c8e352c.txt is reported as missing when I run run_mlm.py with openwebtext. I checked the cache path and could not find 0015896-b1054262f7da52a0518521e29c8e352c.txt there. However, when I open 17ecf461bfccd469a1fbc264ccb03731f8606eea7b3e2e8b86e13d18040bf5b3/urlsf_subset00-16_data.xz with 7-Zip, the file is present in the archive.
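The 7-Zip check can also be done from Python. This is only a minimal sketch: it assumes the subset archive is an xz-compressed tar, and archive_has_member is a hypothetical helper name, not part of datasets.

```python
import tarfile


def archive_has_member(archive_path, member_name):
    """Return True if an entry with this base name exists in the archive.

    Assumption: archive_path points to an xz-compressed tar file.
    """
    with tarfile.open(archive_path, mode="r:xz") as tar:
        # Compare base names only, so entries stored in subfolders also match.
        return any(name.split("/")[-1] == member_name for name in tar.getnames())
```

Calling it with the path to urlsf_subset00-16_data.xz and the name 0015896-b1054262f7da52a0518521e29c8e352c.txt should confirm whether the archive itself is complete.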
Steps to reproduce the bug
python run_mlm.py --model_type roberta --tokenizer_name roberta-base --dataset_name openwebtext --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --do_train --do_eval --output_dir F:/model/roberta-base
or
from datasets import load_dataset
load_dataset("openwebtext", None, cache_dir=None, use_auth_token=None)
Expected results
The dataset loads successfully.
Actual results
Traceback (most recent call last):
  File "D:\Python\v3.8.5\lib\site-packages\datasets\builder.py", line 704, in download_and_prepare
    self._download_and_prepare(
  File "D:\Python\v3.8.5\lib\site-packages\datasets\builder.py", line 1227, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "D:\Python\v3.8.5\lib\site-packages\datasets\builder.py", line 795, in _download_and_prepare
    raise OSError(
OSError: Cannot find data file. Original error: [Errno 22] Invalid argument: 'F://huggingface/datasets/downloads/extracted/0901d27f43b7e9ac0577da0d0061c8c632ba0b70ecd1b4bfb21562d9b7486faa/0015896-b1054262f7da52a0518521e29c8e352c.txt'
Environment info
- datasets version: 2.4.0
- Platform: Windows
- Python version: 3.8.5
- PyArrow version: 9.0.0
I tried extracting 0015896-b1054262f7da52a0518521e29c8e352c.txt from 17ecf461bfccd469a1fbc264ccb03731f8606eea7b3e2e8b86e13d18040bf5b3/urlsf_subset00-16_data.xz with 7-Zip and placing it in the cache path F://huggingface/datasets/downloads/extracted/0901d27f43b7e9ac0577da0d0061c8c632ba0b70ecd1b4bfb21562d9b7486faa.
The same error is still raised, and the file is removed from the cache path again after I run run_mlm.py with the command shown in the steps to reproduce.
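To see whether the file is really present and readable right before and after a run, a small standard-library check may help. This is a rough diagnostic sketch, not something from datasets; inspect_extracted_file is a hypothetical helper name.

```python
import os


def inspect_extracted_file(extracted_dir, file_name):
    """Report whether a file exists in a directory and whether it can be opened."""
    path = os.path.join(extracted_dir, file_name)
    report = {
        "path": path,
        "exists": os.path.exists(path),
        "size": None,
        "readable": False,
        "error": None,
    }
    if report["exists"]:
        report["size"] = os.path.getsize(path)
        try:
            # Reading one byte is enough to surface an OSError such as Errno 22.
            with open(path, "rb") as handle:
                handle.read(1)
            report["readable"] = True
        except OSError as err:
            report["error"] = repr(err)
    return report


# Example call with the paths from this report:
# print(inspect_extracted_file(
#     "F://huggingface/datasets/downloads/extracted/"
#     "0901d27f43b7e9ac0577da0d0061c8c632ba0b70ecd1b4bfb21562d9b7486faa",
#     "0015896-b1054262f7da52a0518521e29c8e352c.txt",
# ))
```

Running it once after copying the file in manually and once after run_mlm.py fails would show whether the file disappears during the run or cannot be opened in the first place.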