
Some of DownloadConfig's properties are always being overridden in load.py

Open ductai199x opened this issue 1 year ago • 0 comments

Describe the bug

The extract_compressed_file and force_extract properties of DownloadConfig are always set to True in the dataset_module_factory function in load.py, regardless of what the caller passes. This is very annoying because previously extracted data is ignored and the archives are extracted again the next time the dataset is loaded.

See this image below: image
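The pattern described above can be sketched in a few lines. This is a stand-in model, not the code from load.py: `FakeDownloadConfig` and `factory_overriding` are hypothetical names, and only the two field names come from this issue.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class FakeDownloadConfig:
    """Hypothetical stand-in holding only the two fields named in this issue."""
    extract_compressed_file: bool = False
    force_extract: bool = False


def factory_overriding(config: Optional[FakeDownloadConfig] = None) -> FakeDownloadConfig:
    """Models the reported behavior: both flags are forced to True."""
    # Copy the caller's config (or create a default one) ...
    config = replace(config) if config is not None else FakeDownloadConfig()
    # ... then overwrite the flags unconditionally, discarding the caller's choice.
    config.extract_compressed_file = True
    config.force_extract = True
    return config


user_config = FakeDownloadConfig(extract_compressed_file=True, force_extract=False)
effective = factory_overriding(user_config)
print(effective.force_extract)  # True: the caller's False was discarded
```

In this model the caller has no way to keep force_extract at False, which matches the symptom reported here.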

Steps to reproduce the bug

  1. Have a local dataset that contains archived files (zip, tar.gz, etc)
  2. Build a dataset loading script to download and extract these files
  3. Run the load_dataset function with a DownloadConfig that explicitly sets force_extract to False
  4. The extraction process will start regardless of whether the archives were extracted previously

Expected behavior

The extraction process should not run when the archives were previously extracted and force_extract is set to False.
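A minimal sketch of the expected behavior, using only the standard library. `extract_if_needed` is a hypothetical helper, not part of datasets, and treating a non-empty output directory as "previously extracted" is an assumption made for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_if_needed(archive: Path, out_dir: Path, force_extract: bool = False) -> bool:
    """Extract `archive` into `out_dir`; return True if extraction actually ran."""
    # Assumption: a non-empty output directory means an earlier extraction succeeded.
    if out_dir.is_dir() and any(out_dir.iterdir()) and not force_extract:
        return False  # reuse the earlier extraction
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out_dir)
    return True


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny archive to extract.
    archive = Path(tmp) / "data.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("train.txt", "hello\n")
    out_dir = Path(tmp) / "extracted"

    first = extract_if_needed(archive, out_dir)    # nothing extracted yet, so it runs
    second = extract_if_needed(archive, out_dir)   # already extracted, so it is skipped
    forced = extract_if_needed(archive, out_dir, force_extract=True)  # forced re-run

print(first, second, forced)  # True False True
```

The second call is the case this issue is about: with force_extract set to False and the data already on disk, extraction is skipped.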

Environment info

datasets==2.20.0, Python 3.9

ductai199x, Aug 09 '24 18:08