FolderBase Dataset automatically resolves under current directory when data_dir is not specified
Describe the bug
FolderBase Dataset automatically resolves under current directory when data_dir is not specified.
For example:
load_dataset("audiofolder")
takes a long time because it resolves and collects data_files from the current directory. I think it should instead reach this line for error handling: https://github.com/huggingface/datasets/blob/cb8c5de5145c7e7eee65391cb7f4d92f0d565d62/src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py#L58-L59
Steps to reproduce the bug
load_dataset("audiofolder")
Expected behavior
An error is reported because neither data_dir nor data_files is specified.
Environment info
- datasets version: 2.14.4
- Platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.17
- Python version: 3.8.15
- Huggingface_hub version: 0.16.4
- PyArrow version: 12.0.1
- Pandas version: 1.5.3
@lhoestq
Makes sense, I guess this can be fixed in the load_dataset_builder method.
It concerns every packaged builder I think (see values in _PACKAGED_DATASETS_MODULES)
I think the behavior is related to these lines, which short-circuit the error handling: https://github.com/huggingface/datasets/blob/664a1cb72ea1e6ef7c47e671e2686ca4a35e8d63/src/datasets/load.py#L946-L952
So should data_dir be checked here, or should the check still be delegated to the actual DatasetModule? In the latter case, how should data_files be set properly here?
This location in PackagedDatasetModuleFactory.get_module seems to be the right place to check that at least one of data_dir or data_files is passed.
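A minimal sketch of the kind of check suggested above, not the actual change in datasets. The function name `ensure_data_location` and the `FOLDER_BASED_BUILDERS` set are hypothetical and exist only for illustration; the real check would live inside PackagedDatasetModuleFactory.get_module and would probably cover every builder in _PACKAGED_DATASETS_MODULES.

```python
from typing import Mapping, Optional, Sequence, Union

# Hypothetical, illustrative subset of packaged builder names.
# The real list comes from _PACKAGED_DATASETS_MODULES.
FOLDER_BASED_BUILDERS = {"audiofolder", "imagefolder"}

DataFiles = Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]


def ensure_data_location(
    builder_name: str,
    data_dir: Optional[str] = None,
    data_files: Optional[DataFiles] = None,
) -> None:
    """Fail fast instead of silently scanning the current directory.

    Hypothetical helper: raises if a folder-based builder is requested
    without data_dir or data_files.
    """
    if (
        builder_name in FOLDER_BASED_BUILDERS
        and data_dir is None
        and data_files is None
    ):
        raise ValueError(
            f"At least one of data_dir or data_files must be specified "
            f"for the '{builder_name}' builder."
        )


# Passing data_dir is accepted; passing neither argument raises.
ensure_data_location("audiofolder", data_dir="./my_audio")
try:
    ensure_data_location("audiofolder")
except ValueError as err:
    print(f"Rejected early: {err}")
```

Checking before any file resolution starts means load_dataset("audiofolder") would fail immediately rather than walking the current directory first.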
@mariosasko can you please assign this issue to me? I want to work on this.
#self-assign
@mariosasko is this issue still open? I would love to kickstart my journey into open source with this issue! Regards, zutarich
@zutarich It is unless @debrupf2946 is working on it.
#self-assign
I am working on it and will open a pull request soon, @Etelis
@mariosasko can I take this up?
#self-assign
Yes, feel free to work on this :)
I think it's working as expected. Here's the log I get for the same line: