FolderBase Dataset automatically resolves under current directory when data_dir is not specified

Open npuichigo opened this issue 2 years ago • 17 comments

Describe the bug

FolderBase Dataset automatically resolves under current directory when data_dir is not specified.

For example:

load_dataset("audiofolder")

takes a long time because it resolves and collects data_files from the current directory. I think it should instead reach this line, which handles the error: https://github.com/huggingface/datasets/blob/cb8c5de5145c7e7eee65391cb7f4d92f0d565d62/src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py#L58-L59

Steps to reproduce the bug

load_dataset("audiofolder")

Expected behavior

An error is reported when neither data_dir nor data_files is specified.

Environment info

  • datasets version: 2.14.4
  • Platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.17
  • Python version: 3.8.15
  • Huggingface_hub version: 0.16.4
  • PyArrow version: 12.0.1
  • Pandas version: 1.5.3

npuichigo avatar Aug 16 '23 04:08 npuichigo

@lhoestq

npuichigo avatar Aug 16 '23 04:08 npuichigo

Makes sense. I guess this can be fixed in the load_dataset_builder method. I think it concerns every packaged builder (see the values in _PACKAGED_DATASETS_MODULES).

lhoestq avatar Aug 16 '23 09:08 lhoestq

I think the behavior is related to these lines, which short-circuit the error handling: https://github.com/huggingface/datasets/blob/664a1cb72ea1e6ef7c47e671e2686ca4a35e8d63/src/datasets/load.py#L946-L952

So should data_dir be checked here, or should the check still be delegated to the actual DatasetModule? If it is delegated, how should data_files be set properly here?

npuichigo avatar Aug 16 '23 13:08 npuichigo

This location in PackagedDatasetModuleFactory.get_module seems to be the right place to check that at least one of data_dir or data_files is passed.
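
A minimal sketch of what such a check could look like, written as a standalone function rather than as the actual datasets code. The function name require_data_location and the PACKAGED_BUILDERS set are hypothetical and exist only for illustration; the real list of packaged builders lives in _PACKAGED_DATASETS_MODULES and is not reproduced here.

```python
from typing import Mapping, Optional, Sequence, Union

# Hypothetical stand-in for the packaged builder names (assumption, not the
# real contents of _PACKAGED_DATASETS_MODULES).
PACKAGED_BUILDERS = {"audiofolder", "imagefolder"}

DataFiles = Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]


def require_data_location(
    name: str,
    data_dir: Optional[str] = None,
    data_files: Optional[DataFiles] = None,
) -> None:
    """Raise early if a packaged builder gets neither data_dir nor data_files.

    Failing here avoids silently scanning the current working directory.
    """
    if name in PACKAGED_BUILDERS and data_dir is None and data_files is None:
        raise ValueError(
            f"At least one of data_dir or data_files must be specified "
            f"for the packaged builder '{name}'."
        )
```

With a check like this, load_dataset("audiofolder") would fail immediately instead of walking the current directory, while calls that pass data_dir or data_files would proceed unchanged.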

lhoestq avatar Aug 16 '23 13:08 lhoestq

@mariosasko can you please assign this issue to me? I want to work on this.

debrupf2946 avatar Oct 01 '23 04:10 debrupf2946

#self-assign

debrupf2946 avatar Oct 01 '23 05:10 debrupf2946

@mariosasko is this issue still open? I would love to kickstart my journey to open source with this issue! Regards, zutarich

zutarich avatar Oct 09 '23 06:10 zutarich

@zutarich It is unless @debrupf2946 is working on it.

mariosasko avatar Oct 10 '23 16:10 mariosasko

#self-assign

Etelis avatar Jan 22 '24 14:01 Etelis

I am working on it and will open a pull request soon, @Etelis.

debrupf2946 avatar Jan 22 '24 15:01 debrupf2946

@mariosasko can I take this up?

JINO-ROHIT avatar Apr 03 '24 17:04 JINO-ROHIT

#self-assign

JINO-ROHIT avatar Apr 03 '24 17:04 JINO-ROHIT

Yes, feel free to work on this :)

mariosasko avatar Apr 03 '24 18:04 mariosasko

I think it's working as expected. Here's the log I get for the same line:

[image: screenshot of the log output]

JINO-ROHIT avatar Apr 04 '24 11:04 JINO-ROHIT