load_dataset can't work with symbolic links

Open VladimirVincan opened this issue 1 year ago • 1 comments

Feature request

Enable the load_dataset function to load local datasets with symbolic links.

E.g., this dataset can be loaded:

├── example_dataset/
│   ├── data/
│   │   ├── train/
│   │   │   ├── file0
│   │   │   ├── file1
│   │   ├── dev/
│   │   │   ├── file2
│   │   │   ├── file3
│   ├── metadata.csv

while this dataset can't:

├── example_dataset_symlink/
│   ├── data/
│   │   ├── train/
│   │   │   ├── sym0 -> file0
│   │   │   ├── sym1 -> file1
│   │   ├── dev/
│   │   │   ├── sym2 -> file2
│   │   │   ├── sym3 -> file3
│   ├── metadata.csv

I have created an example dataset in order to reproduce the problem:

1. Unzip example_dataset.zip.
2. Run no_symlink.sh. Training should start without issues.
3. Run symlink.sh. All four examples end up in the train split, instead of two examples in train and two in dev, and the script does not load the correct audio files.

example_dataset.zip

Motivation

I have a very large dataset locally. Instead of training on the entire dataset, I need to start training on smaller subsets of the data. Because of the experiments I am running, I will need to create many smaller datasets with overlapping data. Instead of copying all the files for each subset, I would prefer to create symbolic links to the data. This way, disk usage would not grow significantly beyond the initial dataset size.

Advantages of this approach:

It would leave a smaller footprint on the hard drive
Creating smaller datasets would be much faster

Your contribution

I would gladly contribute if this is something useful to the community. It seems like a simple code change: something like file_path = os.path.realpath(file_path) should be added before loading the files. If anyone has insights on how to incorporate this functionality, I would greatly appreciate your input.
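To make the suggestion concrete, here is a minimal sketch of what os.path.realpath does to a symlinked path. The helper name resolve_symlinks and the file names are hypothetical, and the sketch does not touch the datasets library; where (or whether) such a call belongs inside load_dataset is an open question, not something confirmed here.

```python
import os
import tempfile


def resolve_symlinks(paths):
    """Return each path with symbolic links resolved to the real file.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    return [os.path.realpath(p) for p in paths]


with tempfile.TemporaryDirectory() as root:
    # The temp directory itself may sit behind a symlink on some systems,
    # so resolve it first to make the comparison below meaningful.
    root = os.path.realpath(root)

    # A real file, standing in for one audio file of the dataset.
    target = os.path.join(root, "file0")
    with open(target, "w") as f:
        f.write("audio placeholder")

    # A symbolic link pointing at the real file.
    link = os.path.join(root, "sym0")
    os.symlink(target, link)

    resolved = resolve_symlinks([link])
    print(resolved[0] == target)
```

Note that the resolved path points at the original file's location, so any logic that infers the split from the directory name (train/ or dev/) would have to run on the unresolved path.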

Mar 29 '24 17:03 VladimirVincan

In fact, you can use a hard link instead of a symbolic link. Hard links work.
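A minimal sketch of building a subset with hard links, assuming the subset lives on the same filesystem as the original files (hard links cannot cross filesystems). The helper name hard_link_subset and the directory layout are hypothetical:

```python
import os
import tempfile


def hard_link_subset(src_files, dest_dir):
    """Hard-link each source file into dest_dir and return the new paths.

    Hypothetical helper for illustration; source and destination must be
    on the same filesystem.
    """
    os.makedirs(dest_dir, exist_ok=True)
    linked = []
    for src in src_files:
        dest = os.path.join(dest_dir, os.path.basename(src))
        os.link(src, dest)  # creates a hard link, not a copy
        linked.append(dest)
    return linked


with tempfile.TemporaryDirectory() as root:
    # A real file, standing in for one audio file of the full dataset.
    src = os.path.join(root, "file0")
    with open(src, "w") as f:
        f.write("audio placeholder")

    # Build a small subset that shares the file's data on disk.
    linked = hard_link_subset([src], os.path.join(root, "subset", "train"))

    same_file = os.path.samefile(src, linked[0])  # both names refer to the same file
    is_symlink = os.path.islink(linked[0])        # a hard link is not a symlink
    print(same_file, is_symlink)
```

Because a hard link is an ordinary directory entry for the same file, it needs no extra disk space for the data and looks like a regular file to any code that reads it.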

Apr 29 '25 15:04 TangGuohh