load_dataset can't work with symbolic links
Feature request
Enable the load_dataset function to load local datasets with symbolic links.
E.g., this dataset can be loaded:
├── example_dataset/
│   ├── data/
│   │   ├── train/
│   │   │   ├── file0
│   │   │   ├── file1
│   │   ├── dev/
│   │   │   ├── file2
│   │   │   ├── file3
│   ├── metadata.csv
while this dataset can't:
├── example_dataset_symlink/
│   ├── data/
│   │   ├── train/
│   │   │   ├── sym0 -> file0
│   │   │   ├── sym1 -> file1
│   │   ├── dev/
│   │   │   ├── sym2 -> file2
│   │   │   ├── sym3 -> file3
│   ├── metadata.csv
I have created an example dataset in order to reproduce the problem:
- Unzip example_dataset.zip.
- Run no_symlink.sh. Training should start without issues.
- Run symlink.sh. You will see that all four examples end up in the train split, instead of two examples in train and two in dev. The script won't load the correct audio files.
Motivation
I have a very large dataset locally. Instead of training on the entire dataset, I need to start by training on smaller subsets of the data. Because of the experiments I am running, I will need to create many smaller datasets with overlapping data. Instead of copying all the files for each subset, I would prefer to create symbolic links to the data. This way, disk usage would not increase significantly beyond the initial dataset size.
Advantages of this approach:
- It would leave a smaller footprint on the hard drive
- Creating smaller datasets would be much faster
Your contribution
I would gladly contribute if this is something useful to the community. It seems like a simple code change: something like file_path = os.path.realpath(file_path) should be added before loading the files. If anyone has insights on how to incorporate this functionality, I would greatly appreciate your input.
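To make the idea concrete, here is a minimal sketch of what os.path.realpath does to a symlinked file. The helper name resolve_symlinks is hypothetical and is not part of the datasets library. I have not checked where in load_dataset such a call would belong, or whether it alone would fix the split assignment. One caveat: resolving changes the path, so any logic that infers the split from directory names would probably still need the original, unresolved path.

```python
import os
import tempfile


def resolve_symlinks(file_paths):
    """Return the given paths with symbolic links resolved to their real targets.

    Hypothetical helper for illustration only; not a datasets API.
    """
    return [os.path.realpath(p) for p in file_paths]


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "file0")
    link = os.path.join(root, "sym0")
    with open(target, "w") as f:
        f.write("audio placeholder")
    os.symlink(target, link)  # sym0 -> file0

    resolved = resolve_symlinks([link])[0]
    was_link = os.path.islink(link)              # the original path is a symlink
    resolved_name = os.path.basename(resolved)   # the resolved path points at file0
```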
In fact, you can use a hard link instead of a symbolic link. Hard links work.
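A rough sketch of building a subset with hard links, using only the Python standard library. The function name hardlink_subset and the directory names are made up for this example. Note that hard links only work when the source and the destination are on the same filesystem.

```python
import os
import tempfile


def hardlink_subset(src_dir, dst_dir, file_names):
    """Create dst_dir and hard-link the selected files from src_dir into it.

    Hypothetical helper for illustration only.
    """
    os.makedirs(dst_dir, exist_ok=True)
    for name in file_names:
        os.link(os.path.join(src_dir, name), os.path.join(dst_dir, name))


with tempfile.TemporaryDirectory() as root:
    # Stand-in for the full dataset's train folder.
    full_train = os.path.join(root, "full", "data", "train")
    os.makedirs(full_train)
    for name in ("file0", "file1"):
        with open(os.path.join(full_train, name), "w") as f:
            f.write(name)

    # Subset that contains only file0, as a hard link rather than a copy.
    subset_train = os.path.join(root, "subset", "data", "train")
    hardlink_subset(full_train, subset_train, ["file0"])

    linked = os.path.join(subset_train, "file0")
    is_symlink = os.path.islink(linked)  # a hard link is not a symlink
    same_file = os.path.samefile(linked, os.path.join(full_train, "file0"))
    subset_files = sorted(os.listdir(subset_train))
```

Since a hard link is just another directory entry for the same file, the loader sees a regular file and no extra disk space is used for the data itself.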