Cannot create a dataset with relative audio path
Describe the bug
Hello! I want to create a dataset of parquet files, with audios stored as separate .mp3 files. However, it says "No such file or directory" (see the reproducing code).
Steps to reproduce the bug
Creating a dataset
from pathlib import Path
from datasets import Dataset, load_dataset, Audio
Path('my_dataset/audio').mkdir(parents=True, exist_ok=True)
Path('my_dataset/audio/file.mp3').touch(exist_ok=True)
Dataset.from_list(
[{'audio': {'path': 'audio/file.mp3'}}]
).to_parquet('my_dataset/data.parquet')
Result:
# my_dataset
# ├── audio
# │ └── file.mp3
# └── data.parquet
Trying to load the dataset
dataset = (
load_dataset('my_dataset', split='train')
.cast_column('audio', Audio(sampling_rate=16_000))
)
dataset[0]
>>> FileNotFoundError: [Errno 2] No such file or directory: 'audio/file.mp3'
Expected behavior
I expect the dataset to load correctly.
I've found 2 workarounds, but they are not very good:
- I can specify an absolute path to the audio, however, when I move the folder or upload to HF it will stop working.
- I can set 'path': 'file.mp3', and load with load_dataset('my_dataset', data_dir='audio') - it seems to work, but does this mean that anyone from Hugging Face who wants to use this dataset should also pass the data_dir argument, otherwise it won't work?
Environment info
datasets 3.1.0, Ubuntu 24.04.1
Hello! When you call cast_column, the paths need to be absolute or relative to your current working directory, not to the original dataset directory.
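To illustrate the point about working-directory-relative paths, here is a minimal sketch that rewrites a stored relative path into an absolute one at load time. The helper name `resolve_audio_path` and its `dataset_root` parameter are hypothetical, and it assumes the 'audio' column is a dict with a 'path' key, as in the reproducing code.

```python
from pathlib import Path


def resolve_audio_path(example: dict, dataset_root: str) -> dict:
    """Return a copy of `example` whose audio path is absolute.

    Hypothetical helper: assumes example["audio"] is a dict with a "path"
    key holding a path relative to `dataset_root`.
    """
    audio = dict(example["audio"])  # copy so the input row is not mutated
    path = Path(audio["path"])
    if not path.is_absolute():
        # Anchor the stored relative path at the dataset directory
        # instead of the current working directory.
        path = Path(dataset_root).resolve() / path
    audio["path"] = str(path)
    return {**example, "audio": audio}


row = {"audio": {"path": "audio/file.mp3"}}
fixed = resolve_audio_path(row, "my_dataset")
print(fixed["audio"]["path"])
```

Something like dataset.map(resolve_audio_path, fn_kwargs={"dataset_root": "my_dataset"}) before cast_column might then work, since the parquet file keeps the relative path and the absolute one is only computed on the machine that loads it. This is untested against the setup in this issue.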
Though I'd recommend structuring your dataset as an AudioFolder which automatically links a metadata.jsonl or csv to the audio files via relative paths within the dataset repository: https://huggingface.co/docs/datasets/v3.2.0/en/audio_load#audiofolder
@lhoestq thank you, but there are two problems with using AudioFolder:
- The docs say that AudioFolder requires metadata.csv. However, my dataset is too large and contains nested and np.ndarray fields, so I can't use CSV.
- The docs say that I need to load the dataset with load_dataset("audiofolder", ...). However, if possible, I want my dataset to be loaded as usual with load_dataset(dataset_name) after I upload it to HF.
You can use metadata.jsonl if you have nested data :)
And actually if you have a dataset structured as an AudioFolder then load_dataset(dataset_name) does work after uploading to HF
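As a rough sketch of the AudioFolder layout with nested data, the snippet below writes a metadata.jsonl next to an audio folder using only the standard library. The function name `write_audiofolder_metadata` and the example fields `speaker` and `embedding` are made up for illustration; the `file_name` key is the column AudioFolder uses to link each row to its audio file. It assumes any np.ndarray values have already been converted with .tolist(), since JSON cannot store arrays directly.

```python
import json
import tempfile
from pathlib import Path


def write_audiofolder_metadata(root: Path, records: list) -> Path:
    """Write one JSON object per line to <root>/metadata.jsonl.

    Hypothetical helper: each record needs a "file_name" entry holding the
    audio path relative to the metadata file.
    """
    root.mkdir(parents=True, exist_ok=True)
    metadata_path = root / "metadata.jsonl"
    with open(metadata_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return metadata_path


# A temporary directory keeps the sketch from touching real data.
root = Path(tempfile.mkdtemp()) / "my_dataset"
(root / "audio").mkdir(parents=True)
(root / "audio" / "file.mp3").touch()  # empty placeholder, not real audio

records = [
    {
        "file_name": "audio/file.mp3",
        "speaker": {"id": 1, "tags": ["a", "b"]},  # nested field
        "embedding": [0.1, 0.2],  # e.g. the result of an ndarray's .tolist()
    }
]
metadata_path = write_audiofolder_metadata(root, records)
print(metadata_path.read_text(encoding="utf-8"))
```

With this layout the relative paths live in metadata.jsonl, so the folder can be moved or uploaded without rewriting them.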
I have created an audio dataset. In my repo, I have explained the steps and structure. An example dataset is also available in the repo. https://github.com/pr0mila/ParquetToHuggingFace