Cannot create a dataset with relative audio path

Open sedol1339 opened this issue 1 year ago • 4 comments

Describe the bug

Hello! I want to create a dataset of Parquet files, with the audio stored as separate .mp3 files. However, accessing an example fails with "No such file or directory" (see the reproduction code below).

Steps to reproduce the bug

Creating a dataset

from pathlib import Path
from datasets import Dataset, load_dataset, Audio

Path('my_dataset/audio').mkdir(parents=True, exist_ok=True)
Path('my_dataset/audio/file.mp3').touch(exist_ok=True)
Dataset.from_list(
    [{'audio': {'path': 'audio/file.mp3'}}]
).to_parquet('my_dataset/data.parquet')

Result:

# my_dataset
# ├── audio
# │   └── file.mp3
# └── data.parquet

Trying to load the dataset

dataset = (
    load_dataset('my_dataset', split='train')
    .cast_column('audio', Audio(sampling_rate=16_000))
)
dataset[0]

>>> FileNotFoundError: [Errno 2] No such file or directory: 'audio/file.mp3'

Expected behavior

I expect the dataset to load correctly.

I've found 2 workarounds, but they are not very good:

  1. I can specify an absolute path to the audio, however, when I move the folder or upload to HF it will stop working.
  2. I can set 'path': 'file.mp3', and load with load_dataset('my_dataset', data_dir='audio') - it seems to work, but does this mean that anyone from Hugging Face who wants to use this dataset should also pass the data_dir argument, otherwise it won't work?

Environment info

datasets 3.1.0, Ubuntu 24.04.1

sedol1339, Dec 09 '24 07:12

Hello! When you call cast_column, the paths need to be absolute, or relative to your working directory, not to the original dataset directory.

Though I'd recommend structuring your dataset as an AudioFolder which automatically links a metadata.jsonl or csv to the audio files via relative paths within the dataset repository: https://huggingface.co/docs/datasets/v3.2.0/en/audio_load#audiofolder
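To illustrate the first point (paths are resolved against the working directory), here is a minimal sketch of rewriting the stored relative paths into absolute ones at load time. The helper name `resolve_audio_path` and its `dataset_root` parameter are hypothetical, not part of the datasets library, and the sketch assumes each example has an `audio` field shaped like `{'path': ...}` as in the reproduction code.

```python
from pathlib import Path


def resolve_audio_path(example: dict, dataset_root: str) -> dict:
    """Return a copy of `example` whose audio path is absolute.

    Hypothetical helper: relative paths are interpreted as relative to
    `dataset_root` (the folder holding data.parquet), not to the current
    working directory. Absolute paths are left unchanged.
    """
    audio = dict(example["audio"])  # copy so the input is not mutated
    path = Path(audio["path"])
    if not path.is_absolute():
        audio["path"] = str(Path(dataset_root).resolve() / path)
    return {**example, "audio": audio}
```

With datasets installed, this could presumably be applied before the cast, for example `load_dataset('my_dataset', split='train').map(resolve_audio_path, fn_kwargs={'dataset_root': 'my_dataset'}).cast_column('audio', Audio(sampling_rate=16_000))`. The absolute paths exist only in the loaded copy, so the Parquet file on disk keeps its relative paths and the folder can still be moved.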

lhoestq, Dec 11 '24 13:12

@lhoestq thank you, but there are two problems with using AudioFolder:

  1. It is said that AudioFolder requires metadata.csv. However, my dataset is too large and contains nested and np.ndarray fields, so I can't use CSV.
  2. It is said that I need to load the dataset with load_dataset("audiofolder", ...). However, if possible, I want my dataset to be loaded as usual with load_dataset(dataset_name) after I upload it to HF.

sedol1339, Dec 11 '24 17:12

You can use metadata.jsonl if you have nested data :)

And actually if you have a dataset structured as an AudioFolder then load_dataset(dataset_name) does work after uploading to HF
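As a rough sketch of what that could look like, the snippet below writes a metadata.jsonl next to the audio folder, with one JSON object per line. The `file_name` key is the column AudioFolder uses to link a row to its audio file, as a path relative to the metadata file; the helper `write_metadata` and the other field names are made up for illustration. Array-valued fields are shown as plain lists, on the assumption that np.ndarray values would be converted with `.tolist()` first.

```python
import json
from pathlib import Path


def write_metadata(dataset_root: str, records: list) -> Path:
    """Write one JSON object per line to <dataset_root>/metadata.jsonl.

    Hypothetical helper: each record needs a "file_name" entry holding the
    audio path relative to the metadata file; other keys are free-form and
    may be nested.
    """
    root = Path(dataset_root)
    root.mkdir(parents=True, exist_ok=True)
    metadata_path = root / "metadata.jsonl"
    with metadata_path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return metadata_path


# Illustrative record with a nested field and an array stored as a list.
records = [
    {
        "file_name": "audio/file.mp3",
        "speaker": {"id": 1, "tags": ["a", "b"]},
        "embedding": [0.1, 0.2, 0.3],
    }
]
```

Calling `write_metadata('my_dataset', records)` would place metadata.jsonl beside the audio folder, after which `load_dataset('my_dataset', split='train')` should pick up the layout as described above.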

lhoestq, Dec 12 '24 13:12

I have created an audio dataset. In my repo, I have explained the steps and structure. An example dataset is also available in the repo. https://github.com/pr0mila/ParquetToHuggingFace

pr0mila, Apr 19 '25 07:04