
Audio.cast_column() or Audio.decode_example() causes Colab kernel crash (std::bad_alloc)

Open rachidio opened this issue 3 months ago • 8 comments

Describe the bug

When using the Hugging Face datasets.Audio feature to decode a local or remote (public HF dataset) audio file inside Google Colab, the notebook kernel crashes with std::bad_alloc (a C++ memory allocation failure). The crash happens even with a minimal code example and a valid .wav file that can be read successfully using soundfile.

Here is a sample Colab notebook to reproduce the problem: https://colab.research.google.com/drive/1nnb-GC5748Tux3xcYRussCGp2x-zM9Id?usp=sharing

code sample:

...
audio_dataset = audio_dataset.cast_column("audio", Audio(sampling_rate=16000))

# Accessing the first element crashes the Colab kernel
print(audio_dataset[0]["audio"])

Error log

WARNING what(): std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
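
Because an uncaught std::bad_alloc terminates the whole process, the notebook kernel dies along with it. One way to probe a crash like this without losing the kernel is to run the suspect step in a child Python process and inspect its exit status. The sketch below is only an illustration of that idea: `run_isolated` is a hypothetical helper name, not part of datasets, and the snippet passed to it is a placeholder for the actual decoding code.

```python
import subprocess
import sys


def run_isolated(code, timeout=120):
    """Run a Python snippet in a separate interpreter process.

    Hypothetical helper: if the snippet aborts at the C++ level,
    only the child process dies, not the calling kernel.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    # Placeholder snippet; substitute the decoding step under test.
    snippet = "print('decoding step goes here')"
    returncode, out, err = run_isolated(snippet)
    print("exit status:", returncode)
    print("stdout:", out.strip())
    print("stderr:", err.strip())
```

A non-zero exit status means the child did not finish cleanly; on POSIX systems a negative value means it was killed by a signal, and any message such as the one above would appear in the captured stderr.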

Environment

Platform: Google Colab (Python 3.12.12)
datasets Version: 4.3.0
soundfile Version: 0.13.1
torchaudio Version: 2.8.0+cu126

Thanks in advance for your help with this error. I have been getting it for about two weeks now; the same code was working before.

Regards

Steps to reproduce the bug

https://colab.research.google.com/drive/1nnb-GC5748Tux3xcYRussCGp2x-zM9Id?usp=sharing

Expected behavior

The audio should load and decode without crashing. It should safely return:

{ "path": "path/filename.wav", "array": np.ndarray([...]), "sampling_rate": 16000 }

Environment info

Platform: Google Colab (Python 3.12.12)
datasets Version: 4.3.0
soundfile Version: 0.13.1
torchaudio Version: 2.8.0+cu126

rachidio avatar Oct 27 '25 22:10 rachidio

Hi! datasets v4 uses torchcodec for audio decoding (previous versions used soundfile). What is your torchcodec version? Can you try other versions of torchcodec and see if it works?

lhoestq avatar Oct 28 '25 15:10 lhoestq

When I install datasets with pip install datasets[audio], it installs this version of torchcodec:

Name: torchcodec
Version: 0.8.1

Can you please point to a working version of torchcodec?

Thanks for your help

rachidio avatar Oct 28 '25 23:10 rachidio

I believe you simply need to make sure the torchcodec and torch versions work together. Here is how to fix it:

!pip install -U torchcodec torch
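
To see which versions are actually installed before and after upgrading, a small check like the one below can help. This is only a sketch: `installed_versions` is a hypothetical helper name, and the package list is an assumption based on the packages mentioned in this thread. It reports what is installed and does not itself verify that the versions are compatible.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("datasets", "torch", "torchaudio", "torchcodec")):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to None.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

In a notebook, restart the runtime after upgrading so the new versions are the ones that get imported.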

lhoestq avatar Oct 29 '25 13:10 lhoestq

I am also encountering this same issue when I run print(ug_court["train"][0]) to view the features of the first row of my audio data.

angelocodes avatar Oct 30 '25 12:10 angelocodes

The problem also persists when I go ahead and run training.

angelocodes avatar Oct 30 '25 12:10 angelocodes

Thank you @lhoestq. I've reinstalled the packages and the error is gone. My new versions are:

Name: torch
Version: 2.8.0
---
Name: torchaudio
Version: 2.8.0
---
Name: torchcodec
Version: 0.8.1

Regards

rachidio avatar Oct 30 '25 22:10 rachidio

Mine works now too.

angelocodes avatar Oct 30 '25 22:10 angelocodes

Hi,

I encounter the same problem when trying to inspect the first element in the dataset. My environment is:

root@3ac6f9f8c6c4:/workspace# pip3 list | grep torch
pytorch-lightning         2.5.6
pytorch-metric-learning   2.9.0
torch                     2.8.0+cu126
torch-audiomentations     0.12.0
torch_pitch_shift         1.2.5
torchaudio                2.8.0+cu126
torchcodec                0.8.1
torchelastic              0.2.2
torchmetrics              1.8.2
torchvision               0.23.0+cu126

These are the same versions as @rachidio's new setup that works.

I am in a Docker container environment, and here is the code I am working with:

Image

GryffindorLi avatar Nov 15 '25 16:11 GryffindorLi