Columns in the dataset obtained through load_dataset do not correspond to the ones in the dataset viewer since 3.4.0
Describe the bug
I have noticed that on my dataset, BrunoHays/Accueil_UBS, every column except audio is missing when I load the dataset with version 3.4.0 or later.
Interestingly, the dataset viewer still shows the correct columns.
Steps to reproduce the bug
from datasets import load_dataset
ds = load_dataset("BrunoHays/Accueil_UBS", streaming=True)
print(next(iter(ds["test"])).keys())
With datasets >= 3.4.0:
-> dict_keys(['audio'])
With datasets == 3.3.2:
-> dict_keys(['audio', 'id', 'speaker', 'sentence', 'raw_sentence', 'start_timestamp', 'end_timestamp', 'overlap'])
Expected behavior
All the columns should be present
Environment info
- datasets version: 3.3.2
- Platform: macOS-14.6.1-x86_64-i386-64bit
- Python version: 3.10.15
- huggingface_hub version: 0.30.1
- PyArrow version: 16.1.0
- Pandas version: 1.5.3
- fsspec version: 2023.10.0
Hi, the dataset viewer shows all the possible columns and their types, but load_dataset() only iterates through the columns that you defined. It seems that you only have one column ('audio') defined in your dataset: when I ran print(ds.column_names), the only name I got was 'audio'. You need to explicitly define all the other features of the dataset as columns for your original code to work. You can also run this code to print out all the features of your dataset:
from datasets import load_dataset_builder
ds_builder = load_dataset_builder("BrunoHays/Accueil_UBS")
print(ds_builder.info.features)
@phoebecd Good catch, even in datasets<3.4.0, the only feature is "audio". This dataset follows the audio folder structure with metadata.csv. Maybe I missed something, or there is a bug when using an audio_folder with a metadata file.
What do you think @lhoestq ?
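In case it helps with debugging, here is a minimal sketch for checking which columns each metadata.csv in a local copy of the repository declares. It assumes the repository has been cloned or downloaded locally; the helper names metadata_columns and report are made up for illustration and are not part of datasets. It only reads CSV headers and does not reproduce how the audio folder loader resolves metadata files. The check for a file_name column reflects my understanding that the loader uses that column to link each audio file to its metadata row.

```python
import csv
from pathlib import Path


def metadata_columns(root):
    """Map each metadata.csv found under `root` to its header columns.

    Hypothetical helper for local inspection only; it does not call `datasets`.
    """
    found = {}
    for path in sorted(Path(root).rglob("metadata.csv")):
        with path.open(newline="", encoding="utf-8") as f:
            # The first CSV row is the header; an empty file yields no columns.
            header = next(csv.reader(f), [])
        found[str(path.relative_to(root))] = header
    return found


def report(root):
    """Print where each metadata.csv sits and whether it has a file_name column."""
    for rel_path, columns in metadata_columns(root).items():
        has_file_name = "file_name" in columns
        print(f"{rel_path}: columns={columns}, file_name present: {has_file_name}")
```

Calling report("path/to/local/copy") lists the location of every metadata.csv relative to the repository root, which can then be compared with the layout the loader expects.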
I opened a PR to fix the issue :) https://huggingface.co/datasets/BrunoHays/Accueil_UBS/discussions/2
We expect the metadata file to be in the