
Columns in the dataset obtained through load_dataset do not correspond to the ones in the dataset viewer since 3.4.0

Open bruno-hays opened this issue 10 months ago • 1 comments

Describe the bug

I have noticed that, since version 3.4.0, every column except audio is missing when I load my dataset BrunoHays/Accueil_UBS.

Interestingly, the dataset viewer still shows the correct columns.

Steps to reproduce the bug

from datasets import load_dataset
ds = load_dataset("BrunoHays/Accueil_UBS", streaming=True)
print(next(iter(ds["test"])).keys())

With datasets >= 3.4.0: -> dict_keys(['audio'])

With datasets == 3.3.2: -> dict_keys(['audio', 'id', 'speaker', 'sentence', 'raw_sentence', 'start_timestamp', 'end_timestamp', 'overlap'])
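To make the regression easy to check across versions, the comparison can be reduced to a small helper. This is only a sketch: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names that are not part of `datasets`, the expected list is copied from the 3.3.2 output reported here, and the `observed` dict is a stand-in for a streamed example so that the snippet runs without network access.

```python
# Column names reported in this issue for datasets == 3.3.2.
EXPECTED_COLUMNS = [
    "audio",
    "id",
    "speaker",
    "sentence",
    "raw_sentence",
    "start_timestamp",
    "end_timestamp",
    "overlap",
]


def missing_columns(example, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from one example.

    `example` is any mapping keyed by column name, such as the dict
    yielded when iterating over a streamed split.
    """
    return [name for name in expected if name not in example]


# Stand-in for the example observed with datasets >= 3.4.0,
# where only the 'audio' key was present.
observed = {"audio": None}
print(missing_columns(observed))
```

With the real dataset, the same helper could be called as `missing_columns(next(iter(ds["test"])))`, where `ds` is the object returned by the `load_dataset` call in the reproduction steps.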

Expected behavior

All the columns should be present

Environment info

  • datasets version: 3.3.2
  • Platform: macOS-14.6.1-x86_64-i386-64bit
  • Python version: 3.10.15
  • huggingface_hub version: 0.30.1
  • PyArrow version: 16.1.0
  • Pandas version: 1.5.3
  • fsspec version: 2023.10.0

bruno-hays avatar Apr 02 '25 17:04 bruno-hays

Hi, the dataset viewer shows all the possible columns and their types, but load_dataset() only iterates over the columns that are defined for the dataset. It seems that only one column ('audio') is defined in your dataset: when I ran print(ds.column_names), the only name I got was 'audio'. You need to define all the other features of the dataset as columns for your original code to work. You can also run this code to print all the features of your dataset:

from datasets import load_dataset_builder
ds_builder = load_dataset_builder("BrunoHays/Accueil_UBS")
print(ds_builder.info.features)

phoebecd avatar May 19 '25 13:05 phoebecd

@phoebecd Good catch, even in datasets<3.4.0, the only feature is "audio". This dataset follows the audio folder structure with metadata.csv. Maybe I missed something, or there is a bug when using an audio folder with a metadata file.

What do you think @lhoestq ?

bruno-hays avatar Jul 01 '25 15:07 bruno-hays

I opened a PR to fix the issue :) https://huggingface.co/datasets/BrunoHays/Accueil_UBS/discussions/2

We expect the metadata file to be in the / folder now to allow one CSV metadata file per split. But in the PR I just added a manual configuration instead of moving the file and updating all the relative paths it contains.
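For reference, the alternative that was not taken (moving the metadata file and updating its relative paths) could look roughly like the sketch below. It assumes the metadata rows use the `file_name` column that the folder-based loaders read, and that paths are POSIX-style and relative to the folder holding the metadata file. The helper name `rebase_file_names` and the folder name `test` are hypothetical and do not reflect the actual layout of this repository.

```python
import posixpath


def rebase_file_names(rows, old_dir, new_dir):
    """Return copies of metadata rows with 'file_name' made relative to new_dir.

    rows: list of dicts, each holding a 'file_name' path relative to old_dir.
    old_dir, new_dir: repository-relative folders ("" means the repo root).
    """
    rebased = []
    for row in rows:
        # Location of the audio file relative to the repository root.
        repo_path = posixpath.normpath(posixpath.join(old_dir, row["file_name"]))
        # The same file, expressed relative to the metadata file's new folder.
        new_row = dict(row)
        new_row["file_name"] = posixpath.relpath(repo_path, start=new_dir or ".")
        rebased.append(new_row)
    return rebased


# Hypothetical example: metadata.csv moves from the repo root into "test/".
rows = [{"file_name": "test/clip_0001.wav", "sentence": "bonjour"}]
print(rebase_file_names(rows, old_dir="", new_dir="test"))
```

The rows could be read and written back with `csv.DictReader` and `csv.DictWriter`. The manual configuration added in the PR avoids this rewrite entirely.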

lhoestq avatar Jul 01 '25 15:07 lhoestq