datasets cannot handle nested json if features is given.

Open ljw20180420 opened this issue 1 year ago • 1 comments

Describe the bug

I have a JSON file named temp.json:

{"ref1": "ABC", "ref2": "DEF", "cuts":[{"cut1": 3, "cut2": 5}]}

I want to load it with explicit features:

import datasets

ds = datasets.load_dataset('json', data_files="./temp.json", features=datasets.Features({
    'ref1': datasets.Value('string'),
    'ref2': datasets.Value('string'),
    'cuts': datasets.Sequence({
        "cut1": datasets.Value("uint16"),
        "cut2": datasets.Value("uint16")
    })
}))

The code above does not work. However, the file loads when I omit features:

ds = datasets.load_dataset('json', data_files="./temp.json")

Is it possible to load integers as uint16 to save some memory?
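As a rough sketch of the saving behind this question: a uint16 occupies 2 bytes per value, while a 64-bit integer occupies 8. The snippet below assumes the loader would otherwise infer a 64-bit integer type for JSON numbers, which is an assumption here and not something stated in this thread. The helper `fits_uint16` is hypothetical and only checks that the values are in range before a narrower type is chosen.

```python
import struct

UINT16_MAX = 2**16 - 1  # largest value a uint16 can hold (65535)


def fits_uint16(values):
    """Hypothetical helper: True if every value is in the uint16 range."""
    return all(0 <= v <= UINT16_MAX for v in values)


# The two cut positions from the example record in temp.json.
cuts = [3, 5]

# Standard sizes ("<" prefix): "H" is an unsigned 16-bit integer,
# "q" is a signed 64-bit integer (assumed default, see the note above).
bytes_uint16 = struct.calcsize("<H") * len(cuts)
bytes_int64 = struct.calcsize("<q") * len(cuts)

print(fits_uint16(cuts))
print(bytes_uint16, bytes_int64)
```

This counts raw value bytes only. Actual memory use also depends on per-column overhead such as list offsets, so the real ratio may be smaller.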

Steps to reproduce the bug

As in the bug description.

Expected behavior

The data load successfully and the integers are stored as uint16.

Environment info

  • datasets version: 2.21.0
  • Platform: Linux-5.15.0-118-generic-x86_64-with-glibc2.35
  • Python version: 3.11.9
  • huggingface_hub version: 0.24.5
  • PyArrow version: 17.0.0
  • Pandas version: 2.2.2
  • fsspec version: 2024.5.0

ljw20180420 avatar Aug 20 '24 12:08 ljw20180420

Hi! Sequence behaves unexpectedly with dictionaries (a behavior inherited from tensorflow-datasets). Use a regular list instead:

ds = datasets.load_dataset('json', data_files="./temp.json", features=datasets.Features({
    'ref1': datasets.Value('string'),
    'ref2': datasets.Value('string'),
    'cuts': [{
        "cut1": datasets.Value("uint16"),
        "cut2": datasets.Value("uint16")
    }]
}))
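To illustrate why the two declarations differ: per the datasets documentation, a Sequence wrapping a dictionary is converted into a dictionary of lists, whereas the record in temp.json stores cuts as a list of dictionaries. The plain-Python sketch below only shows the two layouts side by side. It does not call the datasets library, and `to_dict_of_lists` is a hypothetical helper written for this illustration.

```python
import json

# The record from temp.json: "cuts" is a list of dicts, one dict per cut.
record = json.loads(
    '{"ref1": "ABC", "ref2": "DEF", "cuts":[{"cut1": 3, "cut2": 5}]}'
)


def to_dict_of_lists(rows):
    """Hypothetical helper: transpose a list of dicts into a dict of lists."""
    keys = rows[0].keys() if rows else []
    return {key: [row[key] for row in rows] for key in keys}


# Layout found in the file; matches the [{...}] feature declaration.
list_of_dicts = record["cuts"]

# Layout that a Sequence of a dict corresponds to.
dict_of_lists = to_dict_of_lists(list_of_dicts)

print(list_of_dicts)  # [{'cut1': 3, 'cut2': 5}]
print(dict_of_lists)  # {'cut1': [3], 'cut2': [5]}
```

Declaring the feature as a regular list of dictionaries keeps the schema in the same shape as the JSON on disk.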

lhoestq avatar Aug 22 '24 15:08 lhoestq

Thank you!

ljw20180420 avatar Sep 03 '24 10:09 ljw20180420

It works.

ljw20180420 avatar Sep 03 '24 10:09 ljw20180420