datasets cannot handle nested json if features is given.
Describe the bug
I have a JSON file named temp.json:
{"ref1": "ABC", "ref2": "DEF", "cuts":[{"cut1": 3, "cut2": 5}]}
I want to load it with explicit features:
import datasets

ds = datasets.load_dataset('json', data_files="./temp.json", features=datasets.Features({
    'ref1': datasets.Value('string'),
    'ref2': datasets.Value('string'),
    'cuts': datasets.Sequence({
        "cut1": datasets.Value("uint16"),
        "cut2": datasets.Value("uint16")
    })
}))
The above code does not work. However, I can load it without giving features.
ds = datasets.load_dataset('json', data_files="./temp.json")
Is it possible to load integers as uint16 to save some memory?
Steps to reproduce the bug
As in the bug description.
Expected behavior
The data are loaded and integers are uint16.
Environment info
- datasets version: 2.21.0
- Platform: Linux-5.15.0-118-generic-x86_64-with-glibc2.35
- Python version: 3.11.9
- huggingface_hub version: 0.24.5
- PyArrow version: 17.0.0
- Pandas version: 2.2.2
- fsspec version: 2024.5.0
Hi! Sequence behaves unexpectedly for dictionaries (a behavior inherited from tensorflow-datasets): a Sequence of a dictionary is treated as a dictionary of lists, which does not match the list of dictionaries in your JSON. Use a regular list instead:
ds = datasets.load_dataset('json', data_files="./temp.json", features=datasets.Features({
    'ref1': datasets.Value('string'),
    'ref2': datasets.Value('string'),
    'cuts': [{
        "cut1": datasets.Value("uint16"),
        "cut2": datasets.Value("uint16")
    }]
}))
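To make the difference concrete, here is a plain-Python sketch of the two layouts. It does not call datasets at all: to_dict_of_lists is a hypothetical helper written only for illustration, and the claim that Sequence({...}) corresponds to the dict-of-lists layout is my understanding of the 2.x behavior rather than something verified against this exact version.

```python
import json

# One record from temp.json, copied from the report above.
record = json.loads(
    '{"ref1": "ABC", "ref2": "DEF", "cuts":[{"cut1": 3, "cut2": 5}]}'
)


def to_dict_of_lists(rows):
    """Hypothetical helper: turn a list of dicts into a dict of lists."""
    keys = rows[0].keys() if rows else []
    return {key: [row[key] for row in rows] for key in keys}


# Layout stored in the JSON file; this is what the [{...}] feature describes.
list_of_dicts = record["cuts"]

# Layout that Sequence({...}) is assumed to describe instead.
dict_of_lists = to_dict_of_lists(list_of_dicts)

print(list_of_dicts)
print(dict_of_lists)
```

The file holds the first layout, so declaring the feature as a regular list of dicts keeps the declared schema and the data in agreement.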
Thank you!
It works.