datasets
datasets copied to clipboard
Error loading StonyBrookNLP/tellmewhy dataset from hub even though local copy loads correctly
Describe the bug
I have added a new dataset with the identifier StonyBrookNLP/tellmewhy to the hub. When I load the individual files using my local copy using dataset = datasets.load_dataset("json", data_files="data/train.jsonl"), it loads the dataset correctly. However, when I try to load it from the hub, I get an error (pasted below).
Steps to reproduce the bug
dataset = datasets.load_dataset('StonyBrookNLP/tellmewhy')
Expected results
Successfully load the StonyBrookNLP/tellmewhy dataset.
Actual results
Using custom data configuration StonyBrookNLP--tellmewhy-82712924092694ff
Downloading and preparing dataset json/StonyBrookNLP--tellmewhy to /home/yklal95/.cache/huggingface/datasets/StonyBrookNLP___json/StonyBrookNLP--tellmewhy-82712924092694ff/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253...
Downloading data files: 100%|██████████████████████████████| 3/3 [00:00<00:00, 957.46it/s]
Extracting data files: 100%|███████████████████████████████| 3/3 [00:00<00:00, 299.14it/s]
Traceback (most recent call last):
File "/home/yklal95/tmw-generalization/src/load_datasets.py", line 17, in <module>
main(args)
File "/home/yklal95/tmw-generalization/src/load_datasets.py", line 11, in main
dataset = datasets.load_dataset(args.dataset_name)
File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/load.py", line 1746, in load_dataset
builder_instance.download_and_prepare(
File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/builder.py", line 704, in download_and_prepare
self._download_and_prepare(
File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/builder.py", line 793, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/builder.py", line 1277, in _prepare_split
writer.write_table(table)
File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/arrow_writer.py", line 524, in write_table
pa_table = table_cast(pa_table, self._schema)
File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 2005, in table_cast
return cast_table_to_schema(table, schema)
File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1969, in cast_table_to_schema
arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1969, in <listcomp>
arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1681, in wrapper
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1681, in <listcomp>
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1822, in cast_array_to_feature
casted_values = _c(array.values, feature.feature)
File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1683, in wrapper
return func(array, *args, **kwargs)
File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1853, in cast_array_to_feature
return array_cast(array, feature(), allow_number_to_str=allow_number_to_str)
File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1683, in wrapper
return func(array, *args, **kwargs)
File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1761, in array_cast
raise TypeError(f"Couldn't cast array of type {array.type} to {pa_type}")
TypeError: Couldn't cast array of type int64 to null
Environment info
-
datasetsversion: 2.4.0 - Platform: Linux-4.15.0-121-generic-x86_64-with-glibc2.27
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.5.0