Error loading StonyBrookNLP/tellmewhy dataset from hub even though local copy loads correctly

Open ykl7 opened this issue 3 years ago • 0 comments

Describe the bug

I added a new dataset with the identifier StonyBrookNLP/tellmewhy to the hub. Loading an individual file from my local copy with dataset = datasets.load_dataset("json", data_files="data/train.jsonl") works correctly. However, loading the dataset from the hub fails with the error pasted below.

Steps to reproduce the bug

dataset = datasets.load_dataset('StonyBrookNLP/tellmewhy')

Expected results

Successfully load the StonyBrookNLP/tellmewhy dataset.

Actual results

Using custom data configuration StonyBrookNLP--tellmewhy-82712924092694ff
Downloading and preparing dataset json/StonyBrookNLP--tellmewhy to /home/yklal95/.cache/huggingface/datasets/StonyBrookNLP___json/StonyBrookNLP--tellmewhy-82712924092694ff/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253...
Downloading data files: 100%|██████████████████████████████| 3/3 [00:00<00:00, 957.46it/s]
Extracting data files: 100%|███████████████████████████████| 3/3 [00:00<00:00, 299.14it/s]
Traceback (most recent call last):
  File "/home/yklal95/tmw-generalization/src/load_datasets.py", line 17, in <module>
    main(args)
  File "/home/yklal95/tmw-generalization/src/load_datasets.py", line 11, in main
    dataset = datasets.load_dataset(args.dataset_name)
  File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/load.py", line 1746, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/builder.py", line 704, in download_and_prepare
    self._download_and_prepare(
  File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/builder.py", line 1277, in _prepare_split
    writer.write_table(table)
  File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/arrow_writer.py", line 524, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 2005, in table_cast
    return cast_table_to_schema(table, schema)
  File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1969, in cast_table_to_schema
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1969, in <listcomp>
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1681, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1681, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1822, in cast_array_to_feature
    casted_values = _c(array.values, feature.feature)
  File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1683, in wrapper
    return func(array, *args, **kwargs)
  File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1853, in cast_array_to_feature
    return array_cast(array, feature(), allow_number_to_str=allow_number_to_str)
  File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1683, in wrapper
    return func(array, *args, **kwargs)
  File "/home/yklal95/anaconda3/envs/tmw-generalization/lib/python3.9/site-packages/datasets/table.py", line 1761, in array_cast
    raise TypeError(f"Couldn't cast array of type {array.type} to {pa_type}")
TypeError: Couldn't cast array of type int64 to null
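
The traceback passes through the list branch of cast_array_to_feature (array.values, feature.feature), which suggests that some list-valued field was inferred as holding null elements from one part of the data and then met int64 elements in another. That is only a guess from the log, not a confirmed diagnosis. The sketch below is a standard-library-only check that compares the JSON types seen per field across several JSON Lines files; the helper names (json_type_name, field_types, find_mismatches) are hypothetical and are not part of the datasets library.

```python
import json
from collections import defaultdict


def json_type_name(value):
    """Map a decoded JSON value to a coarse type label."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool is a subclass of int, so test it first
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "list"
    return "object"


def field_types(path):
    """Return {field name: set of type labels} for one JSON Lines file."""
    seen = defaultdict(set)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            for name, value in record.items():
                if isinstance(value, list):
                    # Record element types too: an always-empty list carries
                    # no element type for a schema to be inferred from.
                    labels = {f"list<{json_type_name(v)}>" for v in value}
                    seen[name] |= labels or {"list<empty>"}
                else:
                    seen[name].add(json_type_name(value))
    return dict(seen)


def find_mismatches(paths):
    """Report fields whose observed types differ between the given files."""
    per_file = {str(p): field_types(p) for p in paths}
    all_fields = set().union(*(types.keys() for types in per_file.values()))
    mismatches = {}
    for name in sorted(all_fields):
        by_file = {
            file: sorted(types.get(name, {"missing"}))
            for file, types in per_file.items()
        }
        if len({tuple(labels) for labels in by_file.values()}) > 1:
            mismatches[name] = by_file
    return mismatches
```

Calling find_mismatches on the paths of the three data files in the local copy would list any field that is, for example, always null or always an empty list in one file but holds integers in another. Such a field would be a candidate for the column behind the int64-to-null cast failure.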

Environment info

  • datasets version: 2.4.0
  • Platform: Linux-4.15.0-121-generic-x86_64-with-glibc2.27
  • Python version: 3.9.13
  • PyArrow version: 9.0.0
  • Pandas version: 1.5.0

ykl7, Sep 21 '22 16:09