Cannot load dataset, fails with nested data conversions not implemented for chunked array outputs
Describe the bug
Hi! When I load this dataset, it fails with a pyarrow error. I'm using datasets 4.1.1, though I also see this with datasets 4.1.2.
To reproduce:
import datasets
ds = datasets.load_dataset(path="metr-evals/malt-public", name="irrelevant_detail")
Error:
Traceback (most recent call last):
File "/Users/neev/scratch/.venv/lib/python3.13/site-packages/datasets/builder.py", line 1815, in _prepare_split_single
for _, table in generator:
^^^^^^^^^
File "/Users/neev/scratch/.venv/lib/python3.13/site-packages/datasets/packaged_modules/parquet/parquet.py", line 93, in _generate_tables
for batch_idx, record_batch in enumerate(
~~~~~~~~~^
parquet_fragment.to_batches(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<5 lines>...
)
^
):
^
File "pyarrow/_dataset.pyx", line 3904, in _iterator
File "pyarrow/_dataset.pyx", line 3494, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/neev/scratch/test_hf.py", line 3, in <module>
ds = datasets.load_dataset(path="metr-evals/malt-public", name="irrelevant_detail")
File "/Users/neev/scratch/.venv/lib/python3.13/site-packages/datasets/load.py", line 1412, in load_dataset
builder_instance.download_and_prepare(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
download_config=download_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<3 lines>...
storage_options=storage_options,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/Users/neev/scratch/.venv/lib/python3.13/site-packages/datasets/builder.py", line 894, in download_and_prepare
self._download_and_prepare(
~~~~~~~~~~~~~~~~~~~~~~~~~~^
dl_manager=dl_manager,
^^^^^^^^^^^^^^^^^^^^^^
...<2 lines>...
**download_and_prepare_kwargs,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/Users/neev/scratch/.venv/lib/python3.13/site-packages/datasets/builder.py", line 970, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/neev/scratch/.venv/lib/python3.13/site-packages/datasets/builder.py", line 1702, in _prepare_split
for job_id, done, content in self._prepare_split_single(
~~~~~~~~~~~~~~~~~~~~~~~~~~^
gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
):
^
File "/Users/neev/scratch/.venv/lib/python3.13/site-packages/datasets/builder.py", line 1858, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
Steps to reproduce the bug
To reproduce:
import datasets
ds = datasets.load_dataset(path="metr-evals/malt-public", name="irrelevant_detail")
Expected behavior
The dataset loads successfully.
Environment info
Datasets: 4.1.1
Python: 3.13
Platform: macOS
Hey @neevparikh, thanks for reporting this! I can reproduce the issue and have identified the root cause.

Problem: The metr-evals/malt-public dataset contains deeply nested conversation data that exceeds PyArrow's 16MB chunk limit. When PyArrow tries to read it in chunks, it hits a fundamental limitation: "Nested data conversions not implemented for chunked array outputs".

Root cause: Your dataset has large nested arrays (conversation trees with 4k-87k elements) that get automatically chunked by PyArrow, but the nested data conversion logic can't handle repetition levels across chunk boundaries.

I'm preparing a PR that adds a fallback mechanism to the parquet reader. When this specific error occurs, it will:

- Detect the nested data issue
- Combine chunks selectively for problematic columns
- Continue processing normally
This maintains backward compatibility while fixing the issue for nested datasets like yours.

Workaround (if you need immediate access): try loading with a smaller batch size. The parquet builder accepts a `batch_size` config kwarg, which `load_dataset` forwards:

ds = datasets.load_dataset(
    "metr-evals/malt-public",
    name="irrelevant_detail",
    batch_size=1000,
)