MindSmallReranking dataset errors out due to download/parsing error?
Tested with the latest mteb on PyPI, released 4 hours ago, as well as an installation from the latest git main branch:
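A minimal script that reproduces this, mirroring the evaluation call in the traceback below (the model is an arbitrary placeholder; the failure happens while loading the dataset, before any encoding runs):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any SentenceTransformer should do here, since the error occurs in
# task.load_data() before the model is ever used.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)

evaluation = MTEB(tasks=["MindSmallReranking"], task_langs=["en"])
evaluation.run(model, output_folder=f"results/{model_name}", eval_splits=["test"])
```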
## Evaluating 1 tasks:
───────────────────────────────────────────────── Selected tasks ─────────────────────────────────────────────────
Reranking
- MindSmallReranking, s2s
INFO:mteb.evaluation.MTEB:
********************** Evaluating MindSmallReranking **********************
INFO:mteb.evaluation.MTEB:Loading dataset for MindSmallReranking
Repo card metadata block was not found. Setting CardData to empty.
WARNING:huggingface_hub.repocard:Repo card metadata block was not found. Setting CardData to empty.
Failed to read file 'gzip://7a742da40ba0425a72301598ce27d63296c468da48cd98c4ae479b1d88a755a8::/root/.cache/huggingface/datasets/downloads/7a742da40ba0425a72301598ce27d63296c468da48cd98c4ae479b1d88a755a8' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid value. in row 0
ERROR:datasets.packaged_modules.json.json:Failed to read file 'gzip://7a742da40ba0425a72301598ce27d63296c468da48cd98c4ae479b1d88a755a8::/root/.cache/huggingface/datasets/downloads/7a742da40ba0425a72301598ce27d63296c468da48cd98c4ae479b1d88a755a8' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid value. in row 0
ERROR:mteb.evaluation.MTEB:Error while evaluating MindSmallReranking: An error occurred while generating the dataset
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py:145, in Json._generate_tables(self, files)
142 with open(
143 file, encoding=self.config.encoding, errors=self.config.encoding_errors
144 ) as f:
--> 145 dataset = json.load(f)
146 except json.JSONDecodeError:
File /opt/conda/lib/python3.10/json/__init__.py:293, in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
276 """Deserialize ``fp`` (a ``.read()``-supporting file-like object containing
277 a JSON document) to a Python object.
278
(...)
291 kwarg; otherwise ``JSONDecoder`` is used.
292 """
--> 293 return loads(fp.read(),
294 cls=cls, object_hook=object_hook,
295 parse_float=parse_float, parse_int=parse_int,
296 parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File /opt/conda/lib/python3.10/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
343 if (cls is None and object_hook is None and
344 parse_int is None and parse_float is None and
345 parse_constant is None and object_pairs_hook is None and not kw):
--> 346 return _default_decoder.decode(s)
347 if cls is None:
File /opt/conda/lib/python3.10/json/decoder.py:337, in JSONDecoder.decode(self, s, _w)
333 """Return the Python representation of ``s`` (a ``str`` instance
334 containing a JSON document).
335
336 """
--> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
338 end = _w(s, end).end()
File /opt/conda/lib/python3.10/json/decoder.py:355, in JSONDecoder.raw_decode(self, s, idx)
354 except StopIteration as err:
--> 355 raise JSONDecodeError("Expecting value", s, err.value) from None
356 return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
ArrowInvalid Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/datasets/builder.py:1995, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
1994 _time = time.time()
-> 1995 for _, table in generator:
1996 if max_shard_size is not None and writer._num_bytes > max_shard_size:
File /opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py:148, in Json._generate_tables(self, files)
147 logger.error(f"Failed to read file '{file}' with error {type(e)}: {e}")
--> 148 raise e
149 # If possible, parse the file as a list of json objects/strings and exit the loop
File /opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py:122, in Json._generate_tables(self, files)
121 try:
--> 122 pa_table = paj.read_json(
123 io.BytesIO(batch), read_options=paj.ReadOptions(block_size=block_size)
124 )
125 break
File /opt/conda/lib/python3.10/site-packages/pyarrow/_json.pyx:308, in pyarrow._json.read_json()
File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()
File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowInvalid: JSON parse error: Invalid value. in row 0
The above exception was the direct cause of the following exception:
DatasetGenerationError Traceback (most recent call last)
Cell In[2], line 118
114 eval_splits = ["dev"] if task == "MSMARCO" else ["test"]
115 evaluation = MTEB(
116 tasks=[task], task_langs=["en"]
117 ) # Remove "en" for running all languages
--> 118 evaluation.run(
119 model, output_folder=f"results/{model_name}", eval_splits=eval_splits
120 )
File /opt/conda/lib/python3.10/site-packages/mteb/evaluation/MTEB.py:356, in MTEB.run(self, model, verbosity, output_folder, eval_splits, overwrite_results, raise_error, co2_tracker, **kwargs)
352 logger.error(
353 f"Error while evaluating {task.metadata_dict['name']}: {e}"
354 )
355 if raise_error:
--> 356 raise e
357 logger.error(
358 f"Please check all the error logs at: {self.err_logs_path}"
359 )
360 with open(self.err_logs_path, "a") as f_out:
File /opt/conda/lib/python3.10/site-packages/mteb/evaluation/MTEB.py:301, in MTEB.run(self, model, verbosity, output_folder, eval_splits, overwrite_results, raise_error, co2_tracker, **kwargs)
299 logger.info(f"Loading dataset for {task.metadata_dict['name']}")
300 task.check_if_dataset_is_superseeded()
--> 301 task.load_data(eval_splits=task_eval_splits, **kwargs)
303 # run evaluation
304 task_results = {
305 "mteb_version": version("mteb"), # noqa: F405
306 "dataset_revision": task.metadata_dict["dataset"].get(
(...)
309 "mteb_dataset_name": task.metadata_dict["name"],
310 }
File /opt/conda/lib/python3.10/site-packages/mteb/abstasks/AbsTask.py:85, in AbsTask.load_data(self, **kwargs)
83 if self.data_loaded:
84 return
---> 85 self.dataset = datasets.load_dataset(**self.metadata_dict["dataset"]) # type: ignore
86 self.dataset_transform()
87 self.data_loaded = True
File /opt/conda/lib/python3.10/site-packages/datasets/load.py:2609, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
2606 return builder_instance.as_streaming_dataset(split=split)
2608 # Download and prepare data
-> 2609 builder_instance.download_and_prepare(
2610 download_config=download_config,
2611 download_mode=download_mode,
2612 verification_mode=verification_mode,
2613 num_proc=num_proc,
2614 storage_options=storage_options,
2615 )
2617 # Build dataset for splits
2618 keep_in_memory = (
2619 keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
2620 )
File /opt/conda/lib/python3.10/site-packages/datasets/builder.py:1027, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
1025 if num_proc is not None:
1026 prepare_split_kwargs["num_proc"] = num_proc
-> 1027 self._download_and_prepare(
1028 dl_manager=dl_manager,
1029 verification_mode=verification_mode,
1030 **prepare_split_kwargs,
1031 **download_and_prepare_kwargs,
1032 )
1033 # Sync info
1034 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())
File /opt/conda/lib/python3.10/site-packages/datasets/builder.py:1122, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
1118 split_dict.add(split_generator.split_info)
1120 try:
1121 # Prepare split will record examples associated to the split
-> 1122 self._prepare_split(split_generator, **prepare_split_kwargs)
1123 except OSError as e:
1124 raise OSError(
1125 "Cannot find data file. "
1126 + (self.manual_download_instructions or "")
1127 + "\nOriginal error:\n"
1128 + str(e)
1129 ) from None
File /opt/conda/lib/python3.10/site-packages/datasets/builder.py:1882, in ArrowBasedBuilder._prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
1880 job_id = 0
1881 with pbar:
-> 1882 for job_id, done, content in self._prepare_split_single(
1883 gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
1884 ):
1885 if done:
1886 result = content
File /opt/conda/lib/python3.10/site-packages/datasets/builder.py:2038, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
2036 if isinstance(e, DatasetGenerationError):
2037 raise
-> 2038 raise DatasetGenerationError("An error occurred while generating the dataset") from e
2040 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)
DatasetGenerationError: An error occurred while generating the dataset
I also tried deleting the Hugging Face .cache downloads and the mteb folder for MindSmallReranking; the error persists.
Hello!
Thank you for reporting this bug; we're investigating and will be back soon with a fix.
So it seems to be a problem with the datasets library: we upgraded to 2.19 recently, and that upgrade is causing this error. I downgraded to 2.17 and to 2.18, and both worked.
I'm trying to find a better solution than downgrading, since we need at least version 2.19 of datasets for other tasks.
@loicmagne should we report this error to the datasets maintainers?
The only solution I see at the moment is to change the mind_small format on our end to parquet, or anything else that doesn't go through jsonl. @Muennighoff any thoughts?
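For illustration, a conversion along those lines might look like the sketch below. The file names are placeholders (the real shards are the gzipped JSONL files on the Hub), and the read step assumes an environment that still parses the gzipped JSONL correctly:

```python
import pandas as pd

# Placeholder path; the actual shard is a gzipped JSONL file on the Hub.
# pandas infers gzip compression from the .gz extension.
df = pd.read_json("test.jsonl.gz", lines=True)

# Re-save as Parquet so the datasets library no longer goes through
# its JSON parsing path at load time.
df.to_parquet("test.parquet", index=False)
```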
I can confirm datasets==2.18 is working.
> @loicmagne should we report this error to the datasets maintainers? The only solution I see at the moment is to change the mind_small format on our end to parquet, or anything else that doesn't go through jsonl. @Muennighoff any thoughts?
I don't know why it broke now, but we can definitely open an issue on the hf repo to notify them.
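As an illustration of the failure class in the traceback (a sketch only, not a confirmed diagnosis of this particular bug): pyarrow's JSON reader rejects bytes that are still gzip-compressed, which matches the gzip:// path in the error message. The record's field names below are made up for the example:

```python
import gzip
import io

import pyarrow.json as paj

# A single JSONL record; the field names are illustrative only.
line = b'{"query": "q", "positive": ["a"], "negative": ["b"]}\n'

# Plain JSONL bytes parse fine.
print(paj.read_json(io.BytesIO(line)).num_rows)  # -> 1

# Still-compressed gzip bytes do not: the gzip magic bytes are not valid
# JSON, so pyarrow raises ArrowInvalid ("JSON parse error: Invalid value.").
try:
    paj.read_json(io.BytesIO(gzip.compress(line)))
except Exception as e:
    print(type(e).__name__, e)
```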
It looks like the error comes from the .gz format. I tried converting to .zip and it works correctly: https://huggingface.co/datasets/loicmagne/mind_small
I think that will do for now, @imenelydiaker?
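For reference, a repack along the lines of that .gz-to-.zip conversion could be done as follows (file names are hypothetical; the real shards are the dataset files on the Hub):

```python
import gzip
import zipfile

# Hypothetical local file names standing in for the Hub shards.
with gzip.open("test.jsonl.gz", "rb") as f:
    payload = f.read()

# Repack the decompressed JSONL into a zip archive, which the
# datasets loader handled correctly in the linked repo.
with zipfile.ZipFile("test.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("test.jsonl", payload)
```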
Marking this as completed now that the PR is merged.