
MindSmallReranking dataset errors out due to download/parsing error?

Open w601sxs opened this issue 1 year ago • 5 comments

Tested with the latest mteb release on PyPI (from 4 hours ago) as well as an installation from the latest git main branch:

## Evaluating 1 tasks:
───────────────────────────────────────────────── Selected tasks  ─────────────────────────────────────────────────
Reranking
    - MindSmallReranking, s2s

INFO:mteb.evaluation.MTEB:

********************** Evaluating MindSmallReranking **********************
INFO:mteb.evaluation.MTEB:Loading dataset for MindSmallReranking
Repo card metadata block was not found. Setting CardData to empty.
WARNING:huggingface_hub.repocard:Repo card metadata block was not found. Setting CardData to empty.
Failed to read file 'gzip://7a742da40ba0425a72301598ce27d63296c468da48cd98c4ae479b1d88a755a8::/root/.cache/huggingface/datasets/downloads/7a742da40ba0425a72301598ce27d63296c468da48cd98c4ae479b1d88a755a8' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid value. in row 0
ERROR:datasets.packaged_modules.json.json:Failed to read file 'gzip://7a742da40ba0425a72301598ce27d63296c468da48cd98c4ae479b1d88a755a8::/root/.cache/huggingface/datasets/downloads/7a742da40ba0425a72301598ce27d63296c468da48cd98c4ae479b1d88a755a8' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid value. in row 0
ERROR:mteb.evaluation.MTEB:Error while evaluating MindSmallReranking: An error occurred while generating the dataset
---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py:145, in Json._generate_tables(self, files)
    142     with open(
    143         file, encoding=self.config.encoding, errors=self.config.encoding_errors
    144     ) as f:
--> 145         dataset = json.load(f)
    146 except json.JSONDecodeError:

File /opt/conda/lib/python3.10/json/__init__.py:293, in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    276 """Deserialize ``fp`` (a ``.read()``-supporting file-like object containing
    277 a JSON document) to a Python object.
    278 
   (...)
    291 kwarg; otherwise ``JSONDecoder`` is used.
    292 """
--> 293 return loads(fp.read(),
    294     cls=cls, object_hook=object_hook,
    295     parse_float=parse_float, parse_int=parse_int,
    296     parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)

File /opt/conda/lib/python3.10/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    343 if (cls is None and object_hook is None and
    344         parse_int is None and parse_float is None and
    345         parse_constant is None and object_pairs_hook is None and not kw):
--> 346     return _default_decoder.decode(s)
    347 if cls is None:

File /opt/conda/lib/python3.10/json/decoder.py:337, in JSONDecoder.decode(self, s, _w)
    333 """Return the Python representation of ``s`` (a ``str`` instance
    334 containing a JSON document).
    335 
    336 """
--> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338 end = _w(s, end).end()

File /opt/conda/lib/python3.10/json/decoder.py:355, in JSONDecoder.raw_decode(self, s, idx)
    354 except StopIteration as err:
--> 355     raise JSONDecodeError("Expecting value", s, err.value) from None
    356 return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

ArrowInvalid                              Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/datasets/builder.py:1995, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1994 _time = time.time()
-> 1995 for _, table in generator:
   1996     if max_shard_size is not None and writer._num_bytes > max_shard_size:

File /opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py:148, in Json._generate_tables(self, files)
    147     logger.error(f"Failed to read file '{file}' with error {type(e)}: {e}")
--> 148     raise e
    149 # If possible, parse the file as a list of json objects/strings and exit the loop

File /opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py:122, in Json._generate_tables(self, files)
    121 try:
--> 122     pa_table = paj.read_json(
    123         io.BytesIO(batch), read_options=paj.ReadOptions(block_size=block_size)
    124     )
    125     break

File /opt/conda/lib/python3.10/site-packages/pyarrow/_json.pyx:308, in pyarrow._json.read_json()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: JSON parse error: Invalid value. in row 0

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
Cell In[2], line 118
    114 eval_splits = ["dev"] if task == "MSMARCO" else ["test"]
    115 evaluation = MTEB(
    116     tasks=[task], task_langs=["en"]
    117 )  # Remove "en" for running all languages
--> 118 evaluation.run(
    119     model, output_folder=f"results/{model_name}", eval_splits=eval_splits
    120 )

File /opt/conda/lib/python3.10/site-packages/mteb/evaluation/MTEB.py:356, in MTEB.run(self, model, verbosity, output_folder, eval_splits, overwrite_results, raise_error, co2_tracker, **kwargs)
    352 logger.error(
    353     f"Error while evaluating {task.metadata_dict['name']}: {e}"
    354 )
    355 if raise_error:
--> 356     raise e
    357 logger.error(
    358     f"Please check all the error logs at: {self.err_logs_path}"
    359 )
    360 with open(self.err_logs_path, "a") as f_out:

File /opt/conda/lib/python3.10/site-packages/mteb/evaluation/MTEB.py:301, in MTEB.run(self, model, verbosity, output_folder, eval_splits, overwrite_results, raise_error, co2_tracker, **kwargs)
    299 logger.info(f"Loading dataset for {task.metadata_dict['name']}")
    300 task.check_if_dataset_is_superseeded()
--> 301 task.load_data(eval_splits=task_eval_splits, **kwargs)
    303 # run evaluation
    304 task_results = {
    305     "mteb_version": version("mteb"),  # noqa: F405
    306     "dataset_revision": task.metadata_dict["dataset"].get(
   (...)
    309     "mteb_dataset_name": task.metadata_dict["name"],
    310 }

File /opt/conda/lib/python3.10/site-packages/mteb/abstasks/AbsTask.py:85, in AbsTask.load_data(self, **kwargs)
     83 if self.data_loaded:
     84     return
---> 85 self.dataset = datasets.load_dataset(**self.metadata_dict["dataset"])  # type: ignore
     86 self.dataset_transform()
     87 self.data_loaded = True

File /opt/conda/lib/python3.10/site-packages/datasets/load.py:2609, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2606     return builder_instance.as_streaming_dataset(split=split)
   2608 # Download and prepare data
-> 2609 builder_instance.download_and_prepare(
   2610     download_config=download_config,
   2611     download_mode=download_mode,
   2612     verification_mode=verification_mode,
   2613     num_proc=num_proc,
   2614     storage_options=storage_options,
   2615 )
   2617 # Build dataset for splits
   2618 keep_in_memory = (
   2619     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   2620 )

File /opt/conda/lib/python3.10/site-packages/datasets/builder.py:1027, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
   1025     if num_proc is not None:
   1026         prepare_split_kwargs["num_proc"] = num_proc
-> 1027     self._download_and_prepare(
   1028         dl_manager=dl_manager,
   1029         verification_mode=verification_mode,
   1030         **prepare_split_kwargs,
   1031         **download_and_prepare_kwargs,
   1032     )
   1033 # Sync info
   1034 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File /opt/conda/lib/python3.10/site-packages/datasets/builder.py:1122, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
   1118 split_dict.add(split_generator.split_info)
   1120 try:
   1121     # Prepare split will record examples associated to the split
-> 1122     self._prepare_split(split_generator, **prepare_split_kwargs)
   1123 except OSError as e:
   1124     raise OSError(
   1125         "Cannot find data file. "
   1126         + (self.manual_download_instructions or "")
   1127         + "\nOriginal error:\n"
   1128         + str(e)
   1129     ) from None

File /opt/conda/lib/python3.10/site-packages/datasets/builder.py:1882, in ArrowBasedBuilder._prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
   1880 job_id = 0
   1881 with pbar:
-> 1882     for job_id, done, content in self._prepare_split_single(
   1883         gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1884     ):
   1885         if done:
   1886             result = content

File /opt/conda/lib/python3.10/site-packages/datasets/builder.py:2038, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   2036     if isinstance(e, DatasetGenerationError):
   2037         raise
-> 2038     raise DatasetGenerationError("An error occurred while generating the dataset") from e
   2040 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

I also tried deleting the Hugging Face .cache downloads and the mteb folder for MindSmallReranking.
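
For reference, a minimal script that reproduces the failure (the embedding model below is a placeholder; the error is raised while the dataset is being loaded, before any encoding happens):

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Placeholder model: any SentenceTransformer-compatible model triggers
# the same dataset download/parsing path.
model = SentenceTransformer("all-MiniLM-L6-v2")

evaluation = MTEB(tasks=["MindSmallReranking"], task_langs=["en"])
evaluation.run(model, output_folder="results/debug", eval_splits=["test"])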

w601sxs · May 16 '24 18:05

Hello!

Thank you for reporting this bug; we're investigating and will be back soon with a fix.

imenelydiaker · May 16 '24 18:05

So it seems to be a problem with the datasets library: we upgraded to 2.19 recently, and that upgrade is causing this error. I downgraded to 2.17 and to 2.18, and both worked.

I'm trying to find a better solution than downgrading, since we need at least version 2.19 for other tasks.
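
In the meantime, pinning the dependency (e.g. pip install "datasets<2.19"; the exact bound is my reading of the versions tested above) should unblock anyone hitting this.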

imenelydiaker · May 16 '24 19:05

@loicmagne should we report this error to the datasets maintainers?

The only solution I see at the moment is to change the mind_small format on our end to Parquet, or to anything else that doesn't use JSONL. @Muennighoff any thoughts?
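
For illustration, the conversion with the datasets API would look roughly like this (file names are placeholders, not the actual dataset shards):

from datasets import load_dataset

# Read one JSONL shard and re-serialize it as Parquet.
ds = load_dataset("json", data_files="test_data.jsonl")["train"]
ds.to_parquet("test_data.parquet")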

imenelydiaker · May 16 '24 19:05

I can confirm datasets==2.18 is working

w601sxs · May 16 '24 19:05

> @loicmagne should we report this error to the datasets maintainers?
>
> The only solution I see at the moment is to change the mind_small format on our end to Parquet, or to anything else that doesn't use JSONL. @Muennighoff any thoughts?

I don't know why it broke now; we can definitely open an issue on the HF repo to notify them.

It looks like the error comes from the .gz format: I converted the files to .zip and they load correctly: https://huggingface.co/datasets/loicmagne/mind_small
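
The repack itself is only a few lines; a minimal sketch of the idea (file names are placeholders):

import gzip
import zipfile

# Decompress the gzipped JSONL, then store it in a zip archive so that
# datasets goes through its zip codepath instead of gzip.
with gzip.open("test.jsonl.gz", "rb") as f:
    payload = f.read()

with zipfile.ZipFile("test.jsonl.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("test.jsonl", payload)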

I think that should do it for now, @imenelydiaker?

loicmagne · May 16 '24 20:05

Marking this as completed, as the PR has been merged.

isaac-chung · May 17 '24 07:05