DatasetInfo issue when testing multiple configs: mixed task_templates
Describe the bug
When running datasets-cli test, some config properties in a DatasetInfo seem to get mangled, leading to errors, e.g., about the ClassLabel.
Steps to reproduce the bug
In summary, what I want to do is create three configs:
- unfiltered: no ClassLabel, no tasks. Gets data from unfiltered.jsonl.gz (I'd want this without splits, just one chunk of data, but that does not seem possible?)
- filtered_sentiment: review_sentiment as ClassLabel, TextClassification task with review_sentiment as label. Gets train/test split from the respective jsonl.gz files
- filtered_rating: review_rating0 as ClassLabel, TextClassification task with review_rating0 as label. Gets train/test split from the respective jsonl.gz files
This might be a bit tedious to reproduce, so I am sorry, but these are the steps:
- Clone datasets -> datasets/ and install it
- Clone https://huggingface.co/datasets/BramVanroy/hebban-reviews into datasets/datasets so that you have a new folder datasets/datasets/hebban-reviews/.
- Replace the HebbanReviews class with this new one:
class HebbanReviews(datasets.GeneratorBasedBuilder):
    """The Hebban book reviews dataset."""

    BUILDER_CONFIGS = [
        HebbanReviewsConfig(
            name="unfiltered",
            description=_HEBBAN_REVIEWS_UNFILTERED_DESCRIPTION,
            version=datasets.Version(_HEBBAN_VERSION)
        ),
        HebbanReviewsConfig(
            name="filtered_sentiment",
            description=f"This config has the negative, neutral, and positive sentiment scores as ClassLabel in the 'review_sentiment' column.\n{_HEBBAN_REVIEWS_FILTERED_DESCRIPTION}",
            version=datasets.Version(_HEBBAN_VERSION)
        ),
        HebbanReviewsConfig(
            name="filtered_rating",
            description=f"This config has the 5-class ratings as ClassLabel in the 'review_rating0' column (which is a variant of 'review_rating' that starts counting from 0 instead of 1).\n{_HEBBAN_REVIEWS_FILTERED_DESCRIPTION}",
            version=datasets.Version(_HEBBAN_VERSION)
        )
    ]

    DEFAULT_CONFIG_NAME = "filtered_sentiment"

    _URLS = {
        "train": "train.jsonl.gz",
        "test": "test.jsonl.gz",
        "unfiltered": "unfiltered.jsonl.gz",
    }

    def _info(self):
        features = {
            "review_title": datasets.Value("string"),
            "review_text": datasets.Value("string"),
            "review_text_without_quotes": datasets.Value("string"),
            "review_n_quotes": datasets.Value("int32"),
            "review_n_tokens": datasets.Value("int32"),
            "review_rating": datasets.Value("int32"),
            "review_rating0": datasets.Value("int32"),
            "review_author_url": datasets.Value("string"),
            "review_author_type": datasets.Value("string"),
            "review_n_likes": datasets.Value("int32"),
            "review_n_comments": datasets.Value("int32"),
            "review_url": datasets.Value("string"),
            "review_published_date": datasets.Value("string"),
            "review_crawl_date": datasets.Value("string"),
            "lid": datasets.Value("string"),
            "lid_probability": datasets.Value("float32"),
            "review_sentiment": datasets.features.ClassLabel(names=["negative", "neutral", "positive"]),
            "review_sentiment_label": datasets.Value("string"),
            "book_id": datasets.Value("int32"),
        }

        if self.config.name == "filtered_sentiment":
            task_templates = [datasets.TextClassification(text_column="review_text_without_quotes", label_column="review_sentiment")]
        elif self.config.name == "filtered_rating":
            # For CrossEntropy, our classes need to start at index 0 -- not 1
            features["review_rating0"] = datasets.features.ClassLabel(names=["1", "2", "3", "4", "5"])
            features["review_sentiment"] = datasets.Value("int32")
            task_templates = [datasets.TextClassification(text_column="review_text_without_quotes", label_column="review_rating0")]
        elif self.config.name == "unfiltered":  # no ClassLabels in unfiltered
            features["review_sentiment"] = datasets.Value("int32")
            task_templates = None
        else:
            raise ValueError(f"Unsupported config {self.config.name}. Expected one of 'filtered_sentiment' (default),"
                             f" 'filtered_rating', or 'unfiltered'")

        print("AT INFO", self.config.name, task_templates)

        return datasets.DatasetInfo(
            description=self.config.description,
            features=datasets.Features(features),
            homepage="https://huggingface.co/datasets/BramVanroy/hebban-reviews",
            citation=_HEBBAN_REVIEWS_CITATION,
            task_templates=task_templates,
            license="cc-by-4.0"
        )

    def _split_generators(self, dl_manager):
        if self.config.name.startswith("filtered"):
            files = dl_manager.download_and_extract({"train": "train.jsonl.gz",
                                                     "test": "test.jsonl.gz"})
            return [
                datasets.SplitGenerator(
                    name=datasets.Split.TRAIN,
                    gen_kwargs={
                        "data_file": files["train"]
                    },
                ),
                datasets.SplitGenerator(
                    name=datasets.Split.TEST,
                    gen_kwargs={
                        "data_file": files["test"]
                    },
                ),
            ]
        elif self.config.name == "unfiltered":
            files = dl_manager.download_and_extract({"train": "unfiltered.jsonl.gz"})
            return [
                datasets.SplitGenerator(
                    name=datasets.Split.TRAIN,
                    gen_kwargs={
                        "data_file": files["train"]
                    },
                ),
            ]
        else:
            raise ValueError(f"Unsupported config {self.config.name}. Expected one of 'filtered_sentiment' (default),"
                             f" 'filtered_rating', or 'unfiltered'")

    def _generate_examples(self, data_file):
        lines = Path(data_file).open(encoding="utf-8").readlines()
        for line_idx, line in enumerate(lines):
            row = json.loads(line)
            yield line_idx, row
- Finally, run datasets-cli test ./datasets/hebban-reviews/ --save_infos --all_configs from within the topmost datasets directory
Expected results
Succeeding tests for three different configs.
Actual results
I printed out the values that are given to DatasetInfo for config name and task_templates, as you can see. There, as expected, I get unfiltered None. I also modified datasets/info.py and added this line at L.170:
print("INTERNALLY AT INFO.PY", self.config_name, self.task_templates)
To my surprise, here I get unfiltered [TextClassification(task='text-classification', text_column='review_text_without_quotes', label_column='review_sentiment')]. So somehow unfiltered now does have a task_template at this point, even though the data loading script does not set one, as the first print statement correctly shows.
I do not quite understand how, but it seems that the config name and task_templates get mixed up.
This ultimately leads to the following error, but this trace may not be very useful in itself:
Traceback (most recent call last):
File "C:\Users\bramv\.virtualenvs\hebban-U6poXNQd\Scripts\datasets-cli-script.py", line 33, in <module>
sys.exit(load_entry_point('datasets', 'console_scripts', 'datasets-cli')())
File "c:\dev\python\hebban\datasets\src\datasets\commands\datasets_cli.py", line 39, in main
service.run()
File "c:\dev\python\hebban\datasets\src\datasets\commands\test.py", line 144, in run
builder.as_dataset()
File "c:\dev\python\hebban\datasets\src\datasets\builder.py", line 899, in as_dataset
datasets = map_nested(
File "c:\dev\python\hebban\datasets\src\datasets\utils\py_utils.py", line 393, in map_nested
mapped = [
File "c:\dev\python\hebban\datasets\src\datasets\utils\py_utils.py", line 394, in <listcomp>
_single_map_nested((function, obj, types, None, True, None))
File "c:\dev\python\hebban\datasets\src\datasets\utils\py_utils.py", line 330, in _single_map_nested
return function(data_struct)
File "c:\dev\python\hebban\datasets\src\datasets\builder.py", line 930, in _build_single_dataset
ds = self._as_dataset(
File "c:\dev\python\hebban\datasets\src\datasets\builder.py", line 1006, in _as_dataset
return Dataset(fingerprint=fingerprint, **dataset_kwargs)
File "c:\dev\python\hebban\datasets\src\datasets\arrow_dataset.py", line 661, in __init__
info = info.copy() if info is not None else DatasetInfo()
File "c:\dev\python\hebban\datasets\src\datasets\info.py", line 286, in copy
return self.__class__(**{k: copy.deepcopy(v) for k, v in self.__dict__.items()})
File "<string>", line 20, in __init__
File "c:\dev\python\hebban\datasets\src\datasets\info.py", line 176, in __post_init__
self.task_templates = [
File "c:\dev\python\hebban\datasets\src\datasets\info.py", line 177, in <listcomp>
template.align_with_features(self.features) for template in (self.task_templates)
File "c:\dev\python\hebban\datasets\src\datasets\tasks\text_classification.py", line 22, in align_with_features
raise ValueError(f"Column {self.label_column} is not a ClassLabel.")
ValueError: Column review_sentiment is not a ClassLabel.
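To make the failure mode concrete, here is a minimal, self-contained sketch of the kind of check that raises in the traceback above. It is a toy stand-in and not the library's implementation: ToyValue, ToyClassLabel and ToyTextClassification are hypothetical names, and only the error message is taken from the traceback.

```python
from dataclasses import dataclass


@dataclass
class ToyValue:
    """Hypothetical stand-in for a plain typed column."""
    dtype: str


@dataclass
class ToyClassLabel:
    """Hypothetical stand-in for a categorical label column."""
    names: list


@dataclass
class ToyTextClassification:
    """Hypothetical stand-in for a text classification task template."""
    text_column: str
    label_column: str

    def align_with_features(self, features):
        # The template is only valid if its label column is a class label.
        if not isinstance(features[self.label_column], ToyClassLabel):
            raise ValueError(f"Column {self.label_column} is not a ClassLabel.")
        return self


# Features as declared for "unfiltered": review_sentiment is a plain int32.
unfiltered_features = {
    "review_text_without_quotes": ToyValue("string"),
    "review_sentiment": ToyValue("int32"),
}

# A template that belongs to "filtered_sentiment" but ends up on "unfiltered".
stale_template = ToyTextClassification(
    text_column="review_text_without_quotes",
    label_column="review_sentiment",
)

try:
    stale_template.align_with_features(unfiltered_features)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

The same template aligns without error against features where review_sentiment is a class label, which is why the problem only shows up once a template from one config is paired with the features of another.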
Environment info
- datasets version: 2.4.1.dev0
- Platform: Windows-10-10.0.19041-SP0
- Python version: 3.8.8
- PyArrow version: 8.0.0
- Pandas version: 1.4.3
I've narrowed down the issue to the dataset_module_factory which already creates a dataset_infos.json file down in the .cache/modules/dataset_modules/.. folder. That JSON file already contains the wrong task_templates for unfiltered.
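A quick way to check what such a file contains is sketched below. It assumes that dataset_infos.json is a JSON object keyed by config name and that each entry may carry a task_templates key; the helper name and the example path are hypothetical.

```python
import json
from pathlib import Path


def task_templates_per_config(infos_path):
    """Map each config name in an infos JSON file to its stored task_templates.

    Configs without a "task_templates" key map to None.
    """
    infos = json.loads(Path(infos_path).read_text(encoding="utf-8"))
    return {name: info.get("task_templates") for name, info in infos.items()}


if __name__ == "__main__":
    # Hypothetical location; point this at the file you want to inspect.
    path = Path("datasets/hebban-reviews/dataset_infos.json")
    if path.exists():
        for config_name, templates in task_templates_per_config(path).items():
            print(config_name, templates)
```

If unfiltered shows a template here while the loading script sets task_templates to None, the file is out of date with respect to the script.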
Ugh. Found the issue: apparently datasets was reusing the already existing dataset_infos.json that is inside datasets/datasets/hebban-reviews! Is this desired behavior?
Perhaps when --save_infos and --all_configs are given, an existing dataset_infos.json file should first be deleted before continuing with the test? Those flags together imply that the user wants to create a new infos file for all configs anyway.
Hi! I think this is a reasonable solution. Would you be interested in submitting a PR?
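The proposal discussed above could look roughly like the sketch below. This is not the change that was made in datasets; remove_stale_infos and its parameters are hypothetical, and the only names taken from the thread are the dataset_infos.json filename and the two flags.

```python
from pathlib import Path

INFOS_FILENAME = "dataset_infos.json"


def remove_stale_infos(dataset_dir, save_infos, all_configs):
    """Delete an existing infos file when it is about to be regenerated in full.

    Returns True if a file was removed, False otherwise.
    """
    infos_path = Path(dataset_dir) / INFOS_FILENAME
    # Only delete when every config will be rewritten, so no info is lost.
    if save_infos and all_configs and infos_path.is_file():
        infos_path.unlink()
        return True
    return False
```

Until something like this exists in the CLI, deleting datasets/datasets/hebban-reviews/dataset_infos.json by hand before running the test should have the same effect.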