Identical keywords in build_kwargs and config_kwargs lead to TypeError in load_dataset_builder()
Describe the bug
In load_dataset_builder(), build_kwargs and config_kwargs can contain the same keywords leading to a TypeError("type object got multiple values for keyword argument "xyz").
I ran into this problem with the keyword: base_path. It might happen with other kwargs as well. I think a quickfix would be
builder_cls = import_main_class(dataset_module.module_path)
builder_kwargs = dataset_module.builder_kwargs
data_files = builder_kwargs.pop("data_files", data_files)
config_name = builder_kwargs.pop("config_name", name)
hash = builder_kwargs.pop("hash")
base_path = builder_kwargs.pop("base_path")
and then pass base_path into builder_cls.
Steps to reproduce the bug
from datasets import load_dataset
load_dataset("rotten_tomatoes", base_path="./sample_data")
Expected results
The docs state: **config_kwargs — Keyword arguments to be passed to the BuilderConfig and used in the DatasetBuilder.
So I would expect to be able to pass the base_path into load_dataset().
Actual results
TypeError("type object got multiple values for keyword argument "base_path").
Environment info
-
datasetsversion: 2.4.0 - Platform: macOS-12.5-arm64-arm-64bit
- Python version: 3.8.9
- PyArrow version: 9.0.0
I am getting similar error - TypeError: type object got multiple values for keyword argument 'name' while following this tutorial. I am getting this error with the dataset-cli test command.
datasets version: 2.4.0
In my case, this was happening because I defined multiple BuilderConfig for multiple types, but didn't had all the data files that are requierd by those configs.
I think this is different than the original issue by @bablf .
Hi ! I think this can be fixed by letting the config_kwargs take over the builder kwargs here:
https://github.com/huggingface/datasets/blob/7feeb5648a63b6135a8259dedc3b1e19185ee4c7/src/datasets/load.py#L1533-L1534
maybe something like this ?
**{**builder_kwargs, **config_kwargs}
Let me know if you'd like to contribute and fix this bug, so I can assign you :)
In my case, this was happening because I defined multiple BuilderConfig for multiple types, but didn't had all the data files that are requierd by those configs.
I think this is different than the original issue by @bablf .
Feel free to to open an new issue, I'd be happy to help
@lhoestq Yeah, I want to, please assign.
Cool thank you ! Let me know if you have questions or if I can help
@lhoestq On second thoughts, I think this might be expected behavior; although a better error message might help.
Reasoning: Given n configs, if no data file is provided for any config, then it should be an error. Then why it should not be the case if out of n configs, for some data files are provided but not for others. Also, I was using --all_configs flag with dataset-cli test.
Ok I see - maybe we should check the values of builder_kwargs raise an error if any key in config_kwargs tries to overwrite it ? The builder kwargs are determined from the builder's type and location (in some cases it forces the base_path, data_files and config name for example)