datasets icon indicating copy to clipboard operation
datasets copied to clipboard

Identical keywords in build_kwargs and config_kwargs lead to TypeError in load_dataset_builder()

Open bablf opened this issue 3 years ago • 9 comments

Describe the bug

In load_dataset_builder(), build_kwargs and config_kwargs can contain the same keywords leading to a TypeError("type object got multiple values for keyword argument "xyz").

I ran into this problem with the keyword: base_path. It might happen with other kwargs as well. I think a quickfix would be

builder_cls = import_main_class(dataset_module.module_path)
builder_kwargs = dataset_module.builder_kwargs
data_files = builder_kwargs.pop("data_files", data_files)
config_name = builder_kwargs.pop("config_name", name)
hash = builder_kwargs.pop("hash")
base_path = builder_kwargs.pop("base_path")

and then pass base_path into builder_cls.

Steps to reproduce the bug

from datasets import load_dataset
load_dataset("rotten_tomatoes", base_path="./sample_data")

Expected results

The docs state: **config_kwargs — Keyword arguments to be passed to the BuilderConfig and used in the DatasetBuilder.

So I would expect to be able to pass the base_path into load_dataset().

Actual results

TypeError("type object got multiple values for keyword argument "base_path").

Environment info

  • datasets version: 2.4.0
  • Platform: macOS-12.5-arm64-arm-64bit
  • Python version: 3.8.9
  • PyArrow version: 9.0.0

bablf avatar Aug 29 '22 14:08 bablf

I am getting similar error - TypeError: type object got multiple values for keyword argument 'name' while following this tutorial. I am getting this error with the dataset-cli test command.

datasets version: 2.4.0

thepurpleowl avatar Aug 29 '22 18:08 thepurpleowl

In my case, this was happening because I defined multiple BuilderConfig for multiple types, but didn't had all the data files that are requierd by those configs.

I think this is different than the original issue by @bablf .

thepurpleowl avatar Sep 01 '22 18:09 thepurpleowl

Hi ! I think this can be fixed by letting the config_kwargs take over the builder kwargs here:

https://github.com/huggingface/datasets/blob/7feeb5648a63b6135a8259dedc3b1e19185ee4c7/src/datasets/load.py#L1533-L1534

maybe something like this ?

    **{**builder_kwargs, **config_kwargs}

Let me know if you'd like to contribute and fix this bug, so I can assign you :)

In my case, this was happening because I defined multiple BuilderConfig for multiple types, but didn't had all the data files that are requierd by those configs.

I think this is different than the original issue by @bablf .

Feel free to to open an new issue, I'd be happy to help

lhoestq avatar Sep 06 '22 10:09 lhoestq

@lhoestq Yeah, I want to, please assign.

thepurpleowl avatar Sep 06 '22 14:09 thepurpleowl

Cool thank you ! Let me know if you have questions or if I can help

lhoestq avatar Sep 06 '22 14:09 lhoestq

@lhoestq On second thoughts, I think this might be expected behavior; although a better error message might help.

Reasoning: Given n configs, if no data file is provided for any config, then it should be an error. Then why it should not be the case if out of n configs, for some data files are provided but not for others. Also, I was using --all_configs flag with dataset-cli test.

thepurpleowl avatar Sep 12 '22 08:09 thepurpleowl

Ok I see - maybe we should check the values of builder_kwargs raise an error if any key in config_kwargs tries to overwrite it ? The builder kwargs are determined from the builder's type and location (in some cases it forces the base_path, data_files and config name for example)

lhoestq avatar Sep 13 '22 11:09 lhoestq