Multi-Config Dataset Loading / Wildcard Config Identifiers

Open MiWeiss opened this issue 5 years ago • 1 comments

Some datasets, e.g. mnist_corrupted, provide various configurations which in many use cases are all used together (without distinction). Afaik, tfds currently requires loading every config independently and then concatenating the results, e.g.

c1_ds = tfds.load("mnist_corrupted/shot_noise")
c2_ds = tfds.load("mnist_corrupted/impulse_noise")
c3_ds = tfds.load("mnist_corrupted/glass_blur")
# ... many, many, many rows :-)
cX_ds = tfds.load("mnist_corrupted/...")

dataset_i_want = some_concat_function((c1_ds, c2_ds, c3_ds, ..., cX_ds))

For datasets with a large number of configs, this snippet can become quite long. It also makes it hard to apply some of the nice features of tfds.load (e.g. shuffle_files or split) across the full dataset.

Describe the solution you'd like

Some way to use wildcards over tfds configs, e.g.,

# Contains the datasets for all configs, nicely shuffled
dataset_i_want = tfds.load("mnist_corrupted/*", shuffle_files=True)

# Contains the shot_noise and impulse_noise config datasets
dataset_i_want = tfds.load("mnist_corrupted/*_noise", shuffle_files=True)

Describe alternatives you've considered

Option 1: Manual concatenation, as shown in the example above.

Option 2: Looping over all BuilderConfigs in the DatasetBuilder class (e.g. MNISTCorrupted.BUILDER_CONFIGS) and manually implementing wildcard pattern matching on the config names. However, I do not think the BUILDER_CONFIGS field is documented, and it is probably not guaranteed to exist on all DatasetBuilders. Also, shuffling and splitting are still hard.
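For reference, Option 2 could look roughly like the sketch below. It is only a workaround under stated assumptions, not an existing tfds feature: the helper names match_config_names and load_matching_configs are made up for illustration, and the code assumes that the builder class returned by tfds.builder_cls exposes a BUILDER_CONFIGS list whose entries have a name attribute (which, as noted above, may not hold for every DatasetBuilder). Pattern matching uses fnmatch from the Python standard library.

```python
from fnmatch import fnmatchcase


def match_config_names(config_names, pattern):
    """Return the config names matching a shell-style wildcard, in input order."""
    return [name for name in config_names if fnmatchcase(name, pattern)]


def load_matching_configs(dataset_name, pattern, split="train", **load_kwargs):
    """Hypothetical helper: load every matching config and concatenate the results.

    Assumes the builder class has a BUILDER_CONFIGS list of objects with a
    `name` attribute; this may not be true for all datasets.
    """
    # Imported lazily so the pattern matching above works without tfds installed.
    import tensorflow_datasets as tfds

    builder_cls = tfds.builder_cls(dataset_name)
    all_names = [config.name for config in builder_cls.BUILDER_CONFIGS]
    matched = match_config_names(all_names, pattern)
    if not matched:
        raise ValueError(f"No config of {dataset_name!r} matches {pattern!r}")

    # One tf.data.Dataset per matching config, all for the same split.
    datasets = [
        tfds.load(f"{dataset_name}/{name}", split=split, **load_kwargs)
        for name in matched
    ]

    # Concatenation places the configs back to back; it does not interleave them.
    combined = datasets[0]
    for dataset in datasets[1:]:
        combined = combined.concatenate(dataset)
    return combined


if __name__ == "__main__":
    # Only the three config names mentioned in this issue are used here.
    sample_names = ["shot_noise", "impulse_noise", "glass_blur"]
    print(match_config_names(sample_names, "*_noise"))
```

Calling load_matching_configs("mnist_corrupted", "*_noise", split="train", shuffle_files=True) would then stand in for the wildcard call proposed above. Because the configs are concatenated one after another, shuffle_files only shuffles within each config; mixing examples across configs still needs an explicit .shuffle(buffer_size) on the combined dataset, which is the part that remains awkward with this approach.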

Dec 31 '20 13:12 MiWeiss

I also want to know how to do this. I want to be able to use only specific types of corruption, combine only those and then use that combined dataset as normal. I also want to be able to combine it with my own version of corruption if possible.

Apr 25 '22 06:04 sakgoyal