Cannot combine split merging and streaming?
this does not work:
dataset = datasets.load_dataset('mc4','iw',split='train+validation',streaming=True)
with error:
ValueError: Bad split: train+validation. Available splits: ['train', 'validation']
these work:
dataset = datasets.load_dataset('mc4','iw',split='train+validation')
dataset = datasets.load_dataset('mc4','iw',split='train',streaming=True)
dataset = datasets.load_dataset('mc4','iw',split='validation',streaming=True)
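I could not find a reference to this in the documentation, and the error message is confusing: it lists 'train' and 'validation' as available, but does not say that combining them with '+' is unsupported when streaming. It would also be nice to allow streaming for merged splits.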
Hi ! That's missing indeed. We'll try to implement this for the next version :)
I guess we just need to implement #2564 first, and then we should be able to add support for split combinations.
Is there an update on this? I ran into the same issue on 2.17.1.
On a similar note, the keyword split="all" also does not work as intended when streaming=True.
No update so far, mainly because we haven't implemented an efficient way to query slices such as split='train[50%:]'. Adding two splits together should be easy though, since we have concatenate_datasets().
Can you call concatenate_datasets on datasets that are being streamed now? I was led to believe concatenation works on non-streaming datasets only.
Yes, concatenate_datasets works for datasets loaded in streaming mode as well.
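For anyone landing here, a minimal sketch of that workaround, assuming each split can be loaded in streaming mode on its own (as in the working calls from the original post). The helper name stream_combined_splits is hypothetical and not part of the datasets library; only load_dataset and concatenate_datasets are real calls.

```python
def stream_combined_splits(path, name, splits):
    """Return one streaming dataset that yields the given splits back to back.

    Hypothetical helper, not part of the `datasets` library.
    """
    # Imported inside the function so the helper can be defined
    # even where `datasets` is not installed.
    from datasets import concatenate_datasets, load_dataset

    # split='train+validation' is rejected when streaming=True,
    # so each split is loaded separately in streaming mode.
    parts = [
        load_dataset(path, name, split=split, streaming=True)
        for split in splits
    ]
    # Per the thread, concatenate_datasets also accepts streamed datasets.
    return concatenate_datasets(parts)
```

It could then be called as stream_combined_splits('mc4', 'iw', ['train', 'validation']). This only covers adding whole splits together; slices such as 'train[50%:]' are not handled.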