cannot combine splits merging and streaming?

Open eyaler opened this issue 4 years ago • 5 comments

this does not work:
dataset = datasets.load_dataset('mc4','iw',split='train+validation',streaming=True)
with error:
ValueError: Bad split: train+validation. Available splits: ['train', 'validation']

these work:
dataset = datasets.load_dataset('mc4','iw',split='train+validation')
dataset = datasets.load_dataset('mc4','iw',split='train',streaming=True)
dataset = datasets.load_dataset('mc4','iw',split='validation',streaming=True)

i could not find a reference to this in the documentation and the error message is confusing. also would be nice to allow streaming for the merged splits

eyaler avatar Jul 22 '21 01:07 eyaler

Hi ! That's missing indeed. We'll try to implement this for the next version :)

I guess we just need to implement #2564 first, and then we should be able to add support for splits combinations

lhoestq avatar Jul 22 '21 08:07 lhoestq

is there an update on this? ran into the same issue on 2.17.1.

On a similar note, the keyword split="all" also does not work as intended when streaming=True.

kdonbekci avatar Mar 21 '24 03:03 kdonbekci

No update so far, especially since we haven't implemented an efficient way to query split=train[50%:] for example. The addition of two splits should be easy though, since we have concatenate_datasets()

lhoestq avatar Mar 21 '24 09:03 lhoestq

Can you use concatenate_datasets on datasets that are being streamed now? I was led to believe concatenation works on non-streaming datasets only.

prajwalpkn avatar Apr 05 '24 00:04 prajwalpkn

Yes, concatenate_datasets works for datasets loaded in streaming mode as well

lhoestq avatar Apr 08 '24 13:04 lhoestq