Requesting a specific split (e.g., test) still downloads data for all splits (train, test, val) when streaming=False.
Describe the bug
When using load_dataset() from the datasets library (in load.py), specifying a particular split (e.g., split="train") still results in downloading data for all splits when streaming=False. This happens during the builder_instance.download_and_prepare() call. This behavior leads to unnecessary bandwidth usage and longer download times, especially for large datasets, even if the user only intends to use a single split.
Steps to reproduce the bug
from datasets import load_dataset

# hf_token holds your Hugging Face access token
dataset_name = "skbose/indian-english-nptel-v0"
dataset = load_dataset(dataset_name, token=hf_token, split="test")
Expected behavior
When a specific split is requested with streaming=False, only the data for that split should be downloaded.
Environment info
Dataset: skbose/indian-english-nptel-v0
Platform: M1 Apple Silicon
Python version: 3.12.9
datasets>=3.5.0
Hi ! There was a PR open to improve this: https://github.com/huggingface/datasets/pull/6832 but it hasn't been continued so far.
It would be a cool improvement though !
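In the meantime, a possible workaround is to pass data_files so that only the files for one split are resolved and downloaded. The sketch below assumes the repo stores its shards as data/<split>-* (a common layout for datasets pushed to the Hub, but not verified for this dataset); split_data_files and load_single_split are hypothetical helper names, not part of the datasets API.

```python
from typing import Dict, Optional


def split_data_files(split: str, data_dir: str = "data") -> Dict[str, str]:
    """Map a split name to a glob matching only that split's files.

    Assumption: shards are named '<data_dir>/<split>-*'. Check the repo's
    file listing and adjust the pattern if the layout differs.
    """
    return {split: f"{data_dir}/{split}-*"}


def load_single_split(repo_id: str, split: str, token: Optional[str] = None):
    # Imported lazily so split_data_files stays usable without `datasets`.
    from datasets import load_dataset

    # data_files limits which files in the repo are considered, so files
    # belonging to other splits should not be fetched.
    return load_dataset(
        repo_id,
        data_files=split_data_files(split),
        split=split,
        token=token,
    )


# Example call (requires network access and, for gated repos, a token):
# test_ds = load_single_split("skbose/indian-english-nptel-v0", "test", token=hf_token)
```

If the glob matches nothing, load_dataset will raise an error, which is a quick way to tell that the assumed file layout is wrong for the repo.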
Been having this problem with datasets and dataloader for a while.