Requesting a specific split (e.g., test) still downloads all splits (train, test, val) when streaming=False.

Open s3pi opened this issue 8 months ago • 2 comments

Describe the bug

When using load_dataset() from the datasets library (in load.py), specifying a particular split (e.g., split="train") still results in downloading data for all splits when streaming=False. This happens during the builder_instance.download_and_prepare() call. This behavior leads to unnecessary bandwidth usage and longer download times, especially for large datasets, even if the user only intends to use a single split.

Steps to reproduce the bug

from datasets import load_dataset

# hf_token is assumed to hold a valid Hugging Face access token
dataset_name = "skbose/indian-english-nptel-v0"
dataset = load_dataset(dataset_name, token=hf_token, split="test")

Expected behavior

When a specific split is requested and streaming=False, the download logic should fetch only the files for that split.
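Until that is supported, two workarounds may help. This is a minimal sketch, not a confirmed fix: it assumes the repository stores each split in separate files that a glob pattern can match (the pattern "data/test-*.parquet" below is a hypothetical layout, so check the repo's file listing first). The helper names are made up for illustration; `data_files`, `split`, `streaming`, and `token` are existing `load_dataset` parameters.

```python
def split_only_kwargs(split, file_pattern):
    """Build load_dataset keyword arguments that point data_files at one split only."""
    return {"data_files": {split: file_pattern}, "split": split}


def load_single_split(repo_id, split, file_pattern, token=None):
    """Load one split, passing only that split's files to the builder.

    file_pattern is an assumption about the repo layout, e.g. "data/test-*.parquet".
    """
    # Imported lazily so split_only_kwargs can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, token=token, **split_only_kwargs(split, file_pattern))


def stream_single_split(repo_id, split, token=None):
    """Alternative: stream the split so data is fetched lazily during iteration."""
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True, token=token)
```

For the dataset in this report, that would be something like load_single_split("skbose/indian-english-nptel-v0", "test", "data/test-*.parquet", token=hf_token), with the pattern adjusted to the actual file names. The streaming variant returns an IterableDataset rather than a Dataset, so random access by index is not available.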

Environment info

Dataset: skbose/indian-english-nptel-v0
Platform: M1 Apple Silicon
Python version: 3.12.9
datasets>=3.5.0

s3pi avatar May 22 '25 11:05 s3pi

Hi ! There was a PR open to improve this: https://github.com/huggingface/datasets/pull/6832 but it hasn't been continued so far.

It would be a cool improvement though !

lhoestq avatar May 26 '25 18:05 lhoestq

I've been having this problem with datasets and dataloader for a while.

AbstractEyes avatar Nov 05 '25 16:11 AbstractEyes