
Customized BatchSampler + Litdata StreamingDataloader

Open · Phimos opened this issue 2 months ago · 1 comment

🚀 Feature

Is there any way to use a custom batch sampler together with the LitData StreamingDataLoader?

Motivation

I would like certain specific samples to be combined into the same batch. I can decide the contents of each batch before optimization starts and then try to load each pre-assigned batch as a single item with batch_size=1, but LitData doesn't work in that setup. I'm not sure whether that's because each item is too large: the samples in one batch add up to about 2.5 GB.
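To make the request concrete, here is a minimal sketch in plain Python of the kind of batch sampler meant here. It does not use LitData at all; the class name `PreassignedBatchSampler` and the index groups are hypothetical and only illustrate "decide each batch up front, then yield it as-is".

```python
from typing import Iterator, List, Sequence


class PreassignedBatchSampler:
    """Yield fixed, pre-computed lists of sample indices, one list per batch."""

    def __init__(self, batches: Sequence[Sequence[int]]) -> None:
        # Copy the groups so later changes to the caller's lists don't leak in.
        self.batches = [list(batch) for batch in batches]

    def __iter__(self) -> Iterator[List[int]]:
        # Batches come out in the order they were assigned, with no reshuffling.
        for batch in self.batches:
            yield list(batch)

    def __len__(self) -> int:
        return len(self.batches)


# Hypothetical grouping decided before training starts.
groups = [[0, 7, 3], [5, 1], [2, 4, 6]]
sampler = PreassignedBatchSampler(groups)
```

With a map-style PyTorch dataset, an object like this can be passed to `torch.utils.data.DataLoader` through its `batch_sampler` argument. The question is whether the streaming dataloader offers something equivalent.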

Phimos · Nov 20 '25 05:11

Hi @Phimos, at the moment LitData doesn't support plugging in a custom sampler directly. However, you can try overriding the internal _create_shuffler method to customize the shuffling/sampling behavior and see whether that covers your batching needs. Note that it is an internal method, so it may change between releases.

For reference: https://github.com/Lightning-AI/litData/blob/07705955e698f18a5921e173710a3c726c10b6d2/src/litdata/streaming/dataset.py#L274-L282

bhimrazy · Nov 20 '25 08:11