
Distributed training tutorial with DataLoader2

MatthewCaseres opened this issue 3 years ago • 3 comments

📚 The doc issue

I am not sure how to implement distributed training.

Suggest a potential alternative/fix

If there was a simple example that showed how to use DDP with the torchdata library it would be super helpful.

MatthewCaseres avatar Jul 25 '22 22:07 MatthewCaseres

I see some blog post that says to implement something like this -

train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

I'm not sure what the torchdata way of doing this is.

MatthewCaseres avatar Jul 25 '22 23:07 MatthewCaseres

CC @ejguan

VitalyFedyunin avatar Jul 26 '22 15:07 VitalyFedyunin

It makes sense. We should add a section on distributed training with DataPipe and the existing DataLoader. And, after DataLoader2 + DistributedReadingService reaches beta, we can add a tutorial for them as well. Edit: Unfortunately, DistributedReadingService is still WIP; it is needed to make DataPipe work with DataLoader2 for distributed training.

@MatthewCaseres The TLDR for distributed training with DataPipe: you should add a sharding_filter operation to your pipeline, similar to what the tutorial suggests for multiprocessing. Then, DataLoader will execute sharding over your pipeline automatically based on your distributed setting.
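
For reference, a minimal sketch of that pattern with the existing DataLoader (the data source and transform_fn here are placeholders for illustration, not part of the original suggestion):

from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

def transform_fn(x):
    return x * 2  # placeholder transform

# IterableWrapper(range(1000)) stands in for a real data source.
datapipe = IterableWrapper(range(1000)).shuffle().sharding_filter()
datapipe = datapipe.map(transform_fn)

# With a sharding_filter in the pipeline, DataLoader shards the data
# across distributed ranks and worker processes automatically.
loader = DataLoader(datapipe, batch_size=16, num_workers=2)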

ejguan avatar Jul 26 '22 16:07 ejguan

Hi @ejguan , is the status for this still WIP? Thanks

austinmw avatar Jan 11 '23 13:01 austinmw

@austinmw Thanks for asking. Feature-wise, pure distributed training is ready and I am working on a tutorial currently. TLDR of the tutorial would be:

import torch
import torch.distributed as dist
from torchdata.dataloader2 import DataLoader2, DistributedReadingService

datapipe = source_dp.shuffle().sharding_filter()  # Make sure distributed sharding is mutually exclusive and collectively exhaustive
datapipe = datapipe.map(transform_fn)
...

def main(rank, world_size):
    dist.init_process_group("gloo", rank=rank, world_size=world_size)  # backend required; use "nccl" for GPU training
    rs = DistributedReadingService()
    dl = DataLoader2(datapipe, reading_service=rs)
    for epoch in range(10):
        torch.manual_seed(epoch)  # Set determinism for all ranks
        for d in dl:
            ...

There is one more blocking feature for distributed + MP: keeping non-replicable DataPipes in the main process. Once that lands, I will add a tutorial covering the distributed + MP use case.

ejguan avatar Jan 11 '23 15:01 ejguan

Thanks for your response, and looking forward to reading your tutorial!

austinmw avatar Jan 11 '23 16:01 austinmw

Hi @ejguan , I noticed this is closed. Is there a tutorial for distributed + MP? I see a tutorial for DistributedReadingService, but I thought this would be implemented with SequentialReadingService?

austinmw avatar Feb 01 '23 17:02 austinmw

SequentialReadingService is still WIP. I will add a tutorial with MP + Dist after it's landed.
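
As it is still WIP, this is only a sketch of the intended usage, and the exact API may change before it lands:

from torchdata.dataloader2 import (
    DataLoader2,
    DistributedReadingService,
    MultiProcessingReadingService,
    SequentialReadingService,
)

mp_rs = MultiProcessingReadingService(num_workers=2)
dist_rs = DistributedReadingService()
# Chain the two: distributed sharding is applied first, then
# multiprocessing within each rank.
rs = SequentialReadingService(dist_rs, mp_rs)
dl = DataLoader2(datapipe, reading_service=rs)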

ejguan avatar Feb 01 '23 17:02 ejguan

This feature is tracked in https://github.com/pytorch/data/issues/911

ejguan avatar Feb 01 '23 17:02 ejguan