[Task] Support of multi-gpu DistributedDataParallel training
🚀 Feature request
There is an open issue reporting that the NVTabular PyTorch data loader does not correctly support torch.nn.parallel.DistributedDataParallel. This feature request is to expose the missing option in the T4Rec library.
Motivation
The torch.nn.parallel.DistributedDataParallel option is generally faster than torch.nn.DataParallel() because it uses multi-processing instead of multi-threading (more info here). So we should make sure that the T4Rec PyTorch API and the Merlin data loader work correctly when torch.nn.parallel.DistributedDataParallel is set.
Your contribution
T4Rec doesn't provide a standalone solution for multi-GPU support. However, since the PyTorch API leverages the HuggingFace Trainer class, one can use the multi-GPU training options supported by HF (see documentation).
Two options are available:
- The `DataParallel` strategy is working, and you can use it by setting the `CUDA_VISIBLE_DEVICES` environment variable. In this setting, the dataloader loads a batch from the dataset and splits it across the different GPUs, using multi-threading to process those chunks of data in parallel.
--> Set it in a notebook:
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
```
--> Set it in a command line script:
```shell
CUDA_VISIBLE_DEVICES=0,1 python $YOUR_SCRIPT --{arguments}
```
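To illustrate what the `DataParallel` strategy does with the visible devices, here is a minimal, self-contained PyTorch sketch (not T4Rec code; the model and shapes are made up for illustration, and it falls back to a single device when fewer than two GPUs are visible):

```python
import os
import torch

# Restrict the visible devices before any CUDA work happens
# (assumption: the machine actually has GPUs 0 and 1).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

model = torch.nn.Linear(16, 4)

if torch.cuda.device_count() > 1:
    # DataParallel splits each incoming batch along dim 0 across the
    # visible GPUs (one thread per replica) and gathers the outputs
    # back on the default device.
    model = torch.nn.DataParallel(model).cuda()

batch = torch.randn(8, 16)
if torch.cuda.is_available():
    batch = batch.cuda()

out = model(batch)
print(out.shape)  # torch.Size([8, 4])
```

Note that the batch size you configure is the global one: with two GPUs, each replica sees chunks of 4 rows in this example.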
- Another option is to run the pipeline using torch.nn.parallel.DistributedDataParallel(), but this is currently not working in T4Rec:
```shell
python -m torch.distributed.launch --nproc_per_node $N_GPUS $YOUR_SCRIPT --{arguments}
```
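For reference, the per-process setup that torch.distributed.launch expects inside `$YOUR_SCRIPT` looks roughly like this (a minimal sketch, independent of T4Rec; the MASTER_ADDR/MASTER_PORT defaults are only there so the snippet also runs as a single CPU process):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_wrap(model):
    # torch.distributed.launch sets RANK, WORLD_SIZE, MASTER_ADDR and
    # MASTER_PORT in the environment of every process it spawns; the
    # defaults below are fallbacks for a single-process dry run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))

    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    if torch.cuda.is_available():
        # One process per GPU: pin this replica to its device.
        torch.cuda.set_device(rank)
        model = model.cuda(rank)
        return DDP(model, device_ids=[rank])
    # CPU/gloo path: one model replica per process, no device pinning.
    return DDP(model)

model = setup_and_wrap(torch.nn.Linear(16, 4))
out = model(torch.randn(8, 16))
dist.destroy_process_group()
```

Unlike DataParallel, each DDP process loads and consumes its own shard of the data, which is why the data loader itself must be distribution-aware; that is the gap this issue tracks.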