[Task] Support of multi-gpu DistributedDataParallel training
🚀 Feature request
There is an open issue reporting that the NVTabular PyTorch data loader does not correctly support torch.nn.parallel.DistributedDataParallel. This feature request is to expose the missing option in the T4Rec library.
Motivation
The torch.nn.parallel.DistributedDataParallel option is generally faster than torch.nn.DataParallel() because it uses multi-processing instead of multi-threading (more info here). So we should make sure that the T4Rec PyTorch API and the Merlin data loader work correctly when torch.nn.parallel.DistributedDataParallel is set.
Your contribution
T4Rec doesn't provide a standalone solution for multi-GPU support. However, since the PyTorch API leverages the HuggingFace Trainer class, one can use the multi-GPU training options supported by HF (see documentation).
Two options are available:
- The `DataParallel` strategy is working, and you can use it by setting the `CUDA_VISIBLE_DEVICES` environment variable. In this setting, the dataloader loads a batch from the dataset and splits it across the different GPUs, using multi-threading to process those chunks of data in parallel.
--> Set it in a notebook:
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
```
--> Set it in a command line script:
```shell
CUDA_VISIBLE_DEVICES=0,1 python $YOUR_SCRIPT --{arguments}
```
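To illustrate what the `DataParallel` strategy does with the visible devices, here is a minimal, self-contained PyTorch sketch (not T4Rec code; the model and shapes are made up for illustration, and it falls back to a single device when fewer than two GPUs are visible):

```python
import os
import torch

# Restrict the visible devices before any CUDA work happens
# (assumption: the machine actually has GPUs 0 and 1).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

model = torch.nn.Linear(16, 4)

if torch.cuda.device_count() > 1:
    # DataParallel splits each incoming batch along dim 0 across the
    # visible GPUs (one thread per replica) and gathers the outputs
    # back on the default device.
    model = torch.nn.DataParallel(model).cuda()

batch = torch.randn(8, 16)
if torch.cuda.is_available():
    batch = batch.cuda()

out = model(batch)
print(out.shape)  # torch.Size([8, 4])
```

Note that the batch size you configure is the global one: with two GPUs, each replica sees chunks of 4 rows in this example.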
- Another option is to run the pipeline using torch.nn.parallel.DistributedDataParallel(), but this is currently not working in T4Rec:
```shell
python -m torch.distributed.launch --nproc_per_node $N_GPUS $YOUR_SCRIPT --{arguments}
```
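For reference, the per-process setup that torch.distributed.launch expects inside `$YOUR_SCRIPT` looks roughly like this (a minimal sketch, independent of T4Rec; the MASTER_ADDR/MASTER_PORT defaults are only there so the snippet also runs as a single CPU process):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_wrap(model):
    # torch.distributed.launch sets RANK, WORLD_SIZE, MASTER_ADDR and
    # MASTER_PORT in the environment of every process it spawns; the
    # defaults below are fallbacks for a single-process dry run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))

    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    if torch.cuda.is_available():
        # One process per GPU: pin this replica to its device.
        torch.cuda.set_device(rank)
        model = model.cuda(rank)
        return DDP(model, device_ids=[rank])
    # CPU/gloo path: one model replica per process, no device pinning.
    return DDP(model)

model = setup_and_wrap(torch.nn.Linear(16, 4))
out = model(torch.randn(8, 16))
dist.destroy_process_group()
```

Unlike DataParallel, each DDP process loads and consumes its own shard of the data, which is why the data loader itself must be distribution-aware; that is the gap this issue tracks.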