
-j vs --cpu/--gpu in ddp

Open godfrey-cw opened this issue 2 years ago • 1 comment

📚 Documentation

Link

https://pytorch.org/torchx/latest/components/distributed.html

What does it currently say?

It's not clear whether the --cpu and --gpu arguments are overridden by the -j argument, although in my testing (launching, then running top, etc.) it seems they are.

What should it say?

Both the docs and the --help output for dist.ddp could be clearer on this front. More generally, I am wondering whether there is a torchx equivalent of torchrun --standalone --nnodes=1 --nproc_per_node=auto ....

Why?

Clearly I wouldn't want --gpu=0 with -j 1x2, right? As such, the defaults listed in the docs and in --help are a little confusing.
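For instance, a command like the following (a hypothetical invocation, just combining the defaults as listed in the dist.ddp docs) reads as asking for 2 processes per node but 0 GPUs:

```shell
# -j 1x2 requests 1 node x 2 procs per node, while --gpu 0 requests no GPUs;
# it's unclear from the docs which of these wins
torchx run dist.ddp -j 1x2 --gpu 0 --script main.py
```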

godfrey-cw avatar Jul 05 '23 15:07 godfrey-cw

You could try torchx run dist.spmd -j $NNODES (without specifying the $NPROC_PER_NODE part). See: https://github.com/pytorch/torchx/blob/main/torchx/components/dist.py#L130

This will automatically set nproc_per_node to the number of GPUs in the resource specified by -h (named host). If you are running on an AWS instance, you can use any of these named resources as the -h argument (https://github.com/pytorch/torchx/blob/main/torchx/specs/named_resources_aws.py#L191). Otherwise, there are some generic ones mapped here: https://github.com/pytorch/torchx/blob/main/torchx/specs/named_resources_generic.py#L47-L50
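For example (the resource name below is just an illustration; check the links above for the names registered in your version of torchx):

```shell
# 2 nodes; nproc_per_node is derived from the GPU count of the named host
# passed via -h, so there's no need to spell out the NPROC_PER_NODE part
torchx run dist.spmd -j 2 -h aws_p4d.24xlarge --script train.py
```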

kiukchung avatar Jul 12 '23 20:07 kiukchung