-j vs --cpu/--gpu in ddp
📚 Documentation
Link
https://pytorch.org/torchx/latest/components/distributed.html
What does it currently say?
It is not clear whether the --cpu and --gpu arguments are overridden by the -j argument, although in my testing (launching and then running top, etc.) they appear to be.
What should it say?
Both the docs and the --help output for dist.ddp could be clearer on this front. More generally, I am wondering whether there is a torchx equivalent of torchrun --standalone --nnodes=1 --nproc_per_node=auto ....
Why?
Clearly I wouldn't want --gpu=0 combined with -j 1x2, right? As such, the defaults listed in the docs and the --help output are a little confusing.
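For concreteness, the invocation I have in mind is something like this (train.py is just a placeholder script):

```shell
# With the documented defaults (--gpu 0), does this launch 2 processes per node
# with no GPUs requested, or does -j 1x2 implicitly adjust the resource request?
torchx run dist.ddp -j 1x2 --script train.py
```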
You could try torchx run dist.spmd -j $NNODES (without specifying the $NPROC_PER_NODE part). See: https://github.com/pytorch/torchx/blob/main/torchx/components/dist.py#L130
This will automatically set nproc_per_node to the number of GPUs in the named host passed via -h. If you are running on an AWS instance, you can use any of these named resources as the -h argument (https://github.com/pytorch/torchx/blob/main/torchx/specs/named_resources_aws.py#L191).
Otherwise there are some generic ones mapped here: https://github.com/pytorch/torchx/blob/main/torchx/specs/named_resources_generic.py#L47-L50
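As a rough end-to-end sketch (the resource names and script are illustrative; check the linked files for the names actually registered in your torchx version):

```shell
# Single node; nproc_per_node is derived from the GPU count of the -h named host.
torchx run dist.spmd -j 1 -h aws_p4d.24xlarge --script train.py

# Off AWS, a generic named resource can be used instead:
torchx run dist.spmd -j 1 -h gpu.medium --script train.py
```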