SAM-Adapter-PyTorch

How to run on multiple machines?

AnnemSony opened this issue 2 years ago · 5 comments

AnnemSony · Jul 06 '23 04:07

Do you mean multiple GPUs?

tianrun-chen · Jul 09 '23 01:07

I have GPUs on multiple machines (i.e., a multi-node cluster). How can I run the training command in that setup?

AnnemSony · Jul 09 '23 04:07
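(For reference: with the same launcher, a run across two machines would look roughly like the sketch below. The node count, master address, and port are placeholder assumptions, not values from this thread, and whether multi-node training actually works for this repository still depends on train.py itself.)

# On the first machine (node rank 0), assumed reachable at 192.168.1.1:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nnodes 2 --node_rank 0 --nproc_per_node 4 --master_addr 192.168.1.1 --master_port 29500 train.py --config configs/demo.yaml

# On the second machine (node rank 1), pointing at the same master address and port:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nnodes 2 --node_rank 1 --nproc_per_node 4 --master_addr 192.168.1.1 --master_port 29500 train.py --config configs/demo.yaml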

Hi, I have 4 GPUs and am trying to tune the SAM-Adapter model. I used the command provided in the repo: CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch train.py --nnodes 1 --nproc_per_node 4 --config configs/demo.yaml

Training ran to completion, but I found that only one GPU is actually used! How can I solve this problem? (I have checked the torch documentation but have no idea how to debug it.) @tianrun-chen

chusheng0505 · Jul 11 '23 08:07

I also encountered this problem: only one card was used during distributed training. I also could not find the two parameters --nnodes 1 --nproc_per_node 4 among the arguments that train.py accepts. Why is that?

Bill-Ren · Jul 23 '23 07:07
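(A likely explanation, for reference: --nnodes and --nproc_per_node are options of the launcher, torch.distributed.launch, not of train.py, so they never appear in the script's own argument parser. The lines below are illustrative only.)

# The launcher parses its own options and only forwards what comes after the script path to train.py:
python -m torch.distributed.launch --help
# It then spawns one process per GPU and, in this legacy launcher's default mode, passes each process
# its local rank via a --local_rank argument appended to the script's arguments, which is why
# train.py does not need --nnodes or --nproc_per_node itself.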

Hi, I have 4 GPUs and am trying to tune the SAM-Adapter model. I used the command provided in the repo: CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch train.py --nnodes 1 --nproc_per_node 4 --config configs/demo.yaml

Training ran to completion, but I found that only one GPU is actually used! How can I solve this problem? (I have checked the torch documentation but have no idea how to debug it.) @tianrun-chen

I found a solution to the problem. The command should be run like this: CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nnodes 1 --nproc_per_node 4 train.py --config configs/demo.yaml --tag exp1. The launcher options (--nnodes, --nproc_per_node) have to come before train.py, so that torch.distributed.launch parses them itself instead of forwarding them to the script; you can check the usage of torch.distributed.launch for details.

Bill-Ren · Jul 24 '23 03:07
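(To confirm that the fix actually engages all four GPUs, one simple check, assuming nvidia-smi is available on the machine, is to watch GPU utilization and memory while training runs; all four devices should show activity:)

# refresh nvidia-smi every second while the training command runs in another shell
watch -n 1 nvidia-smi
# or, equivalently, let nvidia-smi loop on its own
nvidia-smi -l 1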