
When I run train.py, only one GPU shows utilization

csyuan opened this issue 5 years ago • 3 comments

I have 4 GPUs. When I run train.py with --num_samples 1 --gpu 4, only one GPU shows utilization. Is it because the model does not support multiple GPUs? By contrast, when I run search.py with --num_samples 16 --gpu 0.25, all GPUs show utilization.

csyuan avatar Mar 03 '20 06:03 csyuan

Although the code has only been tested with 1 GPU, you usually have to set CUDA_VISIBLE_DEVICES to constrain which GPUs the search uses.

Perhaps a newer version of Ray enforces this constraint itself?
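As a minimal sketch of the suggestion above (the variable names here are illustrative, not from the repo): CUDA_VISIBLE_DEVICES has to be set before any CUDA library initializes in the process, and device indices inside the process then map onto the listed physical GPUs in order.

```python
import os

# Must be set before torch/tensorflow/ray touch CUDA; afterwards
# in-process device 0 maps to physical GPU 0, device 1 to GPU 1, etc.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(len(visible))  # number of GPUs this process may see
```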

arcelien avatar Mar 03 '20 07:03 arcelien

I've added CUDA_VISIBLE_DEVICES=0,1,2,3 to the scripts. search.py is fine: there, gpu is a fractional value and num_samples > 1, so Ray runs many trials in parallel across the GPUs. But with train.py --num_samples=1 --gpu=4, i.e. gpu count > 1 and num_samples = 1, the request becomes resources_per_trial: {"gpu": 4}. Training the classification model is then very slow: it does not actually run on 4 GPUs, and only one GPU shows utilization.

So when --num_samples=1 and --gpu>1, the model cannot use multiple GPUs under Ray?
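The scheduling arithmetic being described can be sketched as follows (a simplified model, not Ray's actual scheduler; `concurrent_trials` is a hypothetical helper): Ray runs as many trials in parallel as the GPU budget allows, so fractional GPUs with many samples give parallelism, while one trial reserving all 4 GPUs is still just one process.

```python
def concurrent_trials(total_gpus, gpu_per_trial, num_samples):
    """How many trials Ray can schedule at once under this GPU budget."""
    # Each trial reserves gpu_per_trial GPUs; trials beyond the budget queue.
    return min(num_samples, int(total_gpus // gpu_per_trial))

# search.py case: 16 samples at 0.25 GPU each -> all 4 GPUs busy
print(concurrent_trials(4, 0.25, 16))  # 16

# train.py case: 1 sample reserving 4 GPUs -> a single trial process,
# which uses only one GPU unless the model itself is parallelized
print(concurrent_trials(4, 4, 1))  # 1
```

Reserving 4 GPUs for a trial only blocks other trials from them; it does not make the trainable inside the trial multi-GPU.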

csyuan avatar Mar 03 '20 07:03 csyuan

I see; parallelizing a single model across GPUs is unfortunately not supported.

arcelien avatar Mar 04 '20 03:03 arcelien