
Thanks! How to train Faster R-CNN with multiple GPUs on a server?

Open • zzk88862 opened this issue • 5 comments

zzk88862 (Apr 14 '18)

I could be wrong, but I believe you can specify distribution parameters. I don't know about local training, but for training on Google Cloud you'd just tack this onto your command line: --worker-count (the number of GPUs, or more if you want more than one worker per GPU).
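For instance, something along these lines; the `lumi cloud gc train` subcommand and the `config.yml` name are assumptions of mine, and only --worker-count comes from the note above:

```bash
# Sketch only: the subcommand and config path are assumed, not verified.
lumi cloud gc train --config config.yml --worker-count 2
```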

npeirson (Apr 19 '18)

OK, thanks for your answer. What I mean is how to train with multiple GPUs (cards) on a single server, not distributed training.

zzk88862 (Apr 20 '18)

I've only just started exploring Luminoth, but since it's still alpha I'm going to guess you'll need to interact directly with TensorFlow to do that. That being said, I don't think it's terribly difficult; it's pretty much a matter of replacing your single CPU or GPU placement with a loop like for gpu in [gpu-1, gpu-2, gpu-3, ... gpu-n]: or similar. Check out this page for an example.
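Roughly, a minimal TensorFlow 1.x sketch of that loop-over-devices idea might look like the block below; the toy linear model, batch slicing, and loss are placeholders of mine, not Luminoth's actual Faster R-CNN graph:

```python
# Minimal TF 1.x data-parallel sketch: one tower per GPU, variables shared
# across towers via variable-scope reuse. The model here is a stand-in.
import tensorflow as tf

NUM_GPUS = 2
BATCH_PER_GPU = 8
FEATURES = 32

def model_fn(x):
    # Toy linear model; real code would build the detection network here.
    w = tf.get_variable('w', [FEATURES, 1])
    b = tf.get_variable('b', [1], initializer=tf.zeros_initializer())
    return tf.matmul(x, w) + b

x = tf.placeholder(tf.float32, [NUM_GPUS * BATCH_PER_GPU, FEATURES])
y = tf.placeholder(tf.float32, [NUM_GPUS * BATCH_PER_GPU, 1])

tower_losses = []
for i in range(NUM_GPUS):
    with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=(i > 0)):
        # Each tower processes its own slice of the batch.
        start, end = i * BATCH_PER_GPU, (i + 1) * BATCH_PER_GPU
        pred = model_fn(x[start:end])
        tower_losses.append(tf.losses.mean_squared_error(y[start:end], pred))

# Average the per-tower losses; colocating gradients keeps each backward
# pass on the same GPU as its forward pass.
loss = tf.reduce_mean(tf.stack(tower_losses))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
    loss, colocate_gradients_with_ops=True)

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
```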

npeirson (Apr 20 '18)

Okay, thanks for your answer, I will try it.

zzk88862 (Apr 29 '18)

Thanks for your advice. I have tried multiple GPUs with `with tf.device('/gpu:%d' % i)`, but the process always gets 已杀死 ("Killed"). I have debugged it many times but have not solved it. Below are some of my run messages.

1. nvidia-smi output:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:81:00.0 Off |                    0 |
| N/A   61C    P0    24W /  75W |   4387MiB /  7606MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P4            Off  | 00000000:82:00.0 Off |                    0 |
| N/A   62C    P0    24W /  75W |   4881MiB /  7606MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     24215     C  /root/anaconda3/envs/new52/bin/python         4369MiB |
|    1     24215     C  /root/anaconda3/envs/new52/bin/python         4863MiB |
+-----------------------------------------------------------------------------+
```

2. Run result: [screenshot attached in the original issue]

zzk88862 (May 25 '18)
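A side note on the 已杀死 output: that is the Chinese-locale console message for "Killed", which on Linux usually means the kernel's OOM killer terminated the process for exhausting host RAM rather than GPU memory. A small diagnostic sketch, assuming plain TensorFlow 1.x session options rather than anything Luminoth-specific, to confirm device placement and avoid grabbing all GPU memory up front:

```python
# Hedged diagnostic sketch: log where ops are placed and allocate GPU
# memory on demand instead of all at once.
import tensorflow as tf

config = tf.ConfigProto(
    allow_soft_placement=True,    # fall back to CPU if an op lacks a GPU kernel
    log_device_placement=True)    # print the device chosen for every op
config.gpu_options.allow_growth = True  # grow GPU allocation as needed

with tf.Session(config=config) as sess:
    print(sess.run(tf.constant('placement check')))
```

Watching host memory during a run (for example with free -m, or checking dmesg for an oom-kill entry afterwards) should confirm whether system RAM is the actual culprit.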