ncclCommInitRank error for 2 GPUs from a single machine

Open jw447 opened this issue 6 years ago • 0 comments

I'm trying to run tf_cnn_benchmark.py on a Power9 machine.

When I ran the benchmark with Horovod using 1 GPU, it worked fine.

When I tried to use 2 GPUs on a single node, I got the error "ncclCommInitRank failed: unhandled cuda error".

Then I ran the same benchmark with 2 GPUs, each from a different node, and it worked fine.

So how do I use multiple GPUs on a single node with Horovod, or do I have to use another distributed learning API?

jw447, Mar 13 '19 21:03