Andrew Wei
Yeah, the first screenshot shows it running for ten iterations, so it can run EVAL_TYPE=benchmark with multiple GPUs.
Yeah, sorry for the late reply. It takes a while to set up each time and I was a bit busy over the last few days.
Here's the log with both NCCL_DEBUG=INFO and BYTEPS_LOG_LEVEL=TRACE:
```
BytePS launching worker
training mnist...
training mnist...
training mnist...
training mnist...
[2019-12-01 06:56:10.396131: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2019-12-01 06:56:10.396279: D byteps/common/communicator.cc:157]...
```
I ran it a few more times and they all died on key 1048576.
It ended up training for one epoch and then it crashed again. It ended with something like this:
```
Train Epoch: 1 [14720/15000 (98%)]	Loss: 0.529198
Train Epoch: 1 [14720/15000...
```
I think test loss is still a float at that point because it was initialized as a float, since I'm getting `AttributeError: 'float' object has no attribute 'cuda'`. I'll try making it...
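For what it's worth, here's a minimal sketch of what I think is going on; the variable name `test_loss` and the `torch.tensor(...)` fix are my assumptions about the mnist example, not its actual code:

```python
# test_loss starts out as a plain Python float in the test loop:
test_loss = 0.0

# Accumulating per-batch loss values (e.g. via loss.item()) keeps it a float,
# so a later call like test_loss.cuda() raises:
#   AttributeError: 'float' object has no attribute 'cuda'
print(hasattr(test_loss, "cuda"))  # → False

# Possible fix (assumes PyTorch is available): wrap the float in a tensor
# before moving it to the GPU, e.g.
#   test_loss = torch.tensor(test_loss).cuda()
```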
It's still dying there:
```
[2019-12-01 09:49:15.741209: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Gradient.conv1.bias_0 key: 0 rank: 0
[2019-12-01 09:49:15.741213: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Gradient.conv1.bias_0 key: 0 rank: 1
[2019-12-01...
```
Oh, hey, it's training! Thanks!