Andrew Wei
Yeah, the first screenshot shows it running for ten iterations, so it can run EVAL_TYPE=benchmark with multiple GPUs.
Yeah, sorry for the late reply. It takes a while to set up each time and I was a bit busy over the last few days.
Here's the log with both NCCL_DEBUG=INFO and BYTEPS_LOG_LEVEL=TRACE:
```
BytePS launching worker
training mnist...
training mnist...
training mnist...
training mnist...
[2019-12-01 06:56:10.396131: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2019-12-01 06:56:10.396279: D byteps/common/communicator.cc:157]...
```
I ran it a few more times and they all died on key 1048576.
It ended up training for one epoch and then it crashed again. It ended with something like this:
```
Train Epoch: 1 [14720/15000 (98%)]	Loss: 0.529198
Train Epoch: 1 [14720/15000...
```
I think test loss is still a float at that point because it was initialized as a float, since I'm getting `AttributeError: 'float' object has no attribute 'cuda'`. I'll try making it...
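For what it's worth, here's a minimal sketch of what I think is going on; the variable name `test_loss` and the `torch.tensor(...)` fix are my assumptions about the mnist example, not its actual code:

```python
# test_loss starts out as a plain Python float in the test loop:
test_loss = 0.0

# Accumulating per-batch loss values (e.g. via loss.item()) keeps it a float,
# so a later call like test_loss.cuda() raises:
#   AttributeError: 'float' object has no attribute 'cuda'
print(hasattr(test_loss, "cuda"))  # → False

# Possible fix (assumes PyTorch is available): wrap the float in a tensor
# before moving it to the GPU, e.g.
#   test_loss = torch.tensor(test_loss).cuda()
```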
It's still dying there:
```
[2019-12-01 09:49:15.741209: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Gradient.conv1.bias_0 key: 0 rank: 0
[2019-12-01 09:49:15.741213: T byteps/common/scheduled_queue.cc:153] Queue BROADCAST getTask(key): byteps.Gradient.conv1.bias_0 key: 0 rank: 1
[2019-12-01...
```
Oh, hey, it's training! Thanks!