Yibo Zhu comments

Results: 134 comments of Yibo Zhu

MXNet Allgather

We may implement it in a few months.

Do i still need to set kv_store in using mxnet? why?

Below are some numbers. The following experiments were performed on a public cloud with a 20 Gbps network. Each machine has 8 Tesla V100 16GB GPUs (NVLink-enabled). The batch size...

Do i still need to set kv_store in using mxnet? why?

Yes. You can try them yourself. The original ps-lite implementation is pretty poor -- it is slower than Horovod, let alone BytePS.

a naïve questions on using byteps for distributed training

There is a launcher you can try; see the README in the folder https://github.com/bytedance/byteps/tree/master/launcher

Pytorch Docker image fails to train MNIST with multiple GPUs

Hello @nowei, would you confirm that you can run EVAL_TYPE=benchmark with multiple GPUs? If so, we can narrow the problem down to `train_mnist_byteps.py`.

Pytorch Docker image fails to train MNIST with multiple GPUs

Would you set NCCL_DEBUG=INFO and run again? You may also set BYTEPS_LOG_LEVEL=INFO or even BYTEPS_LOG_LEVEL=TRACE. Then please paste the logs (they may be very long if you set BYTEPS_LOG_LEVEL=TRACE). Thanks.
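One way to capture those logs in a single file is sketched below. It only sets the two environment variables named above (`NCCL_DEBUG` and `BYTEPS_LOG_LEVEL`) and redirects the output. The helper names `debug_env` and `run_with_logs` are hypothetical. The command you pass in is an assumption too, since how the training script is launched depends on your setup.

```python
import os
import subprocess
import sys


def debug_env(byteps_level="INFO", nccl_level="INFO", base=None):
    """Return a copy of the environment with the two debug variables set.

    `base` defaults to the current process environment; it is a parameter
    only so the helper can be exercised in isolation.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = nccl_level
    env["BYTEPS_LOG_LEVEL"] = byteps_level
    return env


def run_with_logs(cmd, log_path, **levels):
    """Run `cmd` with debug logging enabled; write stdout and stderr to `log_path`.

    Returns the exit code of the command so a crash is still visible.
    """
    with open(log_path, "w") as log_file:
        result = subprocess.run(
            cmd,
            env=debug_env(**levels),
            stdout=log_file,
            stderr=subprocess.STDOUT,  # interleave stderr with stdout in one file
        )
    return result.returncode
```

For example, `run_with_logs([sys.executable, "train_mnist_byteps.py"], "byteps_trace.log", byteps_level="TRACE")` would leave everything in `byteps_trace.log`, which can then be attached to the issue instead of pasted inline. Replace the command list with whatever you actually use to start training.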

Pytorch Docker image fails to train MNIST with multiple GPUs

@nowei Thank you. You are right. INFO does not give anything new. The useful level is DEBUG. However, TRACE would include anything that DEBUG outputs, so what you have is...

Pytorch Docker image fails to train MNIST with multiple GPUs

@nowei If you repeat the run multiple times with TRACE logs, does it always die on the key `1048576`? From the logs you pasted, you can see that the last few lines...

Pytorch Docker image fails to train MNIST with multiple GPUs

Thanks. This is very helpful. So, it's a deterministic bug. There has to be something special about this tensor, `byteps.Parameter.dampening.0_0`.

Pytorch Docker image fails to train MNIST with multiple GPUs

@nowei Would you do us one more favor? Comment out this line and try again: https://github.com/bytedance/byteps/blob/master/example/pytorch/train_mnist_byteps.py#L109