yangshuo0323 comments

Results: 9 comments by yangshuo0323

Problem in BERT pre-training: how to train on multiple GPUs

> Please provide the complete error message

The whole message:

```
[1,5]:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,4]:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,7]:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,6]:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,2]:[21:43:11] src/storage/storage.cc:110: Using...
```

Problem in BERT pre-training: how to train on multiple GPUs

First, I want to confirm: is my method correct for pre-training a BERT model on multiple GPUs? @leezu

Problem in BERT pre-training: how to train on multiple GPUs

> > ```
> > Software environment: Python: 3.7.7, CUDA: 10.2
> > Install MXNet: pip install mxnet-cu102, version is 1.7.0
> > Download model script: https://github.com/dmlc/gluon-nlp, which branch...
> > ```

Problem in BERT pre-training: how to train on multiple GPUs

I think my `mpirun` setup may be wrong, for example the optional parameters:

```
mpirun -np 8 -H localhost:8 -mca pml ob1 -mca btl ^openib \
    -mca btl_tcp_if_exclude docker0,lo --map-by ppr:4:socket...
```

Problem in BERT pre-training: how to train on multiple GPUs

> I have no idea about the 2.0 branch. We may just delete it.
>
> @yangshuo0323 Feel free to try out the BERT pretraining code in https://github.com/dmlc/gluon-nlp/tree/master/scripts/pretraining/bert

I have tried...

Problem in BERT pre-training: how to train on multiple GPUs

OK, I will try out the new version of MXNet and GluonNLP. Thank you so much!

> That should work. In fact, is it feasible to try out our new...

Problem in BERT pre-training: how to train on multiple GPUs

The previous error was caused by an incorrect Horovod installation, which may not have been built with the `HOROVOD_WITH_MXNET` environment variable set. Thanks to everyone who gave me advice above. I will enjoy to...
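
An installation problem of this kind can usually be detected before launching a multi-GPU job by trying to import Horovod's MXNet binding. The following is a minimal sketch, not part of the original thread. It assumes that importing `horovod.mxnet` fails with an `ImportError` when Horovod was built without MXNet support. The helper name `horovod_has_mxnet` is hypothetical, and the reinstall command in the message is only a suggestion.

```python
import importlib


def horovod_has_mxnet() -> bool:
    """Return True if the horovod.mxnet binding can be imported.

    Hypothetical helper: it only checks importability and does not
    start Horovod or touch any GPU.
    """
    try:
        importlib.import_module("horovod.mxnet")
    except (ImportError, OSError):
        # Horovod is missing, or its MXNet extension was not built or cannot load.
        return False
    return True


if __name__ == "__main__":
    if horovod_has_mxnet():
        print("horovod.mxnet is importable")
    else:
        # Suggested fix (an assumption about the environment):
        print("horovod.mxnet is not importable; consider reinstalling with "
              "HOROVOD_WITH_MXNET=1 pip install --no-cache-dir horovod")
```

Running the check in the same Python environment that `mpirun` launches helps rule out a mismatch between interpreters.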

gluon-mxnet-bert: slow multi-node training speed

> * When running via MPI, you can add the parameter `-x NCCL_DEBUG=INFO` to view the NCCL output
>
> ```shell
> mpirun -oversubscribe -np ${gpu_num} -H ${node_ip} \
>   -bind-to none -map-by numa \
>   -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
> ...
> ```

gluon-mxnet-bert: slow multi-node training speed

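> Hello, we have not encountered the same error. However, judging from the error message:
> `Segmentation fault: address not mapped to object at address 0x30`
> it looks like an out-of-bounds memory access problem. You could check the official MXNet issues to see whether anything similar has been reported.

OK, thanks a lot~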