yangshuo0323
> Please provide the complete error message the whole message: ``` [1,5]:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager. [1,4]:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager. [1,7]:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager. [1,6]:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager. [1,2]:[21:43:11] src/storage/storage.cc:110: Using...
First, I want to confirm: is my method correct for pre-training a BERT model on multiple GPUs? @leezu
> > ``` > > Software environment: Python: 3.7.7, CUDA: 10.2 > > Install MXNet: pip install mxnet-cu102, version is 1.7.0 > > Download Model script: https://github.com/dmlc/gluon-nlp, which branch...
I think my 'mpirun' setup may be wrong, for example the optional parameters: ``` mpirun -np 8 -H localhost:8 -mca pml ob1 -mca btl ^openib \ -mca btl_tcp_if_exclude docker0,lo --map-by ppr:4:socket...
> I have no idea about the 2.0 branch. We may just delete it. > > @yangshuo0323 Feel free to try out the BERT pretraining code in https://github.com/dmlc/gluon-nlp/tree/master/scripts/pretraining/bert I have tried...
Ok, I will try out the new version of MXNet and GluonNLP. Thank you so much! > That should work. In fact, is it feasible to try out our new...
The previous error was caused by an incorrect Horovod installation, which may not have been built with the `HOROVOD_WITH_MXNET` environment variable set. Thanks to everyone who gave me advice above. I will enjoy to...
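For anyone hitting the same problem, a quick way to see whether Horovod's MXNet extension is usable is to try importing it. The snippet below is only a minimal sketch: the helper name `horovod_mxnet_status` is made up for illustration, and it assumes that a Horovod build without MXNet support fails at import time. If the import fails, the Horovod docs describe reinstalling with `HOROVOD_WITH_MXNET=1 pip install --no-cache-dir horovod`, and `horovodrun --check-build` lists the frameworks that were compiled in.

```python
import importlib


def horovod_mxnet_status():
    """Return (ok, detail) describing whether horovod.mxnet can be imported.

    Hypothetical helper for illustration; it is not part of Horovod or GluonNLP.
    """
    try:
        importlib.import_module("horovod.mxnet")
    except Exception as exc:
        # Typically an ImportError when Horovod is missing or was built
        # without the MXNet extension.
        return False, f"{type(exc).__name__}: {exc}"
    return True, "horovod.mxnet imported successfully"


if __name__ == "__main__":
    ok, detail = horovod_mxnet_status()
    print("OK" if ok else "NOT AVAILABLE", "-", detail)
```

Running this inside the same Python environment that `mpirun` launches is the point; a different environment can give a different answer.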
> * When running via MPI, you can add the parameter `-x NCCL_DEBUG=INFO` to view the NCCL output > > ```shell > mpirun -oversubscribe -np ${gpu_num} -H ${node_ip} \ > -bind-to none -map-by numa \ > -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \ >...
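If the `mpirun` command is assembled from a Python launcher script, the same flag can be added programmatically. This is a rough sketch, not the command from the quoted reply: `with_nccl_debug` is a hypothetical helper and `train.py` is a placeholder script name. It relies only on the fact that Open MPI's `-x VAR=value` option exports an environment variable to the launched processes and has to appear before the program to run.

```python
from typing import List


def with_nccl_debug(mpirun_argv: List[str], level: str = "INFO") -> List[str]:
    """Return a copy of an mpirun argv with `-x NCCL_DEBUG=<level>` added.

    Hypothetical helper: the option is inserted directly after the launcher
    so that it precedes the program being launched.
    """
    if not mpirun_argv:
        raise ValueError("mpirun_argv must not be empty")
    export = ["-x", f"NCCL_DEBUG={level}"]
    return [mpirun_argv[0], *export, *mpirun_argv[1:]]


if __name__ == "__main__":
    # `train.py` is a placeholder, not a script from this thread.
    base = ["mpirun", "-np", "8", "python", "train.py"]
    print(" ".join(with_nccl_debug(base)))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting problems.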
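> Hello, we have not run into the same error. However, judging from the error message: > `Segmentation fault: address not mapped to object at address 0x30` > it looks like an out-of-bounds memory access problem? You could check the official MXNet issues to see whether anything similar has been reported. OK, thanks a lot~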
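One generic way to narrow down where a segmentation fault like this is triggered is Python's standard `faulthandler` module, which dumps the Python-level traceback when the process receives SIGSEGV. The sketch below is an assumption about how one might wire it in, not something suggested in the thread: `enable_crash_traceback` and the log path are hypothetical, and it only shows the Python call stack, not the native frame inside MXNet.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Dump Python tracebacks of all threads to `log_path` on a fatal signal.

    Hypothetical helper. The returned file object must stay open for as long
    as the handler is enabled, so the caller should keep a reference to it.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # Placeholder path; with several MPI ranks a per-rank file name is safer.
    crash_log = enable_crash_traceback("crash_traceback.log")
    print("faulthandler enabled:", faulthandler.is_enabled())
```

Calling this at the very top of the training script, before importing MXNet, means a crash during library initialization is also captured.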