gluon-mxnet-bert: slow multi-node training

Open Flowingsun007 opened this issue 5 years ago • 4 comments

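Introduction

Horovod is a library for multi-node distributed training with PyTorch, TensorFlow, and MXNet. Its inter-node communication relies on NCCL or MPI, so NCCL and Open MPI usually need to be installed first, along with at least one deep learning framework, for example MXNet: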

python3 -m pip install gluonnlp==0.10.0 mxnet-cu102mkl==1.6.0.post0 -i https://mirror.baidu.com/pypi/simple

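Solution

Once the dependencies are installed, Horovod itself can be installed. The NCCL-related variables must be set when installing Horovod; otherwise it may not use NCCL for communication at runtime, which makes training very slow. Detailed installation steps: https://github.com/horovod/horovod/blob/master/docs/gpus.rst

  • When installing Horovod, set the NCCL-related variables: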
HOROVOD_WITH_MXNET=1  HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL

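If Horovod is installed without these variables, jobs launched with horovodrun still run, but very slowly, because the underlying communication goes through MPI instead of NCCL.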
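One way to confirm that an existing install was actually built with NCCL is `horovodrun --check-build`, which prints a checklist of the frameworks, controllers, and tensor operations compiled in. The sketch below parses that checklist. The helper names (`parse_check_build`, `nccl_enabled`, `installed_build_has_nccl`) are made up for illustration, and the expected layout (`[X] NCCL` under an "Available Tensor Operations:" heading) is an assumption based on recent Horovod releases, so it may need adjusting for other versions.

```python
import re
import subprocess


def parse_check_build(text):
    """Parse `horovodrun --check-build` output into {section: {name: enabled}}.

    Assumed layout: section headings ending in ':' followed by
    indented '[X] Name' (built in) or '[ ] Name' (not built in) lines.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        item = re.match(r"\[(X| )\]\s+(.+)", stripped)
        if item and current is not None:
            sections[current][item.group(2).strip()] = item.group(1) == "X"
        elif stripped.endswith(":") and not stripped.startswith("Horovod"):
            # A heading such as 'Available Tensor Operations:'.
            current = stripped[:-1]
            sections[current] = {}
    return sections


def nccl_enabled(text):
    """Return True if NCCL is ticked under 'Available Tensor Operations'."""
    ops = parse_check_build(text).get("Available Tensor Operations", {})
    return ops.get("NCCL", False)


def installed_build_has_nccl():
    """Run horovodrun on this machine and check its build report."""
    result = subprocess.run(
        ["horovodrun", "--check-build"],
        capture_output=True, text=True, check=True,
    )
    return nccl_enabled(result.stdout)
```

If `installed_build_has_nccl()` returns False on a GPU node, reinstalling Horovod with the variables above set is the first thing to try.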

Other notes

  • Add the --log-level option at runtime, set to INFO or DEBUG, to see detailed output:
horovodrun -np ${gpu_num} -H ${node_ip}   -p ${PORT} \
--start-timeout 600 --log-level INFO \
python3 ${WORKSPACE}/run_pretraining.py ${CMD} 2>&1 | tee ${log_file}
  • When launching through MPI, add -x NCCL_DEBUG=INFO to see NCCL's output:
mpirun -oversubscribe -np ${gpu_num} -H ${node_ip} \
    -bind-to none -map-by numa \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    -mca plm_rsh_args "-p 22 -q -o StrictHostKeyChecking=no" \
    -mca btl_tcp_if_include ib0 \
python3 ${WORKSPACE}/run_pretraining.py ${CMD} 2>&1 | tee ${log_file}
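With NCCL_DEBUG=INFO set, the file written by `tee` should contain lines tagged `NCCL INFO`. If none appear, the job is probably not using NCCL at all. Below is a minimal sketch of a log scanner. The function name `summarize_nccl_log` is hypothetical, and the `NET/<transport>` pattern is an assumption about how NCCL reports its network backend (for example `NET/IB` or `NET/Socket`); the exact wording varies between NCCL versions.

```python
import re

# Matches the tag NCCL puts on its debug lines, e.g.
# 'node1:100:120 [0] NCCL INFO NET/IB : Using ...'
NCCL_LINE = re.compile(r"NCCL (INFO|WARN)\s+(.*)")


def summarize_nccl_log(lines):
    """Count NCCL INFO/WARN lines and collect the network transports named."""
    summary = {"info": 0, "warn": 0, "transports": set()}
    for line in lines:
        match = NCCL_LINE.search(line)
        if not match:
            continue  # ordinary training output, not an NCCL line
        level, message = match.groups()
        summary["info" if level == "INFO" else "warn"] += 1
        transport = re.match(r"NET/(\w+)", message)
        if transport:
            summary["transports"].add(transport.group(1))
    return summary
```

To use it on a real run, open the log and pass the file object: `with open(log_file) as f: print(summarize_nccl_log(f))`, where `log_file` is the path given to `tee`.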

Flowingsun007, Sep 08 '20 15:09

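We also observed the following during training:

  • 1. With horovodrun, all 48 cores are at 100% during data loading; once training proper starts they are no longer saturated (roughly, all 48 cores stay below 50% utilization).
  • 2. mpirun -bind-to none behaves the same as above. mpirun -bind-to core restricts the processes so that, on 2 machines, data loading and training occupy only 16 cores, but it is somewhat slower than the saturated case (about 30% by rough visual estimate).
  • 3. With horovodrun, only 1 or 2 runs out of a group of 7 finish normally, whereas mpirun is much more stable and rarely fails. The exact cause is undetermined (horovodrun may be less stable, or it may have hit port communication errors).

mpirun options are documented at: https://www.open-mpi.org/doc/current/man1/mpirun.1.php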
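To see what a given -bind-to setting actually does to each rank, the training script can report its own CPU affinity at startup. This is a minimal sketch, assuming Linux (`os.sched_getaffinity` is not available elsewhere) and an Open MPI launch, which exports `OMPI_COMM_WORLD_RANK` and `OMPI_COMM_WORLD_LOCAL_RANK`. The function name `affinity_report` is made up for illustration.

```python
import os


def affinity_report():
    """Describe which CPU cores the current process is allowed to run on."""
    # Open MPI sets these for every launched process; '?' if run standalone.
    rank = os.environ.get("OMPI_COMM_WORLD_RANK", "?")
    local_rank = os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "?")
    # Linux only: the set of core ids this process may be scheduled on.
    cores = sorted(os.sched_getaffinity(0))
    return {
        "rank": rank,
        "local_rank": local_rank,
        "num_cores": len(cores),
        "cores": cores,
    }
```

Calling `print(affinity_report())` near the top of the script should show a single core per rank under -bind-to core and the full core list under -bind-to none, which makes the difference described in item 2 easy to verify. Data-loader worker processes inherit the same mask, so a narrow binding limits them as well.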

Flowingsun007, Sep 08 '20 16:09

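@Flowingsun007 Hi, I see you got multi-GPU training working with MXNet. I also want to train BERT on multiple GPUs with MXNet, but when I use the mpirun command above (the one with -x NCCL_DEBUG=INFO) I get a segmentation fault. Have you run into this?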

[node123:209584:0:209584] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
==== backtrace ====
    0  /usr/lib/libucs.so.0(+0x1fcec) [0x7f26eaf57cec]
    1  /usr/lib/libucs.so.0(+0x1ff64) [0x7f26eaf57f64]
    2  /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f29dcf2dd44]
    3  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f297f1dd564]
    4  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f297f1e0790]
    5  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f297f1d8ed1]
    6  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f297f1b39d4]
    7  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f270014e18f]
    8  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f2700145d84]
    9  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f29dbf9a9dd]
   10  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f29dbf9a067]
   11  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f29dd18b27e]
   12  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f29dd18bcb4]
   13  python(_PyObject_FastCallKeywords+0x48b) [0x5571c7a7600b]
   14  python(_PyEval_EvalFrameDefault+0x51d1) [0x5571c7ada9a1]
   15  python(_PyEval_EvalCodeWithName+0x2f9) [0x5571c7a1e2b9]
   16  python(_PyFunction_FastCallKeywords+0x387) [0x5571c7a6e497]
   17  python(_PyEval_EvalFrameDefault+0x14ea) [0x5571c7ad6cba]
   18  python(_PyEval_EvalCodeWithName+0x2f9) [0x5571c7a1e2b9]
   19  python(_PyFunction_FastCallKeywords+0x387) [0x5571c7a6e497]
   20  python(_PyEval_EvalFrameDefault+0x14ea) [0x5571c7ad6cba]
   21  python(_PyFunction_FastCallKeywords+0xfb) [0x5571c7a6e20b]
   22  python(_PyEval_EvalFrameDefault+0x416) [0x5571c7ad5be6]
   23  python(_PyEval_EvalCodeWithName+0x2f9) [0x5571c7a1e2b9]
   24  python(PyEval_EvalCodeEx+0x44) [0x5571c7a1f1d4]
   25  python(PyEval_EvalCode+0x1c) [0x5571c7a1f1fc]
   26  python(+0x22bf44) [0x5571c7b34f44]
   27  python(PyRun_FileExFlags+0xa1) [0x5571c7b3f2b1]
   28  python(PyRun_SimpleFileExFlags+0x1c3) [0x5571c7b3f4a3]
   29  python(+0x2375d5) [0x5571c7b405d5]
   30  python(_Py_UnixMain+0x3c) [0x5571c7b406fc]
   31  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f29dcb7a840]
   32  python(+0x1dc3c0) [0x5571c7ae53c0]
===================
INFO:root:Model created
DEBUG:root:Random seed set to 100
INFO:root:Begin process dataset......
INFO:root:args.num_buckets: 1, num_workers: 8, rank: 5
INFO:root:400 files are found.
INFO:root:Model created
DEBUG:root:Random seed set to 100
INFO:root:Begin process dataset......
INFO:root:args.num_buckets: 1, num_workers: 8, rank: 2
INFO:root:400 files are found.
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 0 on node node123 exited on signal 11 (Segmentation fault).

yangshuo0323, Jan 30 '21 03:01

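Hello, we have not run into this error. Judging from the message "Segmentation fault: address not mapped to object at address 0x30", it looks like an out-of-bounds memory access. You could check the official MXNet issue tracker for similar reports.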
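When deciding which project's issue tracker to search, it can help to list the shared libraries that appear in the backtrace. The sketch below does that for frames in the format shown above (`<index>  <module>(<symbol>) [<address>]`). The helper name `modules_in_backtrace` is hypothetical, and the frame pattern is assumed from this one trace only.

```python
import os
import re
from collections import Counter

# One frame per line: index, module path, '(symbol+offset)', '[0xaddress]'.
FRAME = re.compile(r"^\s*(\d+)\s+(\S+?)\((.*?)\)\s+\[0x[0-9a-f]+\]")


def modules_in_backtrace(text):
    """Count how many frames of a backtrace fall in each module."""
    counts = Counter()
    for line in text.splitlines():
        match = FRAME.match(line)
        if match:
            # Keep only the file name, e.g. 'libmxnet.so'.
            counts[os.path.basename(match.group(2))] += 1
    return counts
```

For the trace in this thread, the non-Python frames fall in libucs.so.0, libpthread.so.0, libmxnet.so, and Horovod's mpi_lib extension, so the Horovod and UCX trackers may be worth searching in addition to MXNet's.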

Flowingsun007, Jan 30 '21 11:01

OK, thanks a lot!

yangshuo0323, Jan 31 '21 09:01