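gluon-mxnet-bert: slow multi-node training
Introduction
Horovod is a library for multi-node distributed training with PyTorch, TensorFlow, and MXNet. Its inter-machine communication is built on NCCL or MPI, so NCCL and Open MPI usually need to be installed first, along with at least one deep learning framework, for example MXNet: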
python3 -m pip install gluonnlp==0.10.0 mxnet-cu102mkl==1.6.0.post0 -i https://mirror.baidu.com/pypi/simple
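Solution
With the dependencies in place, install Horovod. The NCCL-related variables must be set at install time; otherwise Horovod may not use NCCL for communication at run time, and training will be very slow. Detailed installation steps: https://github.com/horovod/horovod/blob/master/docs/gpus.rst
- Set the NCCL-related variables when installing Horovod: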
HOROVOD_WITH_MXNET=1 HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL python3 -m pip install --no-cache-dir horovod
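If Horovod is installed without these variables, jobs launched with horovodrun still run, but much more slowly, because communication goes over MPI instead of NCCL.
Other notes
- Add --log-level INFO or --log-level DEBUG at run time to see detailed output: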
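One way to confirm that an installed Horovod actually has NCCL enabled is to inspect the output of `horovodrun --check-build`, which recent Horovod releases provide. The sketch below parses that report; the `[X] NAME` / `[ ] NAME` layout and the `SAMPLE` text are assumptions about the output format, and `enabled_features` and `nccl_enabled` are hypothetical helper names, not part of Horovod.

```python
import re

# Illustrative excerpt in the layout `horovodrun --check-build` is assumed to print.
SAMPLE = """\
Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [X] MPI
"""


def enabled_features(check_build_output):
    """Return the set of feature names that are marked with [X]."""
    pattern = re.compile(r"^\s*\[(X| )\]\s+(\S+)\s*$")
    found = set()
    for line in check_build_output.splitlines():
        match = pattern.match(line)
        if match and match.group(1) == "X":
            found.add(match.group(2))
    return found


def nccl_enabled(check_build_output):
    """True if the report lists NCCL as an enabled feature."""
    return "NCCL" in enabled_features(check_build_output)


if __name__ == "__main__":
    print("NCCL enabled:", nccl_enabled(SAMPLE))
```

In practice the text would come from something like `subprocess.run(["horovodrun", "--check-build"], capture_output=True, text=True).stdout`. If NCCL is not marked, reinstalling with the variables above (and `--no-cache-dir`, so that a cached wheel is not reused) is the likely fix.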
horovodrun -np ${gpu_num} -H ${node_ip} -p ${PORT} \
--start-timeout 600 --log-level INFO \
python3 ${WORKSPACE}/run_pretraining.py ${CMD} 2>&1 | tee ${log_file}
- When launching through MPI, add -x NCCL_DEBUG=INFO to see the NCCL output:
mpirun -oversubscribe -np ${gpu_num} -H ${node_ip} \
-bind-to none -map-by numa \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
-mca plm_rsh_args "-p 22 -q -o StrictHostKeyChecking=no" \
-mca btl_tcp_if_include ib0 \
python3 ${WORKSPACE}/run_pretraining.py ${CMD} 2>&1 | tee ${log_file}
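With NCCL_DEBUG=INFO set, NCCL writes lines containing the marker `NCCL INFO` into the job output, so the log captured by `tee` can be scanned to check whether NCCL was initialised at all. This is a minimal sketch: the sample log lines are illustrative rather than taken from a real run, and `nccl_info_lines` and `nccl_was_used` are hypothetical helper names.

```python
def nccl_info_lines(log_text):
    """Return every log line that carries the NCCL INFO marker."""
    return [line for line in log_text.splitlines() if "NCCL INFO" in line]


def nccl_was_used(log_text):
    """True if the log contains at least one NCCL INFO line."""
    return len(nccl_info_lines(log_text)) > 0


if __name__ == "__main__":
    # Illustrative log text; a real run would read the file written by `tee`.
    sample_log = (
        "INFO:root:Model created\n"
        "node123:1234:1234 [0] NCCL INFO Bootstrap : Using ib0\n"
        "INFO:root:400 files are found.\n"
    )
    print("NCCL lines found:", len(nccl_info_lines(sample_log)))
```

If no such line appears even though NCCL_DEBUG=INFO was exported to all ranks, the job is probably communicating over MPI only, which matches the slow case described above.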
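We also observed the following during training:
- 1. With horovodrun, data loading drives all 48 cores to 100%; once training proper starts, usage drops (each of the 48 cores stays below roughly 50%).
- 2. mpirun -bind-to none behaves the same way. mpirun -bind-to core limits the cores used, so that data loading and training on 2 machines occupy only 16 cores, but it is somewhat slower than the unbound run (about 30% by rough estimate).
- 3. With horovodrun, only about 1 or 2 runs in a group of 7 finished normally, whereas mpirun was stable and almost never failed. The cause is undetermined (horovodrun may be less stable, or there may have been port communication errors).
mpirun options are documented at: https://www.open-mpi.org/doc/current/man1/mpirun.1.php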
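To see what a given -bind-to setting does to each process, the training script can print how many CPU cores the process is allowed to run on. The sketch below assumes Linux (for `os.sched_getaffinity`) and an Open MPI launch (which sets `OMPI_COMM_WORLD_RANK`); `usable_cpu_count` and `describe_binding` are hypothetical helper names.

```python
import os


def usable_cpu_count():
    """Number of CPU cores this process is allowed to run on."""
    try:
        # Linux only: reflects the affinity mask set by e.g. mpirun -bind-to core.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity: fall back to the total core count.
        return os.cpu_count() or 1


def describe_binding():
    """One-line summary suitable for logging at the start of a rank."""
    rank = os.environ.get("OMPI_COMM_WORLD_RANK", "?")
    return "rank {}: {} usable CPU core(s)".format(rank, usable_cpu_count())


if __name__ == "__main__":
    print(describe_binding())
```

Under -bind-to none each rank would be expected to report every core on the machine, while under -bind-to core the count should be much smaller, which is consistent with the lower CPU usage and slower data loading noted in item 2.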
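@Flowingsun007 Hi, I see you got multi-GPU training working with MXNet. I also want to train BERT on multiple GPUs with MXNet, but the mpirun command above (the one with -x NCCL_DEBUG=INFO) fails with a segmentation fault. Have you run into this?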
[node123:209584:0:209584] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
==== backtrace ====
0 /usr/lib/libucs.so.0(+0x1fcec) [0x7f26eaf57cec]
1 /usr/lib/libucs.so.0(+0x1ff64) [0x7f26eaf57f64]
2 /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f29dcf2dd44]
3 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f297f1dd564]
4 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f297f1e0790]
5 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f297f1d8ed1]
6 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f297f1b39d4]
7 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f270014e18f]
8 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f2700145d84]
9 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f29dbf9a9dd]
10 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f29dbf9a067]
11 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f29dd18b27e]
12 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f29dd18bcb4]
13 python(_PyObject_FastCallKeywords+0x48b) [0x5571c7a7600b]
14 python(_PyEval_EvalFrameDefault+0x51d1) [0x5571c7ada9a1]
15 python(_PyEval_EvalCodeWithName+0x2f9) [0x5571c7a1e2b9]
16 python(_PyFunction_FastCallKeywords+0x387) [0x5571c7a6e497]
17 python(_PyEval_EvalFrameDefault+0x14ea) [0x5571c7ad6cba]
18 python(_PyEval_EvalCodeWithName+0x2f9) [0x5571c7a1e2b9]
19 python(_PyFunction_FastCallKeywords+0x387) [0x5571c7a6e497]
20 python(_PyEval_EvalFrameDefault+0x14ea) [0x5571c7ad6cba]
21 python(_PyFunction_FastCallKeywords+0xfb) [0x5571c7a6e20b]
22 python(_PyEval_EvalFrameDefault+0x416) [0x5571c7ad5be6]
23 python(_PyEval_EvalCodeWithName+0x2f9) [0x5571c7a1e2b9]
24 python(PyEval_EvalCodeEx+0x44) [0x5571c7a1f1d4]
25 python(PyEval_EvalCode+0x1c) [0x5571c7a1f1fc]
26 python(+0x22bf44) [0x5571c7b34f44]
27 python(PyRun_FileExFlags+0xa1) [0x5571c7b3f2b1]
28 python(PyRun_SimpleFileExFlags+0x1c3) [0x5571c7b3f4a3]
29 python(+0x2375d5) [0x5571c7b405d5]
30 python(_Py_UnixMain+0x3c) [0x5571c7b406fc]
31 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f29dcb7a840]
32 python(+0x1dc3c0) [0x5571c7ae53c0]
===================
INFO:root:Model created
DEBUG:root:Random seed set to 100
INFO:root:Begin process dataset......
INFO:root:args.num_buckets: 1, num_workers: 8, rank: 5
INFO:root:400 files are found.
INFO:root:Model created
DEBUG:root:Random seed set to 100
INFO:root:Begin process dataset......
INFO:root:args.num_buckets: 1, num_workers: 8, rank: 2
INFO:root:400 files are found.
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 0 on node node123 exited on signal 11 (Segmentation fault).
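Hello, we have not run into this error ourselves. Judging from the error message, though:
Segmentation fault: address not mapped to object at address 0x30
it looks like an invalid (out-of-bounds) memory access? It may be worth searching the official MXNet issue tracker for similar reports.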
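A general aid when chasing a crash like this is Python's standard `faulthandler` module, which dumps the Python-level traceback of every thread when the process receives SIGSEGV. The sketch below writes one file per rank; it assumes an Open MPI launch (for `OMPI_COMM_WORLD_RANK`), and `enable_crash_traces` plus the `crash_rank<N>.log` file name are hypothetical choices. It does not fix the fault, it only shows which Python call led into the native code.

```python
import faulthandler
import os
import tempfile


def enable_crash_traces(log_dir=None):
    """Dump Python tracebacks of all threads to a per-rank file on a fatal signal."""
    if log_dir is None:
        log_dir = tempfile.gettempdir()
    # Open MPI sets OMPI_COMM_WORLD_RANK for each launched process.
    rank = os.environ.get("OMPI_COMM_WORLD_RANK", "0")
    path = os.path.join(log_dir, "crash_rank{}.log".format(rank))
    # The file must stay open for as long as faulthandler is enabled.
    handle = open(path, "w")
    faulthandler.enable(file=handle, all_threads=True)
    return handle


if __name__ == "__main__":
    crash_log = enable_crash_traces()
    print("crash traces will be written to", crash_log.name)
```

Calling this near the top of the training script, before Horovod is initialised, would be the natural place; the resulting file can then be read alongside the native backtrace above.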
OK, thanks a lot!