Segmentation fault: address not mapped to object at address 0x1eb46b2d3
Description
Tesla K80, CUDA 11.3, cuDNN 8.2.
root@abcbe3e329ca:/workspace/FasterTransformer/build# mpirun -n 2 --allow-run-as-root ./bin/gptj_example
Total ranks: 2.
Device NVIDIA Tesla K80
P1 is runing with 1 GPU.
[INFO] Setting tensor_para_size to 2
Device NVIDIA Tesla K80
P0 is runing with 0 GPU.
[INFO] Setting tensor_para_size to 2
[abcbe3e329ca:8770 :0:8770] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1eb46b2d3)
==== backtrace (tid: 8770) ====
0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7f8ac70cfd24]
1 /usr/local/ucx/lib/libucs.so.0(+0x27eff) [0x7f8ac70cfeff]
2 /usr/local/ucx/lib/libucs.so.0(+0x28234) [0x7f8ac70d0234]
3 /lib/x86_64-linux-gnu/libcuda.so(+0x1796f2) [0x7f8acd4b46f2]
4 /lib/x86_64-linux-gnu/libcuda.so(+0x201a76) [0x7f8acd53ca76]
5 /usr/local/cuda/lib64/libcudart.so.11.0(+0x163a5) [0x7f8ad9a513a5]
6 /usr/local/cuda/lib64/libcudart.so.11.0(cudaMemPoolSetAccess+0x1c6) [0x7f8ad9a900c6]
7 ./bin/gptj_example(+0x1cde6) [0x55a700662de6]
8 ./bin/gptj_example(+0x205d0) [0x55a7006665d0]
9 ./bin/gptj_example(+0xde77) [0x55a700653e77]
10 /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f8acec850b3]
11 ./bin/gptj_example(+0x1781e) [0x55a70065d81e]
=================================
[abcbe3e329ca:08770] *** Process received signal ***
[abcbe3e329ca:08770] Signal: Segmentation fault (11)
[abcbe3e329ca:08770] Signal code: (-6)
[abcbe3e329ca:08770] Failing at address: 0x2242
[abcbe3e329ca:08770] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f8acf1b33c0]
[abcbe3e329ca:08770] [ 1] /lib/x86_64-linux-gnu/libcuda.so(+0x1796f2)[0x7f8acd4b46f2]
[abcbe3e329ca:08770] [ 2] /lib/x86_64-linux-gnu/libcuda.so(+0x201a76)[0x7f8acd53ca76]
[abcbe3e329ca:08770] [ 3] /usr/local/cuda/lib64/libcudart.so.11.0(+0x163a5)[0x7f8ad9a513a5]
[abcbe3e329ca:08770] [ 4] /usr/local/cuda/lib64/libcudart.so.11.0(cudaMemPoolSetAccess+0x1c6)[0x7f8ad9a900c6]
[abcbe3e329ca:08770] [ 5] ./bin/gptj_example(+0x1cde6)[0x55a700662de6]
[abcbe3e329ca:08770] [ 6] ./bin/gptj_example(+0x205d0)[0x55a7006665d0]
[abcbe3e329ca:08770] [ 7] ./bin/gptj_example(+0xde77)[0x55a700653e77]
[abcbe3e329ca:08770] [ 8] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f8acec850b3]
[abcbe3e329ca:08770] [ 9] ./bin/gptj_example(+0x1781e)[0x55a70065d81e]
[abcbe3e329ca:08770] *** End of error message ***
[abcbe3e329ca:8769 :0:8769] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b6382d3)
==== backtrace (tid: 8769) ====
0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7f290729ed24]
1 /usr/local/ucx/lib/libucs.so.0(+0x27eff) [0x7f290729eeff]
2 /usr/local/ucx/lib/libucs.so.0(+0x28234) [0x7f290729f234]
3 /lib/x86_64-linux-gnu/libcuda.so(+0x1796f2) [0x7f290d6816f2]
4 /lib/x86_64-linux-gnu/libcuda.so(+0x201a76) [0x7f290d709a76]
5 /usr/local/cuda/lib64/libcudart.so.11.0(+0x163a5) [0x7f2919c1e3a5]
6 /usr/local/cuda/lib64/libcudart.so.11.0(cudaMemPoolSetAccess+0x1c6) [0x7f2919c5d0c6]
7 ./bin/gptj_example(+0x1cde6) [0x55d3d0de8de6]
8 ./bin/gptj_example(+0x205d0) [0x55d3d0dec5d0]
9 ./bin/gptj_example(+0xde77) [0x55d3d0dd9e77]
10 /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f290ee520b3]
11 ./bin/gptj_example(+0x1781e) [0x55d3d0de381e]
=================================
[abcbe3e329ca:08769] *** Process received signal ***
[abcbe3e329ca:08769] Signal: Segmentation fault (11)
[abcbe3e329ca:08769] Signal code: (-6)
[abcbe3e329ca:08769] Failing at address: 0x2241
[abcbe3e329ca:08769] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f290f3803c0]
[abcbe3e329ca:08769] [ 1] /lib/x86_64-linux-gnu/libcuda.so(+0x1796f2)[0x7f290d6816f2]
[abcbe3e329ca:08769] [ 2] /lib/x86_64-linux-gnu/libcuda.so(+0x201a76)[0x7f290d709a76]
[abcbe3e329ca:08769] [ 3] /usr/local/cuda/lib64/libcudart.so.11.0(+0x163a5)[0x7f2919c1e3a5]
[abcbe3e329ca:08769] [ 4] /usr/local/cuda/lib64/libcudart.so.11.0(cudaMemPoolSetAccess+0x1c6)[0x7f2919c5d0c6]
[abcbe3e329ca:08769] [ 5] ./bin/gptj_example(+0x1cde6)[0x55d3d0de8de6]
[abcbe3e329ca:08769] [ 6] ./bin/gptj_example(+0x205d0)[0x55d3d0dec5d0]
[abcbe3e329ca:08769] [ 7] ./bin/gptj_example(+0xde77)[0x55d3d0dd9e77]
[abcbe3e329ca:08769] [ 8] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f290ee520b3]
[abcbe3e329ca:08769] [ 9] ./bin/gptj_example(+0x1781e)[0x55d3d0de381e]
[abcbe3e329ca:08769] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node abcbe3e329ca exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Reproduction Steps
nvidia-docker run -ti --rm nvcr.io/nvidia/pytorch:21.07-py3 bash
git clone https://github.com/NVIDIA/FasterTransformer.git
mkdir -p FasterTransformer/build
cd FasterTransformer/build
git submodule init && git submodule update
pip3 install fire jax jaxlib
cmake -DSM=37 -DCMAKE_BUILD_TYPE=Release -DBUILD_MULTI_GPU=ON ..  # SM 37 = Tesla K80, compute capability 3.7; see the probe after these steps
make -j
wget https://mystic.the-eye.eu/public/AI/GPT-J-6B/step_383500_slim.tar.zstd
tar -axf step_383500_slim.tar.zstd
python3 ../examples/pytorch/gptj/utils/gptj_ckpt_convert.py --output-dir ../models/j6b_ckpt --ckpt-dir ./step_383500/ --n-inference-gpus 2
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P ../models
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P ../models
./bin/gpt_gemm 1 1 32 16 128 16384 50400 1 1
mpirun -n 2 --allow-run-as-root ./bin/gptj_example
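For reference, a minimal probe (a sketch; the file name check_device.cu is hypothetical) that prints each device's compute capability, which is where the -DSM=37 value above comes from; the Tesla K80 reports 3.7:

// check_device.cu -- build with: nvcc check_device.cu -o check_device
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // e.g. a Tesla K80 prints "compute capability 3.7 -> -DSM=37"
        printf("Device %d: %s, compute capability %d.%d -> -DSM=%d%d\n",
               i, prop.name, prop.major, prop.minor, prop.major, prop.minor);
    }
    return 0;
}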
Sorry, we cannot reproduce this problem on our side. Can you try using gdb or adding some debug messages to find the reason?
(gdb) run
Starting program: /workspace/FasterTransformer/build/bin/gptj_example
warning: Error disabling address space randomization: Operation not permitted
warning: Probes-based dynamic linker interface failed.
Reverting to original interface.
process 9390 is executing new program: /workspace/FasterTransformer/build/bin/gptj_example
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 9394]
[New Thread 0x7f1035d91000 (LWP 9398)]
[New Thread 0x7f10353a7000 (LWP 9399)]
Total ranks: 1.
[New Thread 0x7f102fdbb000 (LWP 9400)]
Device NVIDIA Tesla K80
P0 is runing with 0 GPU.
[INFO] Setting tensor_para_size to 1
[New Thread 0x7f102f592000 (LWP 9401)]
[New Thread 0x7f102ebd2000 (LWP 9402)]
[New Thread 0x7f102e3d1000 (LWP 9403)]
[Thread 0x7f102ebd2000 (LWP 9402) exited]
[New Thread 0x7f1023fff000 (LWP 9404)]
[New Thread 0x7f10237fe000 (LWP 9405)]
[Thread 0x7f102e3d1000 (LWP 9403) exited]
[New Thread 0x7f1022ffd000 (LWP 9406)]
[New Thread 0x7f10227fc000 (LWP 9407)]
Thread 1 "gptj_example" received signal SIGSEGV, Segmentation fault.
0x00007fac15af66f2 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
(gdb) backtrace
#0 0x00007f1035f606f2 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#1 0x00007f1035fe8a76 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2 0x00007f10424fa3a5 in ?? () from /usr/local/cuda/lib64/libcudart.so.11.0
#3 0x00007f10425390c6 in cudaMemPoolSetAccess () from /usr/local/cuda/lib64/libcudart.so.11.0
#4 0x000055e32289dde6 in fastertransformer::Allocator<(fastertransformer::AllocatorType)0>::Allocator(int) ()
#5 0x000055e3228a15d0 in void gptj_example<float>(INIReader) ()
#6 0x000055e32288ee77 in main ()
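The backtrace ends in cudaMemPoolSetAccess, called from the FasterTransformer Allocator constructor. Stream-ordered memory pools require device support that the K80 (sm_37) lacks, which is one plausible reason this driver call misbehaves here. A minimal probe to confirm (a sketch; check_mempool.cu is a hypothetical file name):

// check_mempool.cu -- build with: nvcc check_mempool.cu -o check_mempool
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        int supported = 0;
        // 0 here means cudaDeviceGetDefaultMemPool/cudaMemPoolSetAccess
        // cannot be used on this device.
        cudaDeviceGetAttribute(&supported, cudaDevAttrMemoryPoolsSupported, i);
        printf("Device %d: memory pools supported = %d\n", i, supported);
    }
    return 0;
}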
Does /usr/local/cuda/lib64/libcudart.so.11.0 link to another file?
root@abcbe3e329ca:/workspace/FasterTransformer/build# ls -l /usr/local/cuda/lib64/libcudart.so.11.0
lrwxrwxrwx 1 root root 20 May 28 2021 /usr/local/cuda/lib64/libcudart.so.11.0 -> libcudart.so.11.4.43
root@abcbe3e329ca:/workspace/FasterTransformer/build#
How about defining the macro CUDA_MEMORY_POOL_DISABLED in allocator.h directly?
I don't understand what you mean.
Change https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/utils/allocator.h#L126 to #if 1 directly.
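For illustration, assuming the guard at that line is the #if defined(CUDA_MEMORY_POOL_DISABLED) check around the memory-pool setup, the edit looks roughly like this (a sketch, not the exact file contents):

// src/fastertransformer/utils/allocator.h, near L126
#if 1  // was: #if defined(CUDA_MEMORY_POOL_DISABLED)
    // fall back to synchronous cudaMalloc/cudaFree (prints the
    // "Async cudaMalloc/Free is not supported" warning)
#else
    // memory-pool branch: cudaDeviceGetDefaultMemPool +
    // cudaMemPoolSetAccess, the call that crashes on the K80
#endif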
NCCL_LAUNCH_MODE=GROUP mpirun -n 2 --allow-run-as-root ./bin/gptj_example
Total ranks: 2.
Device NVIDIA Tesla K80
P1 is runing with 1 GPU.
[INFO] Setting tensor_para_size to 2
Device NVIDIA Tesla K80
P0 is runing with 0 GPU.
[INFO] Setting tensor_para_size to 2
[FT][WARNING] Async cudaMalloc/Free is not supported before CUDA 11.2. Using Sync cudaMalloc/Free.Note this may lead to hang with NCCL kernels launched in parallel; if so, try NCCL_LAUNCH_MODE=GROUP
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: operation not supported /workspace/FasterTransformer/src/fastertransformer/utils/allocator.h:181
[abcbe3e329ca:13275] *** Process received signal ***
[abcbe3e329ca:13275] Signal: Aborted (6)
[abcbe3e329ca:13275] Signal code: (-6)
[abcbe3e329ca:13275] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fc850260420]
[abcbe3e329ca:13275] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fc84fd4f00b]
[abcbe3e329ca:13275] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fc84fd2e859]
[abcbe3e329ca:13275] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7fc850106911]
[abcbe3e329ca:13275] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7fc85011238c]
[abcbe3e329ca:13275] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7fc8501123f7]
[abcbe3e329ca:13275] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7fc8501126a9]
[abcbe3e329ca:13275] [ 7] ./bin/gptj_example(+0x19389)[0x56484a036389]
[abcbe3e329ca:13275] [ 8] ./bin/gptj_example(+0x1e1f6)[0x56484a03b1f6]
[abcbe3e329ca:13275] [ 9] ./bin/gptj_example(+0x8de63)[0x56484a0aae63]
[abcbe3e329ca:13275] [10] ./bin/gptj_example(+0x201e1)[0x56484a03d1e1]
[abcbe3e329ca:13275] [11] ./bin/gptj_example(+0xde17)[0x56484a02ae17]
[abcbe3e329ca:13275] [12] terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: operation not supported /workspace/FasterTransformer/src/fastertransformer/utils/allocator.h:181
[abcbe3e329ca:13274] *** Process received signal ***
mpirun -n 2 --allow-run-as-root ./bin/gptj_example
Total ranks: 2.
Device NVIDIA Tesla K80
P0 is runing with 0 GPU.
[INFO] Setting tensor_para_size to 2
Device NVIDIA Tesla K80
P1 is runing with 1 GPU.
[INFO] Setting tensor_para_size to 2
[FT][WARNING] Async cudaMalloc/Free is not supported before CUDA 11.2. Using Sync cudaMalloc/Free.Note this may lead to hang with NCCL kernels launched in parallel; if so, try NCCL_LAUNCH_MODE=GROUP
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: operation not supported /workspace/FasterTransformer/src/fastertransformer/utils/allocator.h:181
[abcbe3e329ca:13250] *** Process received signal ***
You also need to change https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/utils/allocator.h#L180 and https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/utils/allocator.h#L203 to #if 0.
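The point of the two additional edits is to force malloc/free through the synchronous CUDA allocator as well. A stand-alone sketch (fallback_alloc.cu is a hypothetical file name) of the behaviour involved: cudaMallocAsync fails on devices without memory-pool support, which is the "operation not supported" error raised at allocator.h:181:

// fallback_alloc.cu -- build with: nvcc fallback_alloc.cu -o fallback_alloc
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void* ptr = nullptr;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaError_t err = cudaMallocAsync(&ptr, 1 << 20, stream);
    if (err != cudaSuccess) {
        // On the K80 this reports "operation not supported";
        // fall back to the synchronous allocator instead.
        printf("cudaMallocAsync failed (%s), using cudaMalloc\n",
               cudaGetErrorString(err));
        err = cudaMalloc(&ptr, 1 << 20);
    }
    printf("allocation %s\n", err == cudaSuccess ? "succeeded" : "failed");
    if (ptr) cudaFree(ptr);
    cudaStreamDestroy(stream);
    return 0;
}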
It works, but now I get a CUDA OOM error. Is there any way to load the network in parts across multiple GPUs?
As I recall, the K80 has 24 GB of memory, and you are using 2-way tensor parallelism, so it should be able to load the model. Can you post the error?
Yes, but as two 12 GB chips. It looks like the script loads a full copy of the network onto each chip.
Can you post the log?
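For context while gathering that log: the K80 exposes its two 12 GB dies as two separate CUDA devices, so each tensor-parallel rank only has about 12 GB to work with. A minimal probe (meminfo.cu is a hypothetical file name) to see what each rank actually gets:

// meminfo.cu -- build with: nvcc meminfo.cu -o meminfo
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        size_t free_b = 0, total_b = 0;
        cudaSetDevice(i);
        cudaMemGetInfo(&free_b, &total_b);
        // a K80 board typically shows up as two devices of ~12 GiB each
        printf("Device %d: %.1f GiB free / %.1f GiB total\n",
               i, free_b / 1073741824.0, total_b / 1073741824.0);
    }
    return 0;
}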