Segmentation fault: address not mapped to object at address 0x1eb46b2d3
Description
Tesla K80, CUDA 11.3, cuDNN 8.2.
root@abcbe3e329ca:/workspace/FasterTransformer/build# mpirun -n 2 --allow-run-as-root ./bin/gptj_example
Total ranks: 2.
Device NVIDIA Tesla K80
P1 is runing with 1 GPU.
[INFO] Setting tensor_para_size to 2
Device NVIDIA Tesla K80
P0 is runing with 0 GPU.
[INFO] Setting tensor_para_size to 2
[abcbe3e329ca:8770 :0:8770] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1eb46b2d3)
==== backtrace (tid: 8770) ====
0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7f8ac70cfd24]
1 /usr/local/ucx/lib/libucs.so.0(+0x27eff) [0x7f8ac70cfeff]
2 /usr/local/ucx/lib/libucs.so.0(+0x28234) [0x7f8ac70d0234]
3 /lib/x86_64-linux-gnu/libcuda.so(+0x1796f2) [0x7f8acd4b46f2]
4 /lib/x86_64-linux-gnu/libcuda.so(+0x201a76) [0x7f8acd53ca76]
5 /usr/local/cuda/lib64/libcudart.so.11.0(+0x163a5) [0x7f8ad9a513a5]
6 /usr/local/cuda/lib64/libcudart.so.11.0(cudaMemPoolSetAccess+0x1c6) [0x7f8ad9a900c6]
7 ./bin/gptj_example(+0x1cde6) [0x55a700662de6]
8 ./bin/gptj_example(+0x205d0) [0x55a7006665d0]
9 ./bin/gptj_example(+0xde77) [0x55a700653e77]
10 /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f8acec850b3]
11 ./bin/gptj_example(+0x1781e) [0x55a70065d81e]
=================================
[abcbe3e329ca:08770] *** Process received signal ***
[abcbe3e329ca:08770] Signal: Segmentation fault (11)
[abcbe3e329ca:08770] Signal code: (-6)
[abcbe3e329ca:08770] Failing at address: 0x2242
[abcbe3e329ca:08770] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f8acf1b33c0]
[abcbe3e329ca:08770] [ 1] /lib/x86_64-linux-gnu/libcuda.so(+0x1796f2)[0x7f8acd4b46f2]
[abcbe3e329ca:08770] [ 2] /lib/x86_64-linux-gnu/libcuda.so(+0x201a76)[0x7f8acd53ca76]
[abcbe3e329ca:08770] [ 3] /usr/local/cuda/lib64/libcudart.so.11.0(+0x163a5)[0x7f8ad9a513a5]
[abcbe3e329ca:08770] [ 4] /usr/local/cuda/lib64/libcudart.so.11.0(cudaMemPoolSetAccess+0x1c6)[0x7f8ad9a900c6]
[abcbe3e329ca:08770] [ 5] ./bin/gptj_example(+0x1cde6)[0x55a700662de6]
[abcbe3e329ca:08770] [ 6] ./bin/gptj_example(+0x205d0)[0x55a7006665d0]
[abcbe3e329ca:08770] [ 7] ./bin/gptj_example(+0xde77)[0x55a700653e77]
[abcbe3e329ca:08770] [ 8] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f8acec850b3]
[abcbe3e329ca:08770] [ 9] ./bin/gptj_example(+0x1781e)[0x55a70065d81e]
[abcbe3e329ca:08770] *** End of error message ***
[abcbe3e329ca:8769 :0:8769] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b6382d3)
==== backtrace (tid: 8769) ====
0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7f290729ed24]
1 /usr/local/ucx/lib/libucs.so.0(+0x27eff) [0x7f290729eeff]
2 /usr/local/ucx/lib/libucs.so.0(+0x28234) [0x7f290729f234]
3 /lib/x86_64-linux-gnu/libcuda.so(+0x1796f2) [0x7f290d6816f2]
4 /lib/x86_64-linux-gnu/libcuda.so(+0x201a76) [0x7f290d709a76]
5 /usr/local/cuda/lib64/libcudart.so.11.0(+0x163a5) [0x7f2919c1e3a5]
6 /usr/local/cuda/lib64/libcudart.so.11.0(cudaMemPoolSetAccess+0x1c6) [0x7f2919c5d0c6]
7 ./bin/gptj_example(+0x1cde6) [0x55d3d0de8de6]
8 ./bin/gptj_example(+0x205d0) [0x55d3d0dec5d0]
9 ./bin/gptj_example(+0xde77) [0x55d3d0dd9e77]
10 /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f290ee520b3]
11 ./bin/gptj_example(+0x1781e) [0x55d3d0de381e]
=================================
[abcbe3e329ca:08769] *** Process received signal ***
[abcbe3e329ca:08769] Signal: Segmentation fault (11)
[abcbe3e329ca:08769] Signal code: (-6)
[abcbe3e329ca:08769] Failing at address: 0x2241
[abcbe3e329ca:08769] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f290f3803c0]
[abcbe3e329ca:08769] [ 1] /lib/x86_64-linux-gnu/libcuda.so(+0x1796f2)[0x7f290d6816f2]
[abcbe3e329ca:08769] [ 2] /lib/x86_64-linux-gnu/libcuda.so(+0x201a76)[0x7f290d709a76]
[abcbe3e329ca:08769] [ 3] /usr/local/cuda/lib64/libcudart.so.11.0(+0x163a5)[0x7f2919c1e3a5]
[abcbe3e329ca:08769] [ 4] /usr/local/cuda/lib64/libcudart.so.11.0(cudaMemPoolSetAccess+0x1c6)[0x7f2919c5d0c6]
[abcbe3e329ca:08769] [ 5] ./bin/gptj_example(+0x1cde6)[0x55d3d0de8de6]
[abcbe3e329ca:08769] [ 6] ./bin/gptj_example(+0x205d0)[0x55d3d0dec5d0]
[abcbe3e329ca:08769] [ 7] ./bin/gptj_example(+0xde77)[0x55d3d0dd9e77]
[abcbe3e329ca:08769] [ 8] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f290ee520b3]
[abcbe3e329ca:08769] [ 9] ./bin/gptj_example(+0x1781e)[0x55d3d0de381e]
[abcbe3e329ca:08769] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node abcbe3e329ca exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Reproduction Steps
nvidia-docker run -ti --rm nvcr.io/nvidia/pytorch:21.07-py3 bash
git clone https://github.com/NVIDIA/FasterTransformer.git
mkdir -p FasterTransformer/build
cd FasterTransformer/build
git submodule init && git submodule update
pip3 install fire jax jaxlib
cmake -DSM=37 -DCMAKE_BUILD_TYPE=Release -DBUILD_MULTI_GPU=ON ..  # SM 37 = Tesla K80, compute capability 3.7; see the probe after these steps
make -j
wget https://mystic.the-eye.eu/public/AI/GPT-J-6B/step_383500_slim.tar.zstd
tar -axf step_383500_slim.tar.zstd
python3 ../examples/pytorch/gptj/utils/gptj_ckpt_convert.py --output-dir ../models/j6b_ckpt --ckpt-dir ./step_383500/ --n-inference-gpus 2
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P ../models
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P ../models
./bin/gpt_gemm 1 1 32 16 128 16384 50400 1 1
mpirun -n 2 --allow-run-as-root ./bin/gptj_example
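For reference, a minimal probe (a sketch; the file name check_device.cu is hypothetical) that prints each device's compute capability, which is where the -DSM=37 value above comes from; the Tesla K80 reports 3.7:

// check_device.cu -- build with: nvcc check_device.cu -o check_device
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // e.g. a Tesla K80 prints "compute capability 3.7 -> -DSM=37"
        printf("Device %d: %s, compute capability %d.%d -> -DSM=%d%d\n",
               i, prop.name, prop.major, prop.minor, prop.major, prop.minor);
    }
    return 0;
}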
Sorry, we cannot reproduce this problem on our side. Can you try using gdb or adding some debug messages to find the reason?
(gdb) run
Starting program: /workspace/FasterTransformer/build/bin/gptj_example
warning: Error disabling address space randomization: Operation not permitted
warning: Probes-based dynamic linker interface failed.
Reverting to original interface.
process 9390 is executing new program: /workspace/FasterTransformer/build/bin/gptj_example
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 9394]
[New Thread 0x7f1035d91000 (LWP 9398)]
[New Thread 0x7f10353a7000 (LWP 9399)]
Total ranks: 1.
[New Thread 0x7f102fdbb000 (LWP 9400)]
Device NVIDIA Tesla K80
P0 is runing with 0 GPU.
[INFO] Setting tensor_para_size to 1
[New Thread 0x7f102f592000 (LWP 9401)]
[New Thread 0x7f102ebd2000 (LWP 9402)]
[New Thread 0x7f102e3d1000 (LWP 9403)]
[Thread 0x7f102ebd2000 (LWP 9402) exited]
[New Thread 0x7f1023fff000 (LWP 9404)]
[New Thread 0x7f10237fe000 (LWP 9405)]
[Thread 0x7f102e3d1000 (LWP 9403) exited]
[New Thread 0x7f1022ffd000 (LWP 9406)]
[New Thread 0x7f10227fc000 (LWP 9407)]
Thread 1 "gptj_example" received signal SIGSEGV, Segmentation fault.
0x00007fac15af66f2 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
(gdb) backtrace
#0 0x00007f1035f606f2 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#1 0x00007f1035fe8a76 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2 0x00007f10424fa3a5 in ?? () from /usr/local/cuda/lib64/libcudart.so.11.0
#3 0x00007f10425390c6 in cudaMemPoolSetAccess () from /usr/local/cuda/lib64/libcudart.so.11.0
#4 0x000055e32289dde6 in fastertransformer::Allocator<(fastertransformer::AllocatorType)0>::Allocator(int) ()
#5 0x000055e3228a15d0 in void gptj_example<float>(INIReader) ()
#6 0x000055e32288ee77 in main ()
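The backtrace ends in cudaMemPoolSetAccess, called from the FasterTransformer Allocator constructor. Stream-ordered memory pools require device support that the K80 (sm_37) lacks, which is one plausible reason this driver call misbehaves here. A minimal probe to confirm (a sketch; check_mempool.cu is a hypothetical file name):

// check_mempool.cu -- build with: nvcc check_mempool.cu -o check_mempool
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        int supported = 0;
        // 0 here means cudaDeviceGetDefaultMemPool/cudaMemPoolSetAccess
        // cannot be used on this device.
        cudaDeviceGetAttribute(&supported, cudaDevAttrMemoryPoolsSupported, i);
        printf("Device %d: memory pools supported = %d\n", i, supported);
    }
    return 0;
}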
Does /usr/local/cuda/lib64/libcudart.so.11.0 link to another file?
root@abcbe3e329ca:/workspace/FasterTransformer/build# ls -l /usr/local/cuda/lib64/libcudart.so.11.0
lrwxrwxrwx 1 root root 20 May 28 2021 /usr/local/cuda/lib64/libcudart.so.11.0 -> libcudart.so.11.4.43
root@abcbe3e329ca:/workspace/FasterTransformer/build#
How about defining the macro CUDA_MEMORY_POOL_DISABLED in allocator.h directly?
I don't understand what you mean.
Change https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/utils/allocator.h#L126 to #if 1 directly.
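For illustration, assuming the guard at that line is the #if defined(CUDA_MEMORY_POOL_DISABLED) check around the memory-pool setup, the edit looks roughly like this (a sketch, not the exact file contents):

// src/fastertransformer/utils/allocator.h, near L126
#if 1  // was: #if defined(CUDA_MEMORY_POOL_DISABLED)
    // fall back to synchronous cudaMalloc/cudaFree (prints the
    // "Async cudaMalloc/Free is not supported" warning)
#else
    // memory-pool branch: cudaDeviceGetDefaultMemPool +
    // cudaMemPoolSetAccess, the call that crashes on the K80
#endif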
NCCL_LAUNCH_MODE=GROUP mpirun -n 2 --allow-run-as-root ./bin/gptj_example
Total ranks: 2.
Device NVIDIA Tesla K80
P1 is runing with 1 GPU.
[INFO] Setting tensor_para_size to 2
Device NVIDIA Tesla K80
P0 is runing with 0 GPU.
[INFO] Setting tensor_para_size to 2
[FT][WARNING] Async cudaMalloc/Free is not supported before CUDA 11.2. Using Sync cudaMalloc/Free.Note this may lead to hang with NCCL kernels launched in parallel; if so, try NCCL_LAUNCH_MODE=GROUP
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: operation not supported /workspace/FasterTransformer/src/fastertransformer/utils/allocator.h:181
[abcbe3e329ca:13275] *** Process received signal ***
[abcbe3e329ca:13275] Signal: Aborted (6)
[abcbe3e329ca:13275] Signal code: (-6)
[abcbe3e329ca:13275] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fc850260420]
[abcbe3e329ca:13275] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fc84fd4f00b]
[abcbe3e329ca:13275] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fc84fd2e859]
[abcbe3e329ca:13275] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7fc850106911]
[abcbe3e329ca:13275] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7fc85011238c]
[abcbe3e329ca:13275] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7fc8501123f7]
[abcbe3e329ca:13275] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7fc8501126a9]
[abcbe3e329ca:13275] [ 7] ./bin/gptj_example(+0x19389)[0x56484a036389]
[abcbe3e329ca:13275] [ 8] ./bin/gptj_example(+0x1e1f6)[0x56484a03b1f6]
[abcbe3e329ca:13275] [ 9] ./bin/gptj_example(+0x8de63)[0x56484a0aae63]
[abcbe3e329ca:13275] [10] ./bin/gptj_example(+0x201e1)[0x56484a03d1e1]
[abcbe3e329ca:13275] [11] ./bin/gptj_example(+0xde17)[0x56484a02ae17]
[abcbe3e329ca:13275] [12] terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: operation not supported /workspace/FasterTransformer/src/fastertransformer/utils/allocator.h:181
[abcbe3e329ca:13274] *** Process received signal ***
mpirun -n 2 --allow-run-as-root ./bin/gptj_example
Total ranks: 2.
Device NVIDIA Tesla K80
P0 is runing with 0 GPU.
[INFO] Setting tensor_para_size to 2
Device NVIDIA Tesla K80
P1 is runing with 1 GPU.
[INFO] Setting tensor_para_size to 2
[FT][WARNING] Async cudaMalloc/Free is not supported before CUDA 11.2. Using Sync cudaMalloc/Free.Note this may lead to hang with NCCL kernels launched in parallel; if so, try NCCL_LAUNCH_MODE=GROUP
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: operation not supported /workspace/FasterTransformer/src/fastertransformer/utils/allocator.h:181
[abcbe3e329ca:13250] *** Process received signal ***
You also need to change https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/utils/allocator.h#L180 and https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/utils/allocator.h#L203 to #if 0.
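The point of the two additional edits is to force malloc/free through the synchronous CUDA allocator as well. A stand-alone sketch (fallback_alloc.cu is a hypothetical file name) of the behaviour involved: cudaMallocAsync fails on devices without memory-pool support, which is the "operation not supported" error raised at allocator.h:181:

// fallback_alloc.cu -- build with: nvcc fallback_alloc.cu -o fallback_alloc
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void* ptr = nullptr;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaError_t err = cudaMallocAsync(&ptr, 1 << 20, stream);
    if (err != cudaSuccess) {
        // On the K80 this reports "operation not supported";
        // fall back to the synchronous allocator instead.
        printf("cudaMallocAsync failed (%s), using cudaMalloc\n",
               cudaGetErrorString(err));
        err = cudaMalloc(&ptr, 1 << 20);
    }
    printf("allocation %s\n", err == cudaSuccess ? "succeeded" : "failed");
    if (ptr) cudaFree(ptr);
    cudaStreamDestroy(stream);
    return 0;
}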
It works, but now I get a CUDA OOM error. Is there any way to load the network in parts across multiple GPUs?
As I recall, the K80 has 24 GB of memory, and you are using 2-way tensor parallelism, so it should be able to load the model. Can you post the error?
Yes, but as two 12 GB chips. It looks like the script loads a full copy of the network onto each chip.
Can you post the log?
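For context while gathering that log: the K80 exposes its two 12 GB dies as two separate CUDA devices, so each tensor-parallel rank only has about 12 GB to work with. A minimal probe (meminfo.cu is a hypothetical file name) to see what each rank actually gets:

// meminfo.cu -- build with: nvcc meminfo.cu -o meminfo
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        size_t free_b = 0, total_b = 0;
        cudaSetDevice(i);
        cudaMemGetInfo(&free_b, &total_b);
        // a K80 board typically shows up as two devices of ~12 GiB each
        printf("Device %d: %.1f GiB free / %.1f GiB total\n",
               i, free_b / 1073741824.0, total_b / 1073741824.0);
    }
    return 0;
}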