terminate called after throwing an instance of 'std::runtime_error'

Open HalFTeen opened this issue 2 years ago • 0 comments

My environment: 8 x RTX 2080 Ti (10 GB each), driver version 455.23.05. I get a crash after running ./bin/multi_gpu_gpt_example, following gpt_guide.md. My steps:

cmake -DSM=75 -DCMAKE_BUILD_TYPE=Release -DBUILD_MULTI_GPU=ON ..
make -j12
pip install -r ../examples/pytorch/gpt/requirement.txt
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P ../models
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P ../models
git clone https://huggingface.co/gpt2-xl
python ../examples/pytorch/gpt/utils/huggingface_gpt_convert.py -i gpt2-xl/ -o ../models/huggingface-models/c-model/gpt2-xl -i_g 1
./bin/gpt_gemm 8 1 32 25 64 6400 50257 0 1 0
./bin/multi_gpu_gpt_example

Then I get the crash:

Total ranks: 1.
P0 is running with 0 GPU.
Device GeForce RTX 2080 Ti
[FT][WARNING] Async cudaMalloc/Free is not supported before CUDA 11.2. Using Sync cudaMalloc/Free.Note this may lead to hang with NCCL kernels launched in parallel; if so, try NCCL_LAUNCH_MODE=GROUP
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
[FT][WARNING] file ../models/huggingface-models/c-model/gpt2-xl/1-gpu//model.prompt_table.intent_and_slot.weight.bin cannot be opened, loading model fails! 

[FT][WARNING] file ../models/huggingface-models/c-model/gpt2-xl/1-gpu//model.prompt_table.sentiment.weight.bin cannot be opened, loading model fails! 

[FT][WARNING] file ../models/huggingface-models/c-model/gpt2-xl/1-gpu//model.prompt_table.squad.weight.bin cannot be opened, loading model fails! 

after allocation    : free:  9.63 GB, total: 10.76 GB, used:  1.13 GB
terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] CUDA runtime error: invalid argument /data/wangjie/code/github/FasterTransformer/src/fastertransformer/utils/memory_utils.cu:113 

[server40:134837] *** Process received signal ***
[server40:134837] Signal: Aborted (6)
[server40:134837] Signal code:  (-6)
[server40:134837] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x7ff3797e66d0]
[server40:134837] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7ff378d24277]
[server40:134837] [ 2] /lib64/libc.so.6(abort+0x148)[0x7ff378d25968]
[server40:134837] [ 3] /lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0xbc)[0x7ff39d9253df]
[server40:134837] [ 4] /lib64/libstdc++.so.6(+0x9cb16)[0x7ff39d923b16]
[server40:134837] [ 5] /lib64/libstdc++.so.6(+0x9cb4c)[0x7ff39d923b4c]
[server40:134837] [ 6] /lib64/libstdc++.so.6(__cxa_rethrow+0x0)[0x7ff39d923d28]
[server40:134837] [ 7] ./bin/multi_gpu_gpt_example[0x9041da]
[server40:134837] [ 8] ./bin/multi_gpu_gpt_example[0x478a04]
[server40:134837] [ 9] ./bin/multi_gpu_gpt_example[0x4314f1]
[server40:134837] [10] ./bin/multi_gpu_gpt_example[0x407c1f]
[server40:134837] [11] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff378d10445]
[server40:134837] [12] ./bin/multi_gpu_gpt_example[0x42b157]
[server40:134837] *** End of error message ***
Aborted (core dumped)
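
To make the log above easier to triage, here is a rough sketch that pulls out the two things that seem to matter: the weight files reported as unopenable, and the source location of the CUDA runtime error. The function name `summarize_ft_log` and both regular expressions are my own and are based only on the message formats visible in this log; other FasterTransformer versions may print these lines differently.

```python
import re
from pathlib import PurePosixPath

# Patterns derived only from the log lines in this issue (assumption:
# other builds may format these messages differently).
MISSING_FILE = re.compile(r"\[FT\]\[WARNING\] file (\S+) cannot be opened")
CUDA_ERROR = re.compile(r"\[FT\]\[ERROR\] CUDA runtime error: (.+?) (\S+):(\d+)\s*$")


def summarize_ft_log(log_text):
    """Return (missing_files, cuda_error) extracted from a captured log.

    missing_files: file names from "cannot be opened" warnings.
    cuda_error: (message, source_file, line_number) or None if not found.
    """
    missing = []
    cuda_error = None
    for line in log_text.splitlines():
        m = MISSING_FILE.search(line)
        if m:
            # Keep only the file name, dropping the directory part.
            missing.append(PurePosixPath(m.group(1)).name)
            continue
        m = CUDA_ERROR.search(line)
        if m:
            message, source, lineno = m.groups()
            cuda_error = (message, PurePosixPath(source).name, int(lineno))
    return missing, cuda_error


# Two lines copied from the log in this issue.
sample = (
    "[FT][WARNING] file ../models/huggingface-models/c-model/gpt2-xl/1-gpu//"
    "model.prompt_table.squad.weight.bin cannot be opened, loading model fails! \n"
    "  what():  [FT][ERROR] CUDA runtime error: invalid argument "
    "/data/wangjie/code/github/FasterTransformer/src/fastertransformer/utils/"
    "memory_utils.cu:113 \n"
)
missing, cuda_error = summarize_ft_log(sample)
print(missing)
print(cuda_error)
```

Running it over the full output (for example, saved with `./bin/multi_gpu_gpt_example 2>&1 | tee run.log`) should list the three prompt table files and point at memory_utils.cu line 113, which is where I would start looking.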

Any suggestions are welcome. Thanks.

Sep 19 '23 03:09 HalFTeen