FasterTransformer
terminate called after throwing an instance of 'std::runtime_error'
My environment:
GPU: 8 x 2080 Ti, 10G each
Driver Version: 455.23.05
I get a crash when running ./bin/multi_gpu_gpt_example, following gpt_guide.md. The steps I ran:
- cmake -DSM=75 -DCMAKE_BUILD_TYPE=Release -DBUILD_MULTI_GPU=ON ..
- make -j12
- pip install -r ../examples/pytorch/gpt/requirement.txt
- wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P ../models
- wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P ../models
- git clone https://huggingface.co/gpt2-xl
- python ../examples/pytorch/gpt/utils/huggingface_gpt_convert.py -i gpt2-xl/ -o ../models/huggingface-models/c-model/gpt2-xl -i_g 1
- ./bin/gpt_gemm 8 1 32 25 64 6400 50257 0 1 0
- ./bin/multi_gpu_gpt_example
Then I get the crash:
Total ranks: 1.
P0 is running with 0 GPU.
Device GeForce RTX 2080 Ti
[FT][WARNING] Async cudaMalloc/Free is not supported before CUDA 11.2. Using Sync cudaMalloc/Free.Note this may lead to hang with NCCL kernels launched in parallel; if so, try NCCL_LAUNCH_MODE=GROUP
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
[FT][WARNING] file ../models/huggingface-models/c-model/gpt2-xl/1-gpu//model.prompt_table.intent_and_slot.weight.bin cannot be opened, loading model fails!
[FT][WARNING] file ../models/huggingface-models/c-model/gpt2-xl/1-gpu//model.prompt_table.sentiment.weight.bin cannot be opened, loading model fails!
[FT][WARNING] file ../models/huggingface-models/c-model/gpt2-xl/1-gpu//model.prompt_table.squad.weight.bin cannot be opened, loading model fails!
after allocation : free: 9.63 GB, total: 10.76 GB, used: 1.13 GB
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: invalid argument /data/wangjie/code/github/FasterTransformer/src/fastertransformer/utils/memory_utils.cu:113
[server40:134837] *** Process received signal ***
[server40:134837] Signal: Aborted (6)
[server40:134837] Signal code: (-6)
[server40:134837] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x7ff3797e66d0]
[server40:134837] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7ff378d24277]
[server40:134837] [ 2] /lib64/libc.so.6(abort+0x148)[0x7ff378d25968]
[server40:134837] [ 3] /lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0xbc)[0x7ff39d9253df]
[server40:134837] [ 4] /lib64/libstdc++.so.6(+0x9cb16)[0x7ff39d923b16]
[server40:134837] [ 5] /lib64/libstdc++.so.6(+0x9cb4c)[0x7ff39d923b4c]
[server40:134837] [ 6] /lib64/libstdc++.so.6(__cxa_rethrow+0x0)[0x7ff39d923d28]
[server40:134837] [ 7] ./bin/multi_gpu_gpt_example[0x9041da]
[server40:134837] [ 8] ./bin/multi_gpu_gpt_example[0x478a04]
[server40:134837] [ 9] ./bin/multi_gpu_gpt_example[0x4314f1]
[server40:134837] [10] ./bin/multi_gpu_gpt_example[0x407c1f]
[server40:134837] [11] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff378d10445]
[server40:134837] [12] ./bin/multi_gpu_gpt_example[0x42b157]
[server40:134837] *** End of error message ***
Aborted (core dumped)
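To help narrow this down, here is a minimal sketch that checks whether the prompt-table files named in the warnings exist in the converted model directory. The file names and the directory are copied from the log; the helper name `check_weight_files` is hypothetical and not part of FasterTransformer. I have not established whether these missing files are related to the crash at memory_utils.cu:113.

```python
from pathlib import Path

# File names copied from the [FT][WARNING] lines in the log.
WARNED_FILES = [
    "model.prompt_table.intent_and_slot.weight.bin",
    "model.prompt_table.sentiment.weight.bin",
    "model.prompt_table.squad.weight.bin",
]


def check_weight_files(model_dir, names=WARNED_FILES):
    """Return {file name: size in bytes, or None if the file is missing}."""
    root = Path(model_dir)
    result = {}
    for name in names:
        path = root / name
        # is_file() is False for missing paths, so no exception is raised.
        result[name] = path.stat().st_size if path.is_file() else None
    return result


if __name__ == "__main__":
    # Directory taken from the log; adjust if the converter wrote elsewhere.
    model_dir = "../models/huggingface-models/c-model/gpt2-xl/1-gpu"
    for name, size in check_weight_files(model_dir).items():
        status = "MISSING" if size is None else f"{size} bytes"
        print(f"{name}: {status}")
```

Run it from the build directory so the relative path resolves the same way it does for ./bin/multi_gpu_gpt_example.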
Any suggestions are welcome. Thanks.