FasterTransformer
CUBLAS_STATUS_INTERNAL_ERROR for OPT-13B when input token length > 230 (with generated token length = 20)
Description
Master branch, V100 GPU.
GPU driver version: 470.82.01, CUDA version: 11.7
Reproduced Steps
1. Download the OPT-13B weights from Hugging Face.
2. Convert the weights to the FT format.
3. Generate gemm_config.in with "../../../build/bin/gpt_gemm 8 1 500 40 128 20480 50272 1 2".
4. Modify opt_summarization.py so that prompt length == 230, max_len = 20, beam = 1, batch_size = 8 (a sketch of this change follows the error output below).
5. Run and get the following error:
File "/mnt/noll/pytorch/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/noll/FasterTransformer/examples/pytorch/gpt/utils/gpt.py", line 406, in forward
outputs = self.model.forward(start_ids,
RuntimeError: [FT][ERROR] Assertion fail: /mnt/noll/FasterTransformer/src/fastertransformer/th_op/multi_gpu_gpt/ParallelGptOp.h:335
[FT][ERROR] [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_INTERNAL_ERROR /mnt/noll/FasterTransformer/src/fastertransformer/utils/cublasMMWrapper.cc:108
Traceback (most recent call last):
File "yx_test_opt.py", line 228, in <module>
main()
File "yx_test_opt.py", line 220, in main
summary, _ = summarize_ft(None)
File "yx_test_opt.py", line 200, in summarize_ft
output, ft_output_len = gpt(line_encoded, torch.IntTensor([len(line_encoded[0])]),
File "/mnt/noll/pytorch/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/noll/FasterTransformer/examples/pytorch/gpt/utils/gpt.py", line 406, in forward
outputs = self.model.forward(start_ids,
RuntimeError: [FT][ERROR] Assertion fail: /mnt/noll/FasterTransformer/src/fastertransformer/th_op/multi_gpu_gpt/ParallelGptOp.h:335
6. If I set batch_size to 1, I get a different error instead:
an illegal memory access was encountered /mnt/noll/FasterTransformer/src/fastertransformer/utils/memory_utils.cu:96
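As referenced in step 4 above, the change to opt_summarization.py looks roughly like the following. This is a minimal sketch, not the actual diff: the names gpt, tokenizer, and line are assumed from the traceback and the example script, and the positional call (ids, lengths, output_len, beam) is inferred from the gpt(...) call shown above; the real script may pass additional sampling arguments.

import torch

def run_ft_batch(gpt, tokenizer, line: str):
    """Reproduce the failing configuration: prompt length 230, 20 generated
    tokens, beam 1, batch size 8. 'gpt' and 'tokenizer' are the objects
    built by opt_summarization.py (hypothetical names)."""
    PROMPT_LEN, MAX_LEN, BEAM, BATCH_SIZE = 230, 20, 1, 8
    # Truncate the tokenized article to exactly PROMPT_LEN tokens,
    # then replicate it BATCH_SIZE times to form the batch.
    ids = tokenizer.encode(line, return_tensors="pt")[:, :PROMPT_LEN]
    ids = ids.repeat(BATCH_SIZE, 1).to(torch.int32)
    lengths = torch.IntTensor([PROMPT_LEN] * BATCH_SIZE)
    # Same call shape as in the traceback above.
    return gpt(ids, lengths, MAX_LEN, BEAM)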
Please provide the full reproduction steps, including the docker image, build scripts, conversion scripts, and how you modified opt_summarization.py. Please don't just say "convert the weight to FT weight"; provide the script you used.
Besides, your gpt_gemm command assumes an input length of 500, not the 230 you actually run. Also, I don't see tensor parallelism when you run opt_summarization.py, yet you set tensor_para_size to 2 when you run gpt_gemm.
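For context, my understanding of gpt_gemm's positional arguments is: batch_size, beam_width, max_input_len, head_num, size_per_head, inter_size, vocab_size, data_type (1 = fp16), tensor_para_size; verify this against your FT version's usage string. Under that assumption, an invocation matching the actual run (input length 230, batch 8, single GPU) would be:

import subprocess

# Assumed argument order: batch_size beam_width max_input_len head_num
# size_per_head inter_size vocab_size data_type tensor_para_size.
# OPT-13B shape: head_num=40, size_per_head=128, inter_size=20480,
# vocab_size=50272; data_type=1 selects fp16; tensor_para_size=1
# matches the single-GPU opt_summarization.py run.
subprocess.run(
    ["../../../build/bin/gpt_gemm",
     "8", "1", "230", "40", "128", "20480", "50272", "1", "1"],
    check=True,
)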
I fixed the problem by adding padding tokens to avoid the specific length, so I'll close the issue.
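For anyone who lands here: a minimal sketch of that workaround, assuming 230 is the length to avoid and pad_token_id is the tokenizer's pad id (the helper name is mine, not part of the FT examples):

import torch

def pad_past_bad_length(ids: torch.Tensor, bad_len: int,
                        pad_token_id: int) -> torch.Tensor:
    """Append one pad token whenever the prompt length equals the value
    that triggers the cuBLAS error, so that length is never used."""
    if ids.shape[1] == bad_len:
        pad = torch.full((ids.shape[0], 1), pad_token_id, dtype=ids.dtype)
        ids = torch.cat([ids, pad], dim=1)
    return ids

# Usage before calling the FT model:
# line_encoded = pad_past_bad_length(line_encoded, 230, tokenizer.pad_token_id)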