FasterTransformer
CUBLAS_STATUS_INTERNAL_ERROR for OPT-13B when input token length > 230 (with generated token length = 20)
Description
Master branch, V100 GPU.
GPU driver version: 470.82.01, CUDA version: 11.7
Reproduced Steps
1. Download the OPT-13B weights from Hugging Face.
2. Convert the weights to the FT format.
3. Generate gemm_config.in with "../../../build/bin/gpt_gemm 8 1 500 40 128 20480 50272 1 2".
4. Modify opt_summarization.py so that prompt length == 230, max_len = 20, beam = 1, batch_size = 8 (a sketch of this change follows the error output below).
5. Run and get the following error:
File "/mnt/noll/pytorch/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/noll/FasterTransformer/examples/pytorch/gpt/utils/gpt.py", line 406, in forward
outputs = self.model.forward(start_ids,
RuntimeError: [FT][ERROR] Assertion fail: /mnt/noll/FasterTransformer/src/fastertransformer/th_op/multi_gpu_gpt/ParallelGptOp.h:335
[FT][ERROR] [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_INTERNAL_ERROR /mnt/noll/FasterTransformer/src/fastertransformer/utils/cublasMMWrapper.cc:108
Traceback (most recent call last):
File "yx_test_opt.py", line 228, in <module>
main()
File "yx_test_opt.py", line 220, in main
summary, _ = summarize_ft(None)
File "yx_test_opt.py", line 200, in summarize_ft
output, ft_output_len = gpt(line_encoded, torch.IntTensor([len(line_encoded[0])]),
File "/mnt/noll/pytorch/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/noll/FasterTransformer/examples/pytorch/gpt/utils/gpt.py", line 406, in forward
outputs = self.model.forward(start_ids,
RuntimeError: [FT][ERROR] Assertion fail: /mnt/noll/FasterTransformer/src/fastertransformer/th_op/multi_gpu_gpt/ParallelGptOp.h:335
6. If I set batch_size to 1, I get a different error instead:
an illegal memory access was encountered /mnt/noll/FasterTransformer/src/fastertransformer/utils/memory_utils.cu:96
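As referenced in step 4 above, the change to opt_summarization.py looks roughly like the following. This is a minimal sketch, not the actual diff: the names gpt, tokenizer, and line are assumed from the traceback and the example script, and the positional call (ids, lengths, output_len, beam) is inferred from the gpt(...) call shown above; the real script may pass additional sampling arguments.

import torch

def run_ft_batch(gpt, tokenizer, line: str):
    """Reproduce the failing configuration: prompt length 230, 20 generated
    tokens, beam 1, batch size 8. 'gpt' and 'tokenizer' are the objects
    built by opt_summarization.py (hypothetical names)."""
    PROMPT_LEN, MAX_LEN, BEAM, BATCH_SIZE = 230, 20, 1, 8
    # Truncate the tokenized article to exactly PROMPT_LEN tokens,
    # then replicate it BATCH_SIZE times to form the batch.
    ids = tokenizer.encode(line, return_tensors="pt")[:, :PROMPT_LEN]
    ids = ids.repeat(BATCH_SIZE, 1).to(torch.int32)
    lengths = torch.IntTensor([PROMPT_LEN] * BATCH_SIZE)
    # Same call shape as in the traceback above.
    return gpt(ids, lengths, MAX_LEN, BEAM)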
Please provide the full reproduction steps, including the docker image, build scripts, conversion scripts, and how you modified opt_summarization.py. Please don't just say "convert the weight to FT weight"; provide the script you used.
Besides, your gpt_gemm command assumes an input length of 500, not the 230 you actually run. Also, I don't see tensor parallelism when you run opt_summarization.py, yet you set tensor_para_size to 2 when you run gpt_gemm.
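For context, my understanding of gpt_gemm's positional arguments is: batch_size, beam_width, max_input_len, head_num, size_per_head, inter_size, vocab_size, data_type (1 = fp16), tensor_para_size; verify this against your FT version's usage string. Under that assumption, an invocation matching the actual run (input length 230, batch 8, single GPU) would be:

import subprocess

# Assumed argument order: batch_size beam_width max_input_len head_num
# size_per_head inter_size vocab_size data_type tensor_para_size.
# OPT-13B shape: head_num=40, size_per_head=128, inter_size=20480,
# vocab_size=50272; data_type=1 selects fp16; tensor_para_size=1
# matches the single-GPU opt_summarization.py run.
subprocess.run(
    ["../../../build/bin/gpt_gemm",
     "8", "1", "230", "40", "128", "20480", "50272", "1", "1"],
    check=True,
)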
I fixed the problem by adding padding tokens to avoid the specific length, so I'll close the issue.
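For anyone who lands here: a minimal sketch of that workaround, assuming 230 is the length to avoid and pad_token_id is the tokenizer's pad id (the helper name is mine, not part of the FT examples):

import torch

def pad_past_bad_length(ids: torch.Tensor, bad_len: int,
                        pad_token_id: int) -> torch.Tensor:
    """Append one pad token whenever the prompt length equals the value
    that triggers the cuBLAS error, so that length is never used."""
    if ids.shape[1] == bad_len:
        pad = torch.full((ids.shape[0], 1), pad_token_id, dtype=ids.dtype)
        ids = torch.cat([ids, pad], dim=1)
    return ids

# Usage before calling the FT model:
# line_encoded = pad_past_bad_length(line_encoded, 230, tokenizer.pad_token_id)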