mT5 with fp16 RuntimeError: [FT][ERROR] CUDA runtime error: an illegal memory access was encountered FasterTransformer_v5.1/src/fastertransformer/utils/memory_utils.cu:96

Open PAOPAO6 opened this issue 3 years ago • 5 comments

Description

branch: v5.1
docker_image: nvidia/pytorch:21.11-py3 
gpu: T4

Error:
Traceback (most recent call last):
  File "../examples/pytorch/t5/summarization.py", line 382, in <module>
    main()
  File "../examples/pytorch/t5/summarization.py", line 289, in main
    summary_ft, _ = summarize_ft(datapoint)
  File "../examples/pytorch/t5/summarization.py", line 242, in summarize_ft
    output, ft_output_len = ft_t5(line_tokens,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/mt/hbl/FasterTransformer_v5.1/examples/pytorch/t5/../../../examples/pytorch/t5/utils/ft_decoding.py", line 355, in forward
    results = self.decoding.forward(beam_size,  # optional, can be None
  File "/data/mt/hbl/FasterTransformer_v5.1/examples/pytorch/t5/../../../examples/pytorch/t5/utils/ft_decoding.py", line 329, in forward
    results = self.decoding.forward(beam_width, max_seq_len,
RuntimeError: [FT][ERROR] CUDA runtime error: an illegal memory access was encountered /data/mt/hbl/FasterTransformer_v5.1/src/fastertransformer/utils/memory_utils.cu:96

Reproduced Steps

1. docker pull nvcr.io/nvidia/pytorch:21.11-py3

2. docker run -e NVIDIA_VISIBLE_DEVICES=0 --name ft_v5  -p 5222:22 -p 5280:8080 -p 5230:8030 -v /data/:/data/ -itd hub.cloud.ctripcorp.com/nvidia/pytorch:21.11-py3

3. python3 ../examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py \
        -saved_dir /data/mt/hbl/models/mt5/mt5_base/c-models \
        -in_file /data/mt/hbl/models/mt5/mt5_base/ \
        -inference_tensor_para_size 1 \
        -weight_data_type fp16

4. python3 ../examples/pytorch/t5/summarization.py  \
        --ft_model_location ${model_path}/c-models/ \
        --hf_model_location ${model_path}/ \
        --test_ft \
        --test_hf  \
        --data_type fp16

PAOPAO6, Sep 26 '22 13:09

Can you try the latest main branch?

byshiue, Sep 27 '22 00:09

I am using the main branch

PAOPAO6, Oct 05 '22 06:10

You said at the beginning that you were using v5.1. Do you mean that you have tested on both v5.1 and main?

byshiue, Oct 05 '22 08:10

Sorry, I misunderstood. I tested on main, not v5.1.

PAOPAO6, Oct 08 '22 09:10

I have verified that the main branch works correctly. Can you

  1. Make sure you have pulled the latest code, and
  2. Provide the script you used to build the project?

byshiue, Oct 10 '22 01:10

After updating to the latest code, the problem is solved. Thank you very much.

PAOPAO6, Oct 21 '22 06:10