mT5 with fp16 RuntimeError: [FT][ERROR] CUDA runtime error: an illegal memory access was encountered FasterTransformer_v5.1/src/fastertransformer/utils/memory_utils.cu:96
Description
branch: v5.1
docker_image: nvidia/pytorch:21.11-py3
gpu: T4
Error:
Traceback (most recent call last):
  File "../examples/pytorch/t5/summarization.py", line 382, in <module>
    main()
  File "../examples/pytorch/t5/summarization.py", line 289, in main
    summary_ft, _ = summarize_ft(datapoint)
  File "../examples/pytorch/t5/summarization.py", line 242, in summarize_ft
    output, ft_output_len = ft_t5(line_tokens,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/mt/hbl/FasterTransformer_v5.1/examples/pytorch/t5/../../../examples/pytorch/t5/utils/ft_decoding.py", line 355, in forward
    results = self.decoding.forward(beam_size, # optional, can be None
  File "/data/mt/hbl/FasterTransformer_v5.1/examples/pytorch/t5/../../../examples/pytorch/t5/utils/ft_decoding.py", line 329, in forward
    results = self.decoding.forward(beam_width, max_seq_len,
RuntimeError: [FT][ERROR] CUDA runtime error: an illegal memory access was encountered /data/mt/hbl/FasterTransformer_v5.1/src/fastertransformer/utils/memory_utils.cu:96
Steps to Reproduce
1. docker pull nvcr.io/nvidia/pytorch:21.11-py3
2. docker run -e NVIDIA_VISIBLE_DEVICES=0 --name ft_v5 -p 5222:22 -p 5280:8080 -p 5230:8030 -v /data/:/data/ -itd hub.cloud.ctripcorp.com/nvidia/pytorch:21.11-py3
3. python3 ../examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py \
       -saved_dir /data/mt/hbl/models/mt5/mt5_base/c-models \
       -in_file /data/mt/hbl/models/mt5/mt5_base/ \
       -inference_tensor_para_size 1 \
       -weight_data_type fp16
4. python3 ../examples/pytorch/t5/summarization.py \
       --ft_model_location ${model_path}/c-models/ \
       --hf_model_location ${model_path}/ \
       --test_ft \
       --test_hf \
       --data_type fp16
Can you try the latest main branch?
I am using the main branch
You said at the beginning that you are using v5.1. Do you mean that you have tested on both v5.1 and main?
Sorry, I misunderstood. I tested on main, not v5.1.
I have verified that the main branch works well. Can you:
- Make sure you have pulled the latest code, and
- Provide the script you used to build the project?
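For reference, the two requests above could be covered by something like the following. This is only a minimal sketch, assuming a local git clone that tracks `origin/main` and a CMake build with the PyTorch op enabled. The function name `ft_update_and_rebuild`, its arguments, and the default `SM=75` (the compute capability of the T4 mentioned in the report) are illustrative assumptions, not the build script actually used in this thread.

```shell
# Hypothetical helper: update a FasterTransformer checkout and rebuild it.
# $1 = path to the local clone (assumption), $2 = GPU SM version (default 75 for T4).
# The body runs in a subshell so that `set -e` and `cd` do not affect the caller.
ft_update_and_rebuild() (
    set -e
    repo_dir="$1"
    sm="${2:-75}"

    cd "$repo_dir"
    git fetch origin
    git checkout main
    git pull origin main
    git submodule update --init --recursive
    git log -1 --oneline      # record the exact commit that is being built

    mkdir -p build
    cd build
    cmake -DSM="$sm" -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON ..
    make -j"$(nproc)"
)
```

It would be called as `ft_update_and_rebuild /path/to/FasterTransformer 75`. Posting the commit printed by `git log -1 --oneline` together with the exact cmake line makes it easier for others to compare builds.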
After updating to the latest code, the problem is solved. Thank you very much.