Llama 2 Execution Bug
System Info
CPU: x86_64; memory: 1024 GB; GPU: 8x A6000 (48 GB each); TensorRT-LLM version: 0.9.0.dev20240226; NVIDIA driver version: 535.171.04; CUDA version: 12.2; OS: Ubuntu 22.04
Who can help?
No response
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
This bug appears when I follow examples/llama to build the engine with high TP values (4 or 8) and INT4 quantization.
First Step: convert_checkpoint
In examples/llama:
python convert_checkpoint.py --model_dir ./tmp/llama/7B/ \
    --output_dir ./tllm_checkpoint_8gpu_tp4_pp2 \
    --dtype float16 \
    --tp_size 4 \
    --pp_size 2 \
    --use_weight_only \
    --weight_only_precision int4
Second Step: Build engine
trtllm-build --checkpoint_dir ./tllm_checkpoint_8gpu_tp4_pp2 \
    --output_dir ./tmp/llama/7B/trt_engines/fp16/8-gpu/ \
    --gemm_plugin float16 \
    --weight_only_precision int4 \
    --max_batch_size 1
Third Step: Execute using the Python session
mpirun -n 8 python3 ../run.py --max_output_len 40 \
    --input_file 2048.txt \
    --engine_dir ./tmp/llama/7B/trt_engines/fp16/8-gpu/ \
    --tokenizer_dir ./tmp/llama/7B/ \
    --use_py_session
Here, 2048.txt, which ../run.py reads, contains inputs collected from the "theblackcat102/sharegpt-english" dataset.
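For context, a minimal sketch of how such an input file could be assembled is shown below. The dataset split and field names ("conversations", "text"), the number of prompts, and the assumption that run.py treats each line of a .txt input file as a separate prompt are all illustrative, not taken from the actual preprocessing.

from datasets import load_dataset

# Hypothetical preprocessing: write one prompt per line to 2048.txt.
# The split name and row schema below are assumptions; adjust to the real dataset layout.
ds = load_dataset("theblackcat102/sharegpt-english", split="train")

with open("2048.txt", "w", encoding="utf-8") as f:
    for row in ds.select(range(500)):
        prompt = row["conversations"][0]["text"].replace("\n", " ").strip()
        if prompt:
            f.write(prompt + "\n")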
Expected behavior
run.py should process all inputs from the file one by one.
Actual behavior
In most cases, it generates the output correctly. However, it randomly gets stuck on a particular input: runner.generate never returns, and GPU utilization stays pinned at 100% (normally around 60%). The input that triggers the hang is different every time, so I don't know exactly how it happens.
outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=args.max_output_len,
    max_attention_window_size=args.max_attention_window_size,
    end_id=end_id,
    pad_id=pad_id,
    temperature=args.temperature,
    top_k=args.top_k,
    top_p=args.top_p,
    num_beams=args.num_beams,
    length_penalty=args.length_penalty,
    repetition_penalty=args.repetition_penalty,
    presence_penalty=args.presence_penalty,
    frequency_penalty=args.frequency_penalty,
    stop_words_list=stop_words_list,
    bad_words_list=bad_words_list,
    lora_uids=args.lora_task_uids,
    prompt_table_path=args.prompt_table_path,
    prompt_tasks=args.prompt_tasks,
    streaming=args.streaming,
    output_sequence_lengths=True,
    return_dict=True)
torch.cuda.synchronize()
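Because the hang is non-deterministic, one way to narrow it down is a watchdog thread that dumps all Python stack traces if the generate call has not returned within a time limit. This is only a sketch; the 120-second limit and the hang_traceback.log path are arbitrary choices, not part of the original script.

import faulthandler
import threading

def start_watchdog(timeout_s=120, log_path="hang_traceback.log"):
    # If the returned event is not set within timeout_s seconds, dump every
    # thread's stack so the call that generate() is blocked in can be seen.
    done = threading.Event()

    def _watch():
        if not done.wait(timeout_s):
            with open(log_path, "a") as f:
                faulthandler.dump_traceback(file=f, all_threads=True)

    threading.Thread(target=_watch, daemon=True).start()
    return done

# Usage around the existing call in run.py:
#   done = start_watchdog()
#   outputs = runner.generate(batch_input_ids, ...)
#   torch.cuda.synchronize()
#   done.set()  # reached only if generation completed in time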
Additional notes
This usually happens for Llama 7B and Llama 13B with INT4 quantization and TP greater than 4, but the error cannot be reproduced exactly with the same input.
Falcon and other models also suffer from the same issue when TP is greater than 4 with INT4 quantization.
The bug occurs with various models and different quantization modes (including float16) when using TP = 4 or TP = 8. Occasionally, SM utilization spikes to 100% and the system freezes completely: runner.generate() never produces any output and GPU utilization remains at 100%.
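To capture the utilization pattern described above while the requests run, a background sampler over NVML can log per-GPU SM utilization. This is a sketch using the pynvml bindings (nvidia-ml-py package); the 1-second interval and log file name are arbitrary.

import threading
import time

import pynvml  # provided by the nvidia-ml-py package

def log_sm_utilization(interval_s=1.0, log_path="sm_util.log"):
    # Sample SM utilization of every visible GPU in the background so a
    # sustained 100% reading during a hang is visible in the log afterwards.
    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]

    def _sample():
        with open(log_path, "a") as f:
            while True:
                utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu
                         for h in handles]
                f.write(f"{time.time():.0f} {utils}\n")
                f.flush()
                time.sleep(interval_s)

    threading.Thread(target=_sample, daemon=True).start()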
Could you try adding --use_custom_all_reduce disable when building the engine?
The issue still happens when use_custom_all_reduce is disabled. It occurs randomly after running hundreds of batch size 1 requests; each request is issued independently, so there is no concurrency.
With use_custom_all_reduce disabled, the issue happens less frequently but it doesn't disappear.
The A6000s are connected via PCIe.
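Since each request runs independently at batch size 1, one way to isolate the offending input is a wrapper that feeds run.py one prompt at a time under a hard subprocess timeout and records any prompt whose run gets killed. The paths, the 300-second timeout, and the one-prompt-per-line file format below are assumptions for illustration.

import subprocess

TIMEOUT_S = 300  # assumed generous per-request limit; a stuck run is killed

with open("2048.txt", encoding="utf-8") as f:
    prompts = [line.strip() for line in f if line.strip()]

for i, prompt in enumerate(prompts):
    # Write a single-prompt input file and run the same command as above.
    with open("single_input.txt", "w", encoding="utf-8") as f:
        f.write(prompt + "\n")
    cmd = ["mpirun", "-n", "8", "python3", "../run.py",
           "--max_output_len", "40",
           "--input_file", "single_input.txt",
           "--engine_dir", "./tmp/llama/7B/trt_engines/fp16/8-gpu/",
           "--tokenizer_dir", "./tmp/llama/7B/",
           "--use_py_session"]
    try:
        subprocess.run(cmd, timeout=TIMEOUT_S)
    except subprocess.TimeoutExpired:
        print(f"Request {i} hung; prompt recorded in hung_prompts.txt")
        with open("hung_prompts.txt", "a", encoding="utf-8") as f:
            f.write(prompt + "\n")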