Llama 2 Execution Bug
System Info
CPU: x86_64; memory: 1024 GB; GPU: 8x A6000 (48 GB each); TensorRT-LLM version: 0.9.0.dev20240226; NVIDIA driver version: 535.171.04; CUDA version: 12.2; OS: Ubuntu 22.04
Who can help?
No response
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
This bug appears when I follow examples/llama to build the engine with high TP values (4 or 8) and INT4 quantization.
First Step: convert_checkpoint
In examples/llama:
python convert_checkpoint.py --model_dir ./tmp/llama/7B/ \
    --output_dir ./tllm_checkpoint_8gpu_tp4_pp2 \
    --dtype float16 \
    --tp_size 4 \
    --pp_size 2 \
    --use_weight_only \
    --weight_only_precision int4
Second Step: Build engine
trtllm-build --checkpoint_dir ./tllm_checkpoint_8gpu_tp4_pp2 \
    --output_dir ./tmp/llama/7B/trt_engines/fp16/8-gpu/ \
    --gemm_plugin float16 \
    --weight_only_precision int4 \
    --max_batch_size 1
Third Step: Execute using the Python session
mpirun -n 8 python3 ../run.py --max_output_len 40 \
    --input_file 2048.txt \
    --engine_dir ./tmp/llama/7B/trt_engines/fp16/8-gpu/ \
    --tokenizer_dir ./tmp/llama/7B/ \
    --use_py_session
Here, 2048.txt, which ../run.py reads, contains inputs collected from the "theblackcat102/sharegpt-english" dataset.
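For context, a minimal sketch of how such an input file could be assembled is shown below. The dataset split and field names ("conversations", "text"), the number of prompts, and the assumption that run.py treats each line of a .txt input file as a separate prompt are all illustrative, not taken from the actual preprocessing.

from datasets import load_dataset

# Hypothetical preprocessing: write one prompt per line to 2048.txt.
# The split name and row schema below are assumptions; adjust to the real dataset layout.
ds = load_dataset("theblackcat102/sharegpt-english", split="train")

with open("2048.txt", "w", encoding="utf-8") as f:
    for row in ds.select(range(500)):
        prompt = row["conversations"][0]["text"].replace("\n", " ").strip()
        if prompt:
            f.write(prompt + "\n")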
Expected behavior
run.py should process all inputs from the file one by one.
Actual behavior
In most cases, it generates the output correctly. However, it randomly gets stuck on a particular input: runner.generate never returns, and GPU utilization stays pinned at 100% (normally around 60%). The input that triggers the hang is different every time, so I don't know exactly how it happens.
outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=args.max_output_len,
    max_attention_window_size=args.max_attention_window_size,
    end_id=end_id,
    pad_id=pad_id,
    temperature=args.temperature,
    top_k=args.top_k,
    top_p=args.top_p,
    num_beams=args.num_beams,
    length_penalty=args.length_penalty,
    repetition_penalty=args.repetition_penalty,
    presence_penalty=args.presence_penalty,
    frequency_penalty=args.frequency_penalty,
    stop_words_list=stop_words_list,
    bad_words_list=bad_words_list,
    lora_uids=args.lora_task_uids,
    prompt_table_path=args.prompt_table_path,
    prompt_tasks=args.prompt_tasks,
    streaming=args.streaming,
    output_sequence_lengths=True,
    return_dict=True)
torch.cuda.synchronize()
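Because the hang is non-deterministic, one way to narrow it down is a watchdog thread that dumps all Python stack traces if the generate call has not returned within a time limit. This is only a sketch; the 120-second limit and the hang_traceback.log path are arbitrary choices, not part of the original script.

import faulthandler
import threading

def start_watchdog(timeout_s=120, log_path="hang_traceback.log"):
    # If the returned event is not set within timeout_s seconds, dump every
    # thread's stack so the call that generate() is blocked in can be seen.
    done = threading.Event()

    def _watch():
        if not done.wait(timeout_s):
            with open(log_path, "a") as f:
                faulthandler.dump_traceback(file=f, all_threads=True)

    threading.Thread(target=_watch, daemon=True).start()
    return done

# Usage around the existing call in run.py:
#   done = start_watchdog()
#   outputs = runner.generate(batch_input_ids, ...)
#   torch.cuda.synchronize()
#   done.set()  # reached only if generation completed in time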
Additional notes
This usually happens for Llama 7B and Llama 13B with INT4 quantization and TP greater than 4, but the error cannot be reproduced exactly with the same input.
Falcon and other models also suffer from the same issue when TP is greater than 4 with INT4 quantization.
The bug occurs with various models and different quantization modes (including float16) when using TP = 4 or TP = 8. Occasionally, SM utilization spikes to 100% and the system freezes completely: runner.generate() never produces any output and GPU utilization remains at 100%.
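To capture the utilization pattern described above while the requests run, a background sampler over NVML can log per-GPU SM utilization. This is a sketch using the pynvml bindings (nvidia-ml-py package); the 1-second interval and log file name are arbitrary.

import threading
import time

import pynvml  # provided by the nvidia-ml-py package

def log_sm_utilization(interval_s=1.0, log_path="sm_util.log"):
    # Sample SM utilization of every visible GPU in the background so a
    # sustained 100% reading during a hang is visible in the log afterwards.
    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]

    def _sample():
        with open(log_path, "a") as f:
            while True:
                utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu
                         for h in handles]
                f.write(f"{time.time():.0f} {utils}\n")
                f.flush()
                time.sleep(interval_s)

    threading.Thread(target=_sample, daemon=True).start()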
Could you try adding --use_custom_all_reduce disable when building the engine?
The issue still happens when use_custom_all_reduce is disabled. It occurs randomly after running hundreds of batch size 1 requests; each request is issued independently, so there is no concurrency.
With use_custom_all_reduce disabled, the issue happens less frequently but it doesn't disappear.
The A6000s are connected via PCIe.
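Since each request runs independently at batch size 1, one way to isolate the offending input is a wrapper that feeds run.py one prompt at a time under a hard subprocess timeout and records any prompt whose run gets killed. The paths, the 300-second timeout, and the one-prompt-per-line file format below are assumptions for illustration.

import subprocess

TIMEOUT_S = 300  # assumed generous per-request limit; a stuck run is killed

with open("2048.txt", encoding="utf-8") as f:
    prompts = [line.strip() for line in f if line.strip()]

for i, prompt in enumerate(prompts):
    # Write a single-prompt input file and run the same command as above.
    with open("single_input.txt", "w", encoding="utf-8") as f:
        f.write(prompt + "\n")
    cmd = ["mpirun", "-n", "8", "python3", "../run.py",
           "--max_output_len", "40",
           "--input_file", "single_input.txt",
           "--engine_dir", "./tmp/llama/7B/trt_engines/fp16/8-gpu/",
           "--tokenizer_dir", "./tmp/llama/7B/",
           "--use_py_session"]
    try:
        subprocess.run(cmd, timeout=TIMEOUT_S)
    except subprocess.TimeoutExpired:
        print(f"Request {i} hung; prompt recorded in hung_prompts.txt")
        with open("hung_prompts.txt", "a", encoding="utf-8") as f:
            f.write(prompt + "\n")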