Empty outputs with TRT engine built with W4A8, FP8 KV cache

System Info

tensorrt_llm 0.12.0.dev2024073000 CUDA 12.4 H100-PCIe

Who can help?

@Tracin @byshiue

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

python3 quantize.py --model_dir meta_llama_3_1_70B_instruct/fp16 \
  --dtype float16 \
  --qformat w4a8_awq \
  --kv_cache_dtype fp8 \
  --awq_block_size 128 \
  --output_dir /tmp/trt_checkpoint \
  --batch_size 8 \
  --calib_size 32
trtllm-build --checkpoint_dir /tmp/trt_checkpoint \
		--gemm_plugin auto \
		--gpt_attention_plugin auto \
		--paged_kv_cache enable \
		--remove_input_padding enable \
		--context_fmha enable \
		--use_fused_mlp \
		--max_seq_len 16000 \
		--max_num_tokens 16384 \
		--max_batch_size 8 \
		--output_dir w4a8_kvfp8 \
		--log_level verbose
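
Before involving Triton, the engine can be sanity-checked directly with the runtime example script from the same release (a sketch; this assumes the stock examples/run.py and its flags, with placeholder paths):

python3 examples/run.py --engine_dir w4a8_kvfp8 \
  --tokenizer_dir meta_llama_3_1_70B_instruct/fp16 \
  --input_text "What is machine learning?" \
  --max_output_len 128

If this also returns an empty string, the regression is in the quantized engine itself rather than in the serving stack.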

Spawned a Triton server and made a curl request:

curl -X POST localhost:8000/v2/models/ensemble_meta_llama_3_1_70B_instruct/generate -d '{"text_input": "What is machine learning?", "max_tokens": 128, "bad_words": "", "stop_words": "", "pad_id": 128004, "end_id": 128009, "beam_width": 1}'
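
To rule out a transport-level failure masking the result, the same request can be repeated with the HTTP status printed (identical payload; -s and -w are standard curl flags):

curl -s -w '\nHTTP %{http_code}\n' -X POST localhost:8000/v2/models/ensemble_meta_llama_3_1_70B_instruct/generate -d '{"text_input": "What is machine learning?", "max_tokens": 128, "bad_words": "", "stop_words": "", "pad_id": 128004, "end_id": 128009, "beam_width": 1}'

A 200 response with an empty "text_output" points at generation rather than at the server plumbing.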

Expected behavior

Non-blank output in the "text_output" field. Example from another TensorRT-LLM engine with a different quantization:

{"context_logits":0.0,
"cum_log_probs":0.0,
"generation_logits":0.0,
"model_name":"ensemble_meta_llama_3_1_70B_instruct",
"model_version":"1",
"output_log_probs":[0.0,0.0,0.0,0.0,0.0],
"sequence_end":false,
"sequence_id":0,
"sequence_start":false,
"text_output":"assistant\n\nMachine learning is a subfield of artificial intelligence (AI) that involves the use of algorithms and statistical models to enable machines to perform a specific task without using explicit instructions, instead relying on patterns and inference. In traditional programming, a computer is given a set of rules and data, and it follows those rules to produce a result. In contrast, machine learning involves training a model on data, so it can learn the rules and make predictions or decisions on its own.\n\nMachine learning is based on the idea that machines can learn from data and improve their performance on a task over time, without being explicitly programmed for that task. This"}

Actual behavior

Blank output in the "text_output" field:

{"context_logits":0.0,
"cum_log_probs":0.0,
"generation_logits":0.0,
"model_name":"ensemble_meta_llama_3_1_70B_instruct",
"model_version":"1",
"output_log_probs":[0.0,0.0,0.0,0.0,0.0],
"sequence_end":false,
"sequence_id":0,
"sequence_start":false,
"text_output":""}

Additional notes

I see the same issue with w4a8_awq quantization + fp16 KV cache.

However, a model with int4_awq quantization + fp16 KV cache works.

So the problem appears to be specific to the w4a8_awq quantization path.
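
For anyone triaging similar cases, it is also worth confirming that the quantized checkpoint recorded the intended algorithms (a sketch; the config.json path follows --output_dir above, and the quantization field names are assumed from the TensorRT-LLM checkpoint layout):

python3 -c "import json; print(json.load(open('/tmp/trt_checkpoint/config.json')).get('quantization'))"

For this build it should report quant_algo W4A8_AWQ and kv_cache_quant_algo FP8; anything else would mean the quantize.py step did not apply what was requested.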

dhruvmullick, Aug 21, 2024

Thanks for the feedback @dhruvmullick. This is a known issue in ModelOpt (which quantize.py calls under the hood). Once it is fixed, we will update this thread.

Barry-Delaney, Sep 4, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot], Oct 5, 2024

This issue was closed because it has been stalled for 15 days with no activity.

github-actions[bot], Oct 20, 2024

I'm seeing this issue in v13; is there an ETA on the fix? Thanks @Barry-Delaney

vkc1vk, Oct 28, 2024