TensorRT-LLM
Performance of W4A8 throughput on Hopper GPU.
System Info
- CPU: Intel(R) Xeon(R) Platinum 8468
- GPU: NVIDIA H800 (80 GB)
- TensorRT-LLM version: 0.12.0
Who can help?
@Tracin @byshiue
Reproduction
I followed the official procedure for Llama 2 7B quantization and compared the throughput of W4A8_AWQ, FP8, and FP16. The FP8 commands are shown below; the other two variants are sketched after them.
python3 benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-2-7b-hf token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 3000 > ./benchmarks/datasets/synthetic_128_128.txt
trtllm-bench --model meta-llama/Llama-2-7b-hf build --tp_size 1 --quantization FP8 --dataset ./benchmarks/datasets/synthetic_128_128.txt
trtllm-bench --model meta-llama/Llama-2-7b-hf throughput --dataset ./benchmarks/datasets/synthetic_128_128.txt --engine_dir /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1
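For reference, this is roughly how the W4A8_AWQ and FP16 runs were produced. It is a sketch of the analogous commands, not taken verbatim from the report: it assumes W4A8_AWQ is an accepted value for trtllm-bench's --quantization flag (matching the QuantAlgo name), that omitting --quantization gives the FP16 baseline, and that each build is benchmarked before the next one overwrites the same default engine path.

# Assumed W4A8_AWQ build and benchmark (same dataset and default engine path as above)
trtllm-bench --model meta-llama/Llama-2-7b-hf build --tp_size 1 --quantization W4A8_AWQ --dataset ./benchmarks/datasets/synthetic_128_128.txt
trtllm-bench --model meta-llama/Llama-2-7b-hf throughput --dataset ./benchmarks/datasets/synthetic_128_128.txt --engine_dir /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1

# Assumed FP16 baseline: no --quantization flag, everything else unchanged
trtllm-bench --model meta-llama/Llama-2-7b-hf build --tp_size 1 --dataset ./benchmarks/datasets/synthetic_128_128.txt
trtllm-bench --model meta-llama/Llama-2-7b-hf throughput --dataset ./benchmarks/datasets/synthetic_128_128.txt --engine_dir /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1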
Results
Throughput (tokens/sec)
| Model | Input/Output Lengths | TP | FP8 | W4A8_AWQ | FP16 |
|---|---|---|---|---|---|
| llama-2-7b | 128/128 | 1 | 18758 | 10146 | 11116 |
The throughput of W4A8_AWQ is lower than that of FP16 and much lower than that of FP8. Is this caused by the testing method, or by lower performance of the W4A8_AWQ compute kernels?