
W4A8 throughput performance on Hopper GPU.

Open · zkf331 opened this issue · 0 comments

System Info

CPU: Intel(R) Xeon(R) Platinum 8468
GPU: NVIDIA H800 (80 GB)
TensorRT-LLM version: 0.12.0

Who can help?

@Tracin @byshiue

Reproduction

I followed the official procedure for Llama 2 7B quantization and compared the throughput of W4A8_AWQ, FP8, and FP16. The commands below show the FP8 case; the other precisions are sketched after them.

python3 benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-2-7b-hf token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 3000 > ./benchmarks/datasets/synthetic_128_128.txt

trtllm-bench --model meta-llama/Llama-2-7b-hf build --tp_size 1 --quantization FP8 --dataset ./benchmarks/datasets/synthetic_128_128.txt

trtllm-bench --model meta-llama/Llama-2-7b-hf throughput --dataset ./benchmarks/datasets/synthetic_128_128.txt --engine_dir /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1
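The build command above only produces the FP8 engine. Presumably the W4A8 and FP16 engines were built the same way with only the quantization flag changed; a minimal sketch, assuming W4A8_AWQ is an accepted --quantization value in trtllm-bench 0.12 (it matches the label in the results table) and that omitting the flag yields a default FP16 engine:

trtllm-bench --model meta-llama/Llama-2-7b-hf build --tp_size 1 --quantization W4A8_AWQ --dataset ./benchmarks/datasets/synthetic_128_128.txt

trtllm-bench --model meta-llama/Llama-2-7b-hf build --tp_size 1 --dataset ./benchmarks/datasets/synthetic_128_128.txt

The same throughput command is then pointed at each resulting engine directory.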

Results

Throughput (tokens/sec)

| Model | Input/Output Lengths | TP | FP8 | W4A8_AWQ | FP16 |
|---|---|---|---|---|---|
| llama-2-7b | 128/128 | 1 | 18758 | 10146 | 11116 |
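To quantify the gap, the ratios can be computed directly from the table; a quick Python check (numbers copied from the results above):

# Relative throughput, tokens/sec values taken from the table
fp8, w4a8, fp16 = 18758, 10146, 11116
print(f"W4A8_AWQ vs FP16: {w4a8 / fp16:.2f}x")  # ~0.91x: slightly slower than FP16
print(f"W4A8_AWQ vs FP8:  {w4a8 / fp8:.2f}x")   # ~0.54x: roughly half of FP8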

As the ratios show, W4A8_AWQ throughput is about 9% below FP16 and roughly half of FP8. Is this caused by the testing method, or do the W4A8_AWQ compute kernels genuinely perform worse on this workload?

zkf331 · Oct 08 '24 13:10