
Llama-2 13B SmoothQuant W8A8 Per-Tensor TP-4 performance is poor in v0.9.0 release

Open vnkc1 opened this issue 1 year ago • 7 comments

System Info

GPUs: 4x A100 (40 GB memory each)
Release: tensorrt-llm 0.9.0

Who can help?

@Tracin

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. Install tensorrt-llm 0.9.0
  2. Create a Llama-2 13B chat, TP-4, SmoothQuant 0.5, Per-Tensor checkpoint and engine
  3. Create a Llama-2 13B chat, TP-4, SmoothQuant 0.5, Per-Channel + Per-Token checkpoint and engine
  4. Run mmlu.py
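
The build steps above can be sketched roughly as follows. Paths are placeholders and the exact flag spellings should be checked against the `examples/llama` README for v0.9.0:

```shell
# Hypothetical paths; flags per TensorRT-LLM v0.9.0 examples/llama.
# Per-Tensor SmoothQuant checkpoint (alpha = 0.5, TP-4):
python examples/llama/convert_checkpoint.py \
    --model_dir ./llama-2-13b-chat-hf \
    --output_dir ./ckpt_sq_per_tensor \
    --dtype float16 --smoothquant 0.5 --tp_size 4

# Per-Channel + Per-Token variant adds two flags:
python examples/llama/convert_checkpoint.py \
    --model_dir ./llama-2-13b-chat-hf \
    --output_dir ./ckpt_sq_per_channel_token \
    --dtype float16 --smoothquant 0.5 --per_channel --per_token --tp_size 4

# Build an engine from either checkpoint:
trtllm-build --checkpoint_dir ./ckpt_sq_per_tensor \
    --output_dir ./engine_sq_per_tensor --gemm_plugin float16
```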

Expected behavior

Similar performance on MMLU between Per-Tensor and Per-Channel + Per-Token

Actual behavior

MMLU 5-shot accuracy for Llama-2 13B chat, SmoothQuant 0.5, TP-4:

| Quantization mode       | Average | STEM  | Humanities | Social Science | Misc  |
|-------------------------|---------|-------|------------|----------------|-------|
| Per-Channel + Per-Token | 54.52   | 43.31 | 49.80      | 62.04          | 60.24 |
| Per-Tensor              | 29.41   | 29.56 | 25.65      | 28.31          | 31.77 |

Additional notes

n/a

vnkc1 avatar May 16 '24 21:05 vnkc1

Why do you expect the accuracy of Per-Tensor and Per-Channel + Per-Token to be close? It is expected that Per-Channel + Per-Token achieves higher accuracy.

byshiue avatar May 21 '24 08:05 byshiue

Is a 24% drop in MMLU 5-shot accuracy for Llama-2 13B expected?

ghost avatar May 21 '24 15:05 ghost

It is hard to say whether it is expected, because it depends on the quantization workflow and the model. Per-Channel + Per-Token is the recommended mode and preserves accuracy well. Could you explain why you want to use Per-Tensor?

byshiue avatar May 23 '24 07:05 byshiue
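
To illustrate why Per-Channel is recommended: when output channels have very different magnitudes (common in LLM weight matrices), a single per-tensor scale flushes small-magnitude channels to zero. A minimal pure-Python sketch (illustrative only, not TensorRT-LLM code; the weight values are made up):

```python
# Compare int8 weight-quantization error: per-tensor vs per-channel scaling.

def quantize_dequantize(vals, scale):
    # Symmetric int8 quantization: clamp to [-127, 127], then dequantize.
    q = [max(-127, min(127, round(v / scale))) for v in vals]
    return [qi * scale for qi in q]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Two output channels (rows) with very different magnitudes.
weights = [
    [0.01, -0.02, 0.015, -0.005],   # small-magnitude channel
    [5.0, -4.0, 3.5, -4.5],         # large-magnitude channel
]

# Per-tensor: one scale for the whole matrix, set by the global absmax.
t_scale = max(abs(v) for row in weights for v in row) / 127.0
per_tensor = [quantize_dequantize(row, t_scale) for row in weights]

# Per-channel: one scale per output channel (row).
per_channel = [quantize_dequantize(row, max(abs(v) for v in row) / 127.0)
               for row in weights]

err_tensor = sum(mse(w, q) for w, q in zip(weights, per_tensor))
err_channel = sum(mse(w, q) for w, q in zip(weights, per_channel))
print(err_channel < err_tensor)  # True: per-channel preserves the small channel
```

With the per-tensor scale, every value in the small channel rounds to 0 or ±1 step, destroying that channel; per-channel scaling keeps both channels accurate. This is the kind of error that compounds into the MMLU drop reported above.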

> It is hard to say whether it is expected, because it depends on the quantization workflow and the model. Per-Channel + Per-Token is the recommended mode and preserves accuracy well. Could you explain why you want to use Per-Tensor?

May I ask how the Per-Token is computed on the fly? Can you please point out where the code is?

Hongbosherlock avatar Jun 03 '24 08:06 Hongbosherlock

Here is an example.

byshiue avatar Jun 06 '24 03:06 byshiue

> Here is an example.

As far as I know, per-token is generally used together with SmoothQuant. I noticed that the SmoothQuant plugin includes per-token-plugin. What is the relationship between the per-token plugin code here and the code you referred to? thanks!

Hongbosherlock avatar Jul 15 '24 12:07 Hongbosherlock

The code you refer to quantizes the input tensor from higher precision to int8 before it enters the SmoothQuant GEMM.

byshiue avatar Jul 17 '24 07:07 byshiue
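
Conceptually, per-token quantization computes one scale per token (row) on the fly from that row's absolute maximum, quantizes the row to int8, and hands the int8 tensor plus the scales to the GEMM. A pure-Python sketch of the idea (not the actual TensorRT-LLM kernel; the activation values are made up):

```python
# Per-token dynamic int8 quantization: each token (row) of the activation
# tensor gets its own scale, computed at runtime from its absmax.

def per_token_quantize(activations):
    """activations: list of token rows (floats).
    Returns (int8 rows, per-token scales) such that
    activations[i][j] ~= q_rows[i][j] * scales[i]."""
    q_rows, scales = [], []
    for row in activations:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales

x = [[0.1, -0.2, 0.05],   # token 0: small activations
     [8.0, -6.0, 7.5]]    # token 1: large activations (outlier token)

q, s = per_token_quantize(x)
# Dequantize to verify the round-trip error stays small for each token,
# even though the two tokens differ in magnitude by ~40x.
deq = [[qi * sc for qi in row] for row, sc in zip(q, s)]
```

Because each token gets its own scale, an outlier token does not widen the quantization step for every other token, which is why per-token pairs well with SmoothQuant.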