TensorRT-LLM output differs between 0.9.0 and 0.10.0, and speed decreased after updating

System Info

CPU: x86, GPU: A100, OS: Red Hat, Driver: 535.154.05

Who can help?

No response

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

I use the same models, vicuna-7b-v1.3 and medusa-vicuna-7b-v1.3, in both setups. With version 0.9.0 and the image nvidia/cuda:12.1.0, the input ' Once upon' gives the following response and output token speed: (screenshot: 企业微信截图_17206069313952)

After I update to version 0.10.0 and use the 12.4.0 image, the response changes and the speed decreases: (screenshot: 企业微信截图_17206081811032)

I also ran the same model with vLLM, and its response matches the one from version 0.9.0. Why does updating the version change the result and reduce the speed? Thanks.

One difference I noticed between the two versions is the temperature: 0.9.0 uses temperature=0.0, while 0.10.0 uses temperature=1.0.
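To illustrate why a temperature change alone can alter the generated text, here is a minimal standalone sketch. It does not call TensorRT-LLM; the function name `sample_next_token` and the logit values are made up for illustration, and treating temperature 0.0 as greedy decoding is an assumed (though common) convention.

```python
import math
import random


def sample_next_token(logits, temperature, rng):
    """Pick a token index from raw logits (hypothetical helper, not a TensorRT-LLM API).

    Assumption: temperature == 0.0 means greedy decoding (argmax).
    Any positive temperature samples from softmax(logits / temperature).
    """
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.5, 0.3, -1.0]  # made-up values for illustration

# Greedy decoding ignores the random seed and always returns the same token.
greedy = {sample_next_token(logits, 0.0, random.Random(seed)) for seed in range(50)}

# Sampling at temperature 1.0 can return different tokens for different seeds.
sampled = {sample_next_token(logits, 1.0, random.Random(seed)) for seed in range(50)}

print("greedy picks:", greedy)
print("sampled picks:", sampled)
```

If the two versions really do default to different temperatures, setting the same sampling parameters explicitly in both runs would be a reasonable first check before comparing outputs.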

Expected behavior

After updating the version, speed should improve or stay consistent with the old version, and the model output should not change.

actual behavior

After updating the version, the result is different and the speed is slower.
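To compare speed across versions on equal terms, a small timing harness like the sketch below may help. It assumes a hypothetical `generate(prompt, max_new_tokens)` callable that returns the list of generated token ids; `fake_generate` is only a stand-in so the example runs by itself, and should be replaced with the real call for each version.

```python
import statistics
import time


def measure_tokens_per_second(generate, prompt, max_new_tokens, warmup=1, runs=5):
    """Return the median tokens/second over several timed runs.

    `generate` is a hypothetical callable, not a TensorRT-LLM API:
    it takes (prompt, max_new_tokens) and returns a list of token ids.
    """
    for _ in range(warmup):  # warm-up runs are excluded from timing
        generate(prompt, max_new_tokens)
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt, max_new_tokens)
        elapsed = time.perf_counter() - start
        rates.append(len(tokens) / elapsed)
    return statistics.median(rates)


def fake_generate(prompt, max_new_tokens):
    """Stand-in generator used only to make this sketch runnable."""
    time.sleep(0.01)
    return list(range(max_new_tokens))


rate = measure_tokens_per_second(fake_generate, " Once upon", 32)
print(f"median throughput: {rate:.1f} tokens/s")
```

Using the same prompt, the same maximum output length, and the same sampling settings in both versions keeps the comparison focused on the engine itself.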

additional notes

See Reproduction.

Jul 10 '24 10:07 sundayKK

I see the same issue with Llama-3 70B: the v0.10.0 engine runs 0.5-1.5 seconds slower than the same engine in v0.9.0.

Jul 16 '24 05:07 ghost

@sundayKK, please try the latest version of TensorRT-LLM.

Nov 14 '24 05:11 hello-11