Results differ between 0.9.0 and 0.10.0, and speed decreased after updating
System Info
- CPU: x86
- GPU: A100
- OS: Red Hat
- Driver: 535.154.05
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I use the same models in both cases: vicuna-7b-v1.3 and medusa-vicuna-7b-v1.3. With version 0.9.0 and the image nvidia/cuda:12.1.0, for the input 'Once upon', the response and output token speed look like:
After updating to version 0.10.0 with the 12.4.0 image, the response changes and the speed decreases:
When I run the same model with vLLM, I get the same response as with version 0.9.0. Why did the result change and the speed decrease after updating? Thanks~
I noticed one difference between the two versions, the temperature: 0.9.0 uses temperature 0.0, while 0.10.0 uses temperature 1.0.
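To illustrate why the temperature change alone could explain the different response, here is a toy sketch in plain Python. It is not TensorRT-LLM's actual sampling code: `sample_token` and the logits are made up for illustration. It assumes temperature 0.0 is treated as greedy decoding, which is the usual convention.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits (hypothetical helper, for illustration only)."""
    if temperature == 0.0:
        # Greedy decoding: always take the highest-logit token, so the result is deterministic.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax over temperature-scaled logits, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    # Draw one index at random, proportionally to the softmax weights.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.5, 0.5]  # made-up scores for three candidate tokens
rng = random.Random(0)

# Collect the distinct token indices chosen over many draws at each temperature.
greedy = {sample_token(logits, 0.0, rng) for _ in range(1000)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(1000)}

print("temperature 0.0 picks:", greedy)
print("temperature 1.0 picks:", sampled)
```

With temperature 0.0 the same token is chosen every time, while with temperature 1.0 lower-scoring tokens are also chosen, so the generated text can differ from run to run and from a greedy baseline. This sketch only concerns the output difference; it says nothing about the speed regression.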
Expected behavior
After updating, the speed should improve or remain consistent with the old version, and the model output should not change.
actual behavior
After updating, the result is different and the speed is slower.
additional notes
See Reproduction.
I see the same issue with Llama-3 70B: the v0.10.0 engine runs 0.5-1.5 seconds slower than the same engine in v0.9.0.
@sundayKK, please try to use the latest version of TrtLLM.