add TensorRT-LLM model worker
Why are these changes needed?
TensorRT-LLM can greatly improve the inference speed of LLMs. It would be helpful to support TensorRT-LLM in FastChat.
This commit implements serving a TensorRT engine through FastChat's API. Users need to convert their LLMs to TensorRT engines themselves before running `trt_model_worker.py`.
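For reference, here is a minimal sketch of the inference path such a worker wraps, assuming TensorRT-LLM's `ModelRunner` runtime API (signatures vary across TensorRT-LLM versions). The engine and tokenizer paths are placeholders; the actual `trt_model_worker.py` in this PR is the authoritative implementation.

```python
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

# Placeholder paths: point these at your pre-built TensorRT engine
# and the original HF checkpoint (for its tokenizer).
ENGINE_DIR = "/path/to/trt_engine"
TOKENIZER_DIR = "/path/to/hf_model"

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_DIR)
runner = ModelRunner.from_dir(engine_dir=ENGINE_DIR)

prompt = "What is TensorRT-LLM?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids[0]

# generate() takes a list of 1-D token-id tensors, one per request.
with torch.no_grad():
    outputs = runner.generate(
        batch_input_ids=[input_ids],
        max_new_tokens=128,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    )

# outputs has shape [batch, num_beams, seq_len]; strip the prompt tokens.
generated = outputs[0, 0, len(input_ids):]
print(tokenizer.decode(generated, skip_special_tokens=True))
```

Note that the engine itself must be built beforehand with TensorRT-LLM's build tooling; the worker only loads an already-compiled engine.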
Related issue number (if applicable)
issue #2595
Checks
- [x] I've run `format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed.
- [x] I've made sure the relevant tests are passing (if applicable).
I tried this PR successfully, but I wonder if it supports in-flight sequence batching?