add TensorRT-LLM model worker
Why are these changes needed?
TensorRT-LLM can greatly improve the inference speed of LLMs. It would be helpful to support TensorRT-LLM in FastChat.
This commit implements serving a TensorRT engine through FastChat's API. Users need to convert their LLMs to TensorRT engines themselves before running `trt_model_worker.py`.
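For reference, here is a minimal sketch of the inference path such a worker wraps, assuming TensorRT-LLM's `ModelRunner` runtime API (signatures vary across TensorRT-LLM versions). The engine and tokenizer paths are placeholders; the actual `trt_model_worker.py` in this PR is the authoritative implementation.

```python
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

# Placeholder paths: point these at your pre-built TensorRT engine
# and the original HF checkpoint (for its tokenizer).
ENGINE_DIR = "/path/to/trt_engine"
TOKENIZER_DIR = "/path/to/hf_model"

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_DIR)
runner = ModelRunner.from_dir(engine_dir=ENGINE_DIR)

prompt = "What is TensorRT-LLM?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids[0]

# generate() takes a list of 1-D token-id tensors, one per request.
with torch.no_grad():
    outputs = runner.generate(
        batch_input_ids=[input_ids],
        max_new_tokens=128,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    )

# outputs has shape [batch, num_beams, seq_len]; strip the prompt tokens.
generated = outputs[0, 0, len(input_ids):]
print(tokenizer.decode(generated, skip_special_tokens=True))
```

Note that the engine itself must be built beforehand with TensorRT-LLM's build tooling; the worker only loads an already-compiled engine.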
Related issue number (if applicable)
issue #2595
Checks
- [x] I've run `format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed.
- [x] I've made sure the relevant tests are passing (if applicable).
I tried this PR successfully, but I wonder if it supports in-flight sequence batching?