Inference time for the Mixtral-8x7B model slows down with every new request
System Info
GPUs: 2xA100 PCI-e
Who can help?
@kaiyux
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Using the sources from the branch corresponding to the v0.7.1 tag
Building the model:
python ../llama/build.py --model_dir ./Mixtral-8x7B-v0.1 \
--use_inflight_batching \
--enable_context_fmha \
--use_gemm_plugin \
--world_size 2 \
--pp_size 2 \
--output_dir ./trt_engines/mixtral/PP
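As a sanity check after the build (assumption: v0.7.1's build.py writes a config.json describing the engine next to the engine files), the settings relevant to in-flight batching can be confirmed like this:
# Assumption: the output directory contains a config.json with the plugin settings.
grep -E 'paged_kv_cache|gpt_attention_plugin|remove_input_padding' \
    ./trt_engines/mixtral/PP/config.json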
Following the steps from here to package it into Triton Inference Server.
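For completeness, a rough sketch of that packaging step, assuming the tensorrtllm_backend tooling (the paths, fill_template.py keys, and launch_triton_server.py flags below are from memory and may differ between versions):
# Sketch only: copy the built engines into the Triton model repo,
# point the tensorrt_llm model at them, and launch one Triton rank per GPU.
cp ./trt_engines/mixtral/PP/* triton_model_repo/tensorrt_llm/1/
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "gpt_model_type:inflight_fused_batching,gpt_model_path:/models/tensorrt_llm/1"
# world_size must match the number of GPUs the engine was built for (2 here)
python3 scripts/launch_triton_server.py --world_size=2 --model_repo=triton_model_repo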
Sending such requests:
import requests

headers = {
    "Content-Type": "application/json",
}
data = {
    "text_input": "Generate a random text up the max number of new tokens",
    "max_tokens": 300,
    "bad_words": "",
    "stop_words": ""
}
response = requests.post('.../v2/models/tensorrt_llm_bls/generate', headers=headers, json=data)
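The same request as a curl one-liner, for anyone who wants to reproduce without Python (the host part of the URL is omitted here, as above):
curl -s -X POST .../v2/models/tensorrt_llm_bls/generate \
    -H 'Content-Type: application/json' \
    -d '{"text_input": "Generate a random text up the max number of new tokens", "max_tokens": 300, "bad_words": "", "stop_words": ""}'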
Expected behavior
TensorRT-LLM gives better performance than TGI (~2.5 RPS for the quantized model with 300 output tokens)
Actual behavior
glebvazhenin@RYG7YPT4W7 ~ % k6 run script.js
execution: local
script: script.js
output: -
scenarios: (100.00%) 1 scenario, 250 max VUs, 8m30s max duration (incl. graceful stop):
* contacts: Up to 6.00 iterations/s for 8m0s over 3 stages (maxVUs: 250, gracefulStop: 30s)
INFO[0014] String input: Generate a random text up the max number of new tokens
INFO[0014] Status: 200 source=console
INFO[0014] Response time: 5970.938 ms source=console
INFO[0014] Generated tokens: ...
INFO[0020] Status: 200 source=console
INFO[0020] Response time: 8742.619 ms source=console
INFO[0020] Generated tokens: ...
INFO[0026] Status: 200 source=console
INFO[0026] Response time: 12220.316 ms source=console
INFO[0026] Generated tokens: ...
INFO[0032] Status: 200 source=console
INFO[0032] Response time: 16089.603 ms source=console
INFO[0032] Generated tokens: ...
INFO[0037] Status: 200 source=console
INFO[0037] Response time: 20116.343 ms source=console
INFO[0037] Generated tokens: ...
INFO[0043] Status: 200 source=console
INFO[0043] Response time: 24414.768 ms source=console
INFO[0043] Generated tokens: ...
INFO[0049] Status: 200 source=console
INFO[0049] Response time: 28801.644 ms source=console
INFO[0049] Generated tokens: ...
INFO[0055] Status: 200 source=console
INFO[0055] Response time: 33385.066 ms source=console
INFO[0055] Generated tokens: ...
As you can see, the response time increases rapidly. Moreover, it does not drop back down once the requests have been processed; it feels like they stay stuck inside the model until the container with Triton is restarted. Also, the GPU power draw stays high after that load, even some time after the load is released.
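For anyone reproducing, the GPU behaviour can be watched with a query along these lines (the fields below are just an example, not the exact command I used):
# Example only: sample power draw, utilization and memory once per second
nvidia-smi --query-gpu=timestamp,power.draw,utilization.gpu,memory.used \
    --format=csv -l 1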
Additional notes
Not sure whether this is a Triton problem or a TensorRT-LLM one, though. Any pointers on where to look would be much appreciated. Thanks!