Inference time for the Mixtral-8x7B model slows down with every new request
System Info
GPUs: 2xA100 PCI-e
Who can help?
@kaiyux
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Using the sources from the branch corresponding to the v0.7.1 tag
Building the model:
python ../llama/build.py --model_dir ./Mixtral-8x7B-v0.1 \
--use_inflight_batching \
--enable_context_fmha \
--use_gemm_plugin \
--world_size 2 \
--pp_size 2 \
--output_dir ./trt_engines/mixtral/PP
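As a sanity check after the build (assumption: v0.7.1's build.py writes a config.json describing the engine next to the engine files), the settings relevant to in-flight batching can be confirmed like this:
# Assumption: the output directory contains a config.json with the plugin settings.
grep -E 'paged_kv_cache|gpt_attention_plugin|remove_input_padding' \
    ./trt_engines/mixtral/PP/config.json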
Following the steps from here to package it into Triton Inference Server.
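For completeness, a rough sketch of that packaging step, assuming the tensorrtllm_backend tooling (the paths, fill_template.py keys, and launch_triton_server.py flags below are from memory and may differ between versions):
# Sketch only: copy the built engines into the Triton model repo,
# point the tensorrt_llm model at them, and launch one Triton rank per GPU.
cp ./trt_engines/mixtral/PP/* triton_model_repo/tensorrt_llm/1/
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "gpt_model_type:inflight_fused_batching,gpt_model_path:/models/tensorrt_llm/1"
# world_size must match the number of GPUs the engine was built for (2 here)
python3 scripts/launch_triton_server.py --world_size=2 --model_repo=triton_model_repo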
Sending such requests:
import requests

headers = {
    "Content-Type": "application/json",
}
data = {
    "text_input": "Generate a random text up the max number of new tokens",
    "max_tokens": 300,
    "bad_words": "",
    "stop_words": ""
}
response = requests.post('.../v2/models/tensorrt_llm_bls/generate', headers=headers, json=data)
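The same request as a curl one-liner, for anyone who wants to reproduce without Python (the host part of the URL is omitted here, as above):
curl -s -X POST .../v2/models/tensorrt_llm_bls/generate \
    -H 'Content-Type: application/json' \
    -d '{"text_input": "Generate a random text up the max number of new tokens", "max_tokens": 300, "bad_words": "", "stop_words": ""}'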
Expected behavior
TensorRT-LLM gives better performance than TGI (~2.5 RPS for the quantized model with 300 output tokens)
Actual behavior
glebvazhenin@RYG7YPT4W7 ~ % k6 run script.js
execution: local
script: script.js
output: -
scenarios: (100.00%) 1 scenario, 250 max VUs, 8m30s max duration (incl. graceful stop):
* contacts: Up to 6.00 iterations/s for 8m0s over 3 stages (maxVUs: 250, gracefulStop: 30s)
INFO[0014] String input: Generate a random text up the max number of new tokens
INFO[0014] Status: 200 source=console
INFO[0014] Response time: 5970.938 ms source=console
INFO[0014] Generated tokens: ...
INFO[0020] Status: 200 source=console
INFO[0020] Response time: 8742.619 ms source=console
INFO[0020] Generated tokens: ...
INFO[0026] Status: 200 source=console
INFO[0026] Response time: 12220.316 ms source=console
INFO[0026] Generated tokens: ...
INFO[0032] Status: 200 source=console
INFO[0032] Response time: 16089.603 ms source=console
INFO[0032] Generated tokens: ...
INFO[0037] Status: 200 source=console
INFO[0037] Response time: 20116.343 ms source=console
INFO[0037] Generated tokens: ...
INFO[0043] Status: 200 source=console
INFO[0043] Response time: 24414.768 ms source=console
INFO[0043] Generated tokens: ...
INFO[0049] Status: 200 source=console
INFO[0049] Response time: 28801.644 ms source=console
INFO[0049] Generated tokens: ...
INFO[0055] Status: 200 source=console
INFO[0055] Response time: 33385.066 ms source=console
INFO[0055] Generated tokens: ...
As you can see, the response time increases rapidly. Moreover, it does not drop back down once the requests have been processed; it feels like they stay stuck inside the model until the container with Triton is restarted. Also, the GPU power draw stays high after that load, even some time after the load is released.
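For anyone reproducing, the GPU behaviour can be watched with a query along these lines (the fields below are just an example, not the exact command I used):
# Example only: sample power draw, utilization and memory once per second
nvidia-smi --query-gpu=timestamp,power.draw,utilization.gpu,memory.used \
    --format=csv -l 1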
Additional notes
Not sure whether this is a Triton problem or a TensorRT-LLM one, though. Any pointers on where to look would be much appreciated. Thanks!