
[Bug] PyTorch Engine poor performance compared to vllm

Open jjjjohnson opened this issue 2 years ago • 3 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.

Describe the bug

I tried to benchmark the PyTorch engine's performance and found it to be very poor...

PyTorch Engine (concurrency: 4)

  • input token throughput: 101.53 tokens/s
  • output token throughput: 93.32 tokens/s
  • total token throughput: 194.85 tokens/s

vLLM (concurrency: 4)

  • input token throughput: 184.18 tokens/s
  • output token throughput: 169.28 tokens/s
  • total token throughput: 353.46 tokens/s
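To put the gap in perspective, here is a quick calculation over the total-throughput figures reported above (numbers copied from this run):

```python
# Total token throughput (tokens/s) at concurrency 4, from the benchmark above
pytorch_total = 194.85
vllm_total = 353.46

# In this run, vLLM's total throughput is roughly 1.8x that of the PyTorch engine
ratio = vllm_total / pytorch_total
print(f"vLLM / PyTorch engine total throughput: {ratio:.2f}x")
```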

Is this normal? Am I missing something when using the PyTorch engine?

Reproduction

  • Model: Qwen-14B
  • GPU: A100
  • LMDeploy: 0.3.0
  • Dataset: ShareGPT_V3_unfiltered_cleaned_split.json
  • Script: profile_restful_api.py

Environment

LMDeploy: 0.3.0

Error traceback

No response

jjjjohnson avatar Apr 18 '24 02:04 jjjjohnson

I also found that the performance of the PyTorch backend is about 50% of the TurboMind backend, see https://github.com/InternLM/lmdeploy/issues/1370

The performance of LMDeploy (TurboMind backend) and vLLM is comparable; in fact, LMDeploy is even better.

wanzhenchn avatar Apr 18 '24 03:04 wanzhenchn

concurrency: 4 is too small for a benchmark; it's better to use a larger concurrency. Qwen has not been fully optimized yet: we have not applied a custom kernel to the rotary embedding. This PR replaces apply_rotary_pos_emb with our custom kernel. Please give it a try.

# server
lmdeploy serve api_server \
    /path/to/Qwen-14B-Chat \
    --server-port 23333 \
    --backend pytorch \
    --cache-max-entry-count 0.95 \
    --max-batch-size 256

# client
python3 \
    lmdeploy/benchmark/profile_restful_api.py \
    http://0.0.0.0:23333 \
    /path/to/Qwen-14B-Chat \
    ShareGPT_V3_unfiltered_cleaned_split.json \
    --num_prompts 3000 \
    --concurrency 256

performance

concurrency: 256
elapsed_time: 491.255s

number of prompt tokens: 680073
number of completion tokens: 620970
token throughput (completion token): 1264.047 token/s
token throughput (prompt + completion token): 2648.405 token/s
RPS (request per second): 6.107 req/s
RPM (request per minute): 366.408 req/min
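As a sanity check, the reported throughput figures follow directly from the elapsed time and token counts; a quick recomputation (numbers taken from the run above):

```python
# Benchmark results reported above
elapsed = 491.255            # seconds
prompt_tokens = 680073
completion_tokens = 620970
num_prompts = 3000

# Derived throughput figures
completion_tps = completion_tokens / elapsed                 # completion token/s
total_tps = (prompt_tokens + completion_tokens) / elapsed    # prompt + completion token/s
rps = num_prompts / elapsed                                  # requests/s

print(f"completion: {completion_tps:.3f} token/s")
print(f"total:      {total_tps:.3f} token/s")
print(f"RPS:        {rps:.3f} req/s, RPM: {rps * 60:.3f} req/min")
```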

grimoire avatar Apr 18 '24 05:04 grimoire

Thanks @grimoire. My use case is low concurrency, so it is important to see whether it is fast enough at concurrency: 4.

jjjjohnson avatar Apr 18 '24 06:04 jjjjohnson