DefTruth
command:
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen1.5-72B-Chat \
  --tensor-parallel-size 8 \
  --max-model-len 8192 \
  --trust-remote-code \
  --disable-custom-all-reduce \
  --enable-prefix-caching \
  --tokenizer-mode slow \
  --enforce-eager \
  --gpu-memory-utilization 0.9...
```
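Once the server above is up, it exposes an OpenAI-compatible chat endpoint. A minimal sketch of a client request, assuming the default host/port `localhost:8000` (not shown in the command above) and using only the standard library:

```python
import json
import urllib.request

# Build a chat-completions payload; the "model" field must match the
# --model value passed to the api_server command above.
payload = {
    "model": "Qwen1.5-72B-Chat",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With a live server, send it and read the JSON response:
# resp = json.load(urllib.request.urlopen(req))
print(json.loads(req.data)["model"])  # → Qwen1.5-72B-Chat
```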
> If you can bisect to find the commit that leads to the degradation, that would be helpful. Otherwise, it is very difficult to answer a generic report of...
I tested on an L20; I'm not sure whether the device is the same as in CI.
With CUDA graph: 35.7 ms -> 36.7 ms; without CUDA graph: 39 ms -> 45 ms.
@RunningLeon May I ask why internlm2 needs its own convert_checkpoint.py instead of reusing llama's convert_checkpoint.py? internlm uses llama's convert_checkpoint.py directly.
> > @RunningLeon May I ask why internlm2 needs its own convert_checkpoint.py instead of reusing llama's convert_checkpoint.py? internlm uses llama's convert_checkpoint.py directly.
>
> hi, internlm2's W_qkv is fused into a single tensor, and some parameter names are not aligned with llama's, so llama's convert_checkpoint.py cannot be used directly.

Thank you for this explanation!
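The fused W_qkv mentioned above can be illustrated with a minimal NumPy sketch. The dimensions and the simple q|k|v row concatenation here are assumptions for illustration only; internlm2's real checkpoint interleaves heads (GQA layout), so the actual converter must follow the checkpoint's own ordering:

```python
import numpy as np

# Hypothetical toy dims: hidden=8, 2 query heads, 1 kv head (GQA), head_dim=4.
hidden, num_q, num_kv, head_dim = 8, 2, 1, 4

def split_fused_qkv(w_qkv, num_q, num_kv, head_dim):
    """Split a fused [(num_q + 2*num_kv) * head_dim, hidden] weight into q, k, v.

    Assumes plain q|k|v concatenation along the row axis; a real internlm2
    converter must account for the checkpoint's interleaved head layout.
    """
    q_rows = num_q * head_dim
    kv_rows = num_kv * head_dim
    q = w_qkv[:q_rows]
    k = w_qkv[q_rows:q_rows + kv_rows]
    v = w_qkv[q_rows + kv_rows:]
    return q, k, v

w = np.arange((num_q + 2 * num_kv) * head_dim * hidden,
              dtype=np.float32).reshape(-1, hidden)
q, k, v = split_fused_qkv(w, num_q, num_kv, head_dim)
print(q.shape, k.shape, v.shape)  # → (8, 8) (4, 8) (4, 8)
```

Splitting like this (plus renaming parameters to llama's scheme) is exactly the kind of work that motivates a separate conversion script.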
Feel free to follow our [PaddlePaddle/FastDeploy](https://github.com/PaddlePaddle/FastDeploy) repo; for a better deployment experience, please try FastDeploy 😎
> Please post the inference log and the visualized results of the model, not the compiled library.
> This GhostNet result looks abnormal.
First make the SSD code compatible with the latest develop branch, then open a new PR.