coder4nlp
@kzjeef Would you be able to assist in resolving these matters? Thanks
> Sure, I will test this on my local machine.
>
> What's the model size in your test? And what's the GPU type?

Hello, the models I used are Qwen/Qwen2-VL-2B...
@kzjeef Thank you for your test results. Please see my previous reply: the concurrency is 10, and I have already provided the complete test code.
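For reference, a minimal sketch of the kind of concurrency-10 benchmark being discussed. This is not the actual test code from the thread; `send_request` is a placeholder you would replace with a real HTTP call to the dashinfer or vllm endpoint under test:

```python
import time
from concurrent.futures import ThreadPoolExecutor

CONCURRENCY = 10  # matches the concurrency reported in the thread


def send_request(i):
    """Placeholder for one inference request; swap in a real call
    to the serving endpoint being benchmarked."""
    start = time.perf_counter()
    time.sleep(0.01)  # stand-in for server latency
    return time.perf_counter() - start


def run_benchmark(n_requests=CONCURRENCY):
    """Fire n_requests in parallel and return per-request latencies."""
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        return list(pool.map(send_request, range(n_requests)))


if __name__ == "__main__":
    lats = run_benchmark()
    print(f"max latency: {max(lats):.3f}s, mean: {sum(lats) / len(lats):.3f}s")
```

Measuring the max latency across a concurrent batch (rather than a single isolated request) is what surfaces the 10-second behavior reported above.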
@kzjeef Could you please tell me how to set up "with vit cache"?
When I send requests concurrently, dashinfer takes 10 seconds per request.
@kzjeef Even without concurrency, with a single sample, dashinfer was extremely slow in my tests. I have no idea what the reason is.
@kzjeef As the log is too long, I have placed it in the attachment. [server.txt](https://github.com/user-attachments/files/21404704/server.txt)
```
[StopRequest] Request ID: 00000000000000000000000000000192, Context time(ms): 46, Generate time(ms): 8121, Context Length: 383, Generated Length: 147, Context TPS: 8308.03, Generate TPS: 18.101, Prefix Cache Len: 0
```
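For context, the TPS figures in that log line follow directly from the reported lengths and times; a small sketch of the arithmetic (the field values are copied from the log above):

```python
# Figures from the [StopRequest] log line
context_len, context_ms = 383, 46
gen_len, gen_ms = 147, 8121

context_tps = context_len / (context_ms / 1000)   # prefill tokens/s
generate_tps = gen_len / (gen_ms / 1000)          # decode tokens/s

# ~8326 from the rounded 46 ms; the log's 8308.03 suggests the
# internal timer has sub-millisecond resolution
print(f"Context TPS: {context_tps:.2f}")
# 18.101, matching the log exactly
print(f"Generate TPS: {generate_tps:.3f}")
```

The slow part is clearly decode (18 tokens/s), not prefill, which narrows down where to look.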
@kzjeef After upgrading dashinfer from 2.0.0 to 2.1.0, the latency of a single request dropped from 10 seconds to 1 second. However, vllm only took...
In vllm, **Prefix cache hit rate: 99.5%**

```
[loggers.py:111] Engine 000: Avg prompt throughput: 2166.9 tokens/s, Avg generation throughput: 470.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache...
```
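A 99.5% hit rate means nearly every prompt token was already in the KV cache. A toy token-level simulation (not vllm's actual implementation) illustrates why requests sharing a long common prefix drive the rate that high:

```python
def prefix_cache_hit_rate(prompts):
    """Toy prefix cache: for each prompt, count how many leading
    tokens are covered by a previously cached prefix."""
    cached = set()  # cached prefixes, stored as tuples of tokens
    hits = total = 0
    for prompt in prompts:
        tokens = prompt.split()
        # find the longest cached prefix of this prompt
        match = 0
        for i in range(len(tokens), 0, -1):
            if tuple(tokens[:i]) in cached:
                match = i
                break
        hits += match
        total += len(tokens)
        # cache every prefix of this prompt for later requests
        for i in range(1, len(tokens) + 1):
            cached.add(tuple(tokens[:i]))
    return hits / total


# 20 requests sharing a long system prompt, differing only at the tail
shared = "system prompt " * 50
prompts = [shared + f"question {i}" for i in range(20)]
print(f"hit rate: {prefix_cache_hit_rate(prompts):.1%}")  # above 90%
```

If the benchmark sends many requests with near-identical prompts, a 99.5% hit rate mostly reflects cache reuse, which would flatter vllm's average prompt throughput in this comparison.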