coder4nlp
@kzjeef Would you be able to assist in resolving these matters? Thanks
> Sure, I will test this on my local machine.
>
> What's the model size in your test? And what's the GPU type?

Hello, the models I used are Qwen/Qwen2-VL-2B...
@kzjeef Thank you for your test results. Please see my previous reply: the concurrency is 10, and I have already provided the complete test code.
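For reference, a minimal sketch of the kind of concurrency-10 benchmark being discussed. This is not the actual test code from the thread; `send_request` is a placeholder you would replace with a real HTTP call to the dashinfer or vllm endpoint under test:

```python
import time
from concurrent.futures import ThreadPoolExecutor

CONCURRENCY = 10  # matches the concurrency reported in the thread


def send_request(i):
    """Placeholder for one inference request; swap in a real call
    to the serving endpoint being benchmarked."""
    start = time.perf_counter()
    time.sleep(0.01)  # stand-in for server latency
    return time.perf_counter() - start


def run_benchmark(n_requests=CONCURRENCY):
    """Fire n_requests in parallel and return per-request latencies."""
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        return list(pool.map(send_request, range(n_requests)))


if __name__ == "__main__":
    lats = run_benchmark()
    print(f"max latency: {max(lats):.3f}s, mean: {sum(lats) / len(lats):.3f}s")
```

Measuring the max latency across a concurrent batch (rather than a single isolated request) is what surfaces the 10-second behavior reported above.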
@kzjeef Could you please tell me how to set up "with vit cache"?
When I send requests concurrently, dashinfer takes 10 seconds per request.
@kzjeef Even without concurrency, with a single sample, dashinfer was extremely slow in my tests. I have no idea what the reason is.
@kzjeef As the log is too long, I have placed it in the attachment. [server.txt](https://github.com/user-attachments/files/21404704/server.txt)
```
[StopRequest] Request ID: 00000000000000000000000000000192, Context time(ms): 46, Generate time(ms): 8121, Context Length: 383, Generated Length: 147, Context TPS: 8308.03, Generate TPS: 18.101, Prefix Cache Len: 0
```
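For context, the TPS figures in that log line follow directly from the reported lengths and times; a small sketch of the arithmetic (the field values are copied from the log above):

```python
# Figures from the [StopRequest] log line
context_len, context_ms = 383, 46
gen_len, gen_ms = 147, 8121

context_tps = context_len / (context_ms / 1000)   # prefill tokens/s
generate_tps = gen_len / (gen_ms / 1000)          # decode tokens/s

# ~8326 from the rounded 46 ms; the log's 8308.03 suggests the
# internal timer has sub-millisecond resolution
print(f"Context TPS: {context_tps:.2f}")
# 18.101, matching the log exactly
print(f"Generate TPS: {generate_tps:.3f}")
```

The slow part is clearly decode (18 tokens/s), not prefill, which narrows down where to look.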
@kzjeef After upgrading dashinfer from 2.0.0 to 2.1.0, the latency of a single request dropped from 10 seconds to 1 second. However, vllm only took...
In vllm, **Prefix cache hit rate: 99.5%**

```
[loggers.py:111] Engine 000: Avg prompt throughput: 2166.9 tokens/s, Avg generation throughput: 470.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache...
```
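A 99.5% hit rate means nearly every prompt token was already in the KV cache. A toy token-level simulation (not vllm's actual implementation) illustrates why requests sharing a long common prefix drive the rate that high:

```python
def prefix_cache_hit_rate(prompts):
    """Toy prefix cache: for each prompt, count how many leading
    tokens are covered by a previously cached prefix."""
    cached = set()  # cached prefixes, stored as tuples of tokens
    hits = total = 0
    for prompt in prompts:
        tokens = prompt.split()
        # find the longest cached prefix of this prompt
        match = 0
        for i in range(len(tokens), 0, -1):
            if tuple(tokens[:i]) in cached:
                match = i
                break
        hits += match
        total += len(tokens)
        # cache every prefix of this prompt for later requests
        for i in range(1, len(tokens) + 1):
            cached.add(tuple(tokens[:i]))
    return hits / total


# 20 requests sharing a long system prompt, differing only at the tail
shared = "system prompt " * 50
prompts = [shared + f"question {i}" for i in range(20)]
print(f"hit rate: {prefix_cache_hit_rate(prompts):.1%}")  # above 90%
```

If the benchmark sends many requests with near-identical prompts, a 99.5% hit rate mostly reflects cache reuse, which would flatter vllm's average prompt throughput in this comparison.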