Inference with qwen2.5-vl is very slow: DashInfer is about 10 times slower than vLLM. How can this be resolved?
What version are you using? And are you aware of this issue on qwen2-vl models as well?
Also, can you provide your test command?
@kzjeef
dashinfer==2.0.0rc3
dashinfer-vlm==2.3.0
transformers==4.51.3
When using qwen2-vl, the startup failed.
dashinfer_vlm_serve --model /models/Qwen/Qwen2-VL-2B --host 127.0.0.1
Start converting ONNX model!
Loading safetensors checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.12it/s]
DFN_vit.py:459: TracerWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
for t, h, w in grid_thw:
torch.Size([1, 3])
batch: tensor(1)
Export to ONNX file successfully! The ONNX file stays in /root/.cache/as_model/Qwen2-VL-2B/model.onnx
Start converting TRT engine!
[07/23/2025-14:35:04] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 853, GPU 8385 (MiB)
[07/23/2025-14:35:12] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +2973, GPU +752, now: CPU 3903, GPU 9137 (MiB)
[07/23/2025-14:35:13] [TRT] [W] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[07/23/2025-14:35:13] [TRT] [W] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
Succeeded parsing /root/.cache/as_model/Qwen2-VL-2B/model.onnx
[07/23/2025-14:35:13] [TRT] [W] /vision_model/Reshape_2: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/23/2025-14:35:13] [TRT] [W] /vision_model/Reshape_2: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 2 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/23/2025-14:35:13] [TRT] [W] /vision_model/Reshape_4: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/23/2025-14:35:13] [TRT] [W] /vision_model/Reshape_4: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 2 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/23/2025-14:35:15] [TRT] [I] Graph optimization time: 1.81676 seconds.
[07/23/2025-14:35:15] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[07/23/2025-14:35:15] [TRT] [W] Was not able to infer a kOPT value for tensor /vision_model/Squeeze_output_0. Using one(s).
[07/23/2025-14:35:15] [TRT] [W] Was not able to infer a kOPT value for tensor /vision_model/ReduceMax_output_0. Using one(s).
[07/23/2025-14:35:18] [TRT] [I] Detected 2 inputs and 1 output network tensors.
[07/23/2025-14:36:17] [TRT] [E] 1: autotuning: CUDA error 2 allocating 687194768222-byte buffer: out of memory
[07/23/2025-14:36:17] [TRT] [E] 1: [codeGenerator.cpp::compileGraph::895] Error Code 1: Myelin (autotuning: CUDA error 2 allocating 687194768222-byte buffer: out of memory)
Traceback (most recent call last):
File "/usr/local/bin/dashinfer_vlm_serve", line 33, in <module>
sys.exit(load_entry_point('dashinfer-vlm', 'console_scripts', 'dashinfer_vlm_serve')())
File "dashinfer_vlm/api_server/server.py", line 685, in main
init()
File "dashinfer_vlm/api_server/server.py", line 94, in init
model_loader.load_model(direct_load=False, load_format="auto")
File "dashinfer_vlm/vl_inference/utils/model_loader.py", line 165, in serialize
onnx_trt_obj.generate_trt_engine(onnxFile, self.vision_model_path)
File "dashinfer_vlm/vl_inference/utils/trt/onnx_to_plan.py", line 195, in generate_trt_engine
raise RuntimeError("Failed building %s" % planFile)
RuntimeError: Failed building /root/.cache/as_model/Qwen2-VL-2B/model.plan
I20250723 14:36:18.483180 222505 as_engine.cpp:330] ~AsEngine called
I20250723 14:36:18.483215 222505 weight_manager.cpp:721] ~WeightManager
I20250723 14:36:18.483223 222505 as_engine.cpp:348] ~AsEngineImpl finished.
I20250723 14:36:18.483337 222627 thread_pool_with_id.h:91] dummy message for wake up.
I20250723 14:36:18.483361 222627 thread_pool_with_id.h:45] Thread Pool with id: 0 Exit!!!
Hi @kzjeef, thanks for your response. With qwen2.5-vl the service starts normally, but it is extremely slow. Here is my startup command. Also, when the input contains multiple images, an error occurs.
dashinfer_vlm_serve --model /models/Qwen/Qwen2.5-VL-3B-Instruct --port 8000 --host 127.0.0.1 --vision_engine transformers
import time
import concurrent.futures

from openai import OpenAI

# Hypothetical client setup; point base_url at your dashinfer_vlm_serve endpoint.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

def send_request():
    start_time = time.time()
    response = client.chat.completions.create(
        model="qwen/Qwen2.5-VL-3B-Instruct",
        messages=messages,
        stream=False,
        max_completion_tokens=1024,
        temperature=0.1,
    )
    end_time = time.time()
    latency = end_time - start_time
    return latency
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the content of the picture"},
        {
            "type": "image_url",
            "image_url": {
                "url": "https://farm4.staticflickr.com/3075/3168662394_7d7103de7d_z_d.jpg",
            }
        },
        # {
        #     "type": "image_url",
        #     "image_url": {
        #         "url": "https://farm2.staticflickr.com/1533/26541536141_41abe98db3_z_d.jpg",
        #     }
        # },
    ],
}]
def benchmark(num_requests, num_workers):
    latencies = []
    start_time = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(send_request) for _ in range(num_requests)]
        for future in concurrent.futures.as_completed(futures):
            latencies.append(future.result())
    end_time = time.time()
    total_time = end_time - start_time
    qps = num_requests / total_time
    average_latency = sum(latencies) / len(latencies)
    # Note: this assumes every request generates the full 1024 tokens; unused below.
    throughput = num_requests * 1024 / total_time
    print(f"Total Time: {total_time:.2f} seconds")
    print(f"QPS: {qps:.2f}")
    print(f"Average Latency: {average_latency:.2f} seconds")

if __name__ == "__main__":
    num_requests = 100
    num_workers = 10
    benchmark(num_requests, num_workers)
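As a side note, the latency list collected above can also be used to report tail latency, which helps distinguish uniformly slow requests from queueing effects under concurrency. A minimal sketch (the `latency_percentiles` helper is hypothetical, not part of the original script):

```python
import statistics

def latency_percentiles(latencies):
    """Return (p50, p95) from a list of per-request latencies in seconds."""
    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    return cuts[49], cuts[94]

# Example with synthetic latencies of 1..100 seconds:
p50, p95 = latency_percentiles([float(i) for i in range(1, 101)])
print(f"p50: {p50:.2f}s, p95: {p95:.2f}s")
```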
@kzjeef. I have fixed the qwen2-vl issue, but the inference time is still very slow.
what's vllm's version?
what's your --vision_engine parameter in qwen2-vl test?
The --vision_engine parameter was not set (default). vllm==0.8.5.post1
@kzjeef Would you be able to assist in resolving these matters? Thanks
Sure, I will test this locally.
What model size did you use in your test? And what's the GPU type?
Qwen/Qwen2-VL-2B
Hello, the models I used are Qwen/Qwen2-VL-2B and Qwen/Qwen2.5-VL-3B-Instruct, and the GPU is an H100.
dashinfer_vlm_serve --model /models/Qwen/Qwen2.5-VL-3B-Instruct --port 8000 --host 127.0.0.1 --vision_engine transformers
dashinfer_vlm_serve --model /models/Qwen/Qwen2-VL-2B --host 127.0.0.1
Hi @coder4nlp, here are my test results.
Test Env:
Hardware:
H20 single card
DashInfer: latest source code
dashinfer_vlm: latest source code
Data preparation: basically, download the data referenced in dash-infer/multimodal/README.md and put it in the tests/data folder:
script:
wget https://huggingface.co/datasets/OpenGVLab/InternVL-Chat-V1-2-SFT-Data/resolve/main/opensource/docvqa_train_10k.jsonl
wget https://huggingface.co/datasets/OpenGVLab/InternVL-Chat-V1-2-SFT-Data/resolve/main/data/share_textvqa.zip
unzip share_textvqa.zip
Benchmark command (run after starting the server):
python tests/benchmark_openai_api.py --prompt-file tests/data/docvqa_train_10k.jsonl --image-folder tests/data/share_textvqa/images/ --req-nums 100 \
--batch-size 32 \
--image-nums-mean 3 \
--image-nums-range 1 \
--response-mean 120 \
--response-len-range 64
It will run the image test with batch size 32 (concurrency 32).
Server command:
Run the server under the dash-infer/multimodal folder.
Because the data is in local files, the related paths should be accessible; my directory layout is like this for reference:
dash-infer/multimodal# tree -L 3
Qwen2-VL-2B-Instruct
start command (ViT uses transformers):
dashinfer_vlm_serve --model /model/Qwen2-VL-2B-Instruct/ --host 127.0.0.1 --vision_engine transformers
Result:
1st time :
Total time: 38.10 sec
input token lens: 2783.60 (average) / 278360 (total) ---
output token lens: 12.78 (average) / 1278 (total) ---
QPS: 2.62 requests/sec, TPS: 33.54 tokens/sec
2nd time (with vit cache):
Total time: 8.72 sec
input token lens: 2783.60 (average) / 278360 (total) ---
output token lens: 10.72 (average) / 1072 (total) ---
QPS: 11.47 requests/sec, TPS: 122.97 tokens/sec
start command (ViT uses TensorRT):
dashinfer_vlm_serve --model /model/Qwen2-VL-2B-Instruct/ --host 127.0.0.1 --vision_engine tensorrt
Result:
1st time :
Total time: 32.99 sec
input token lens: 2783.60 (average) / 278360 (total) ---
output token lens: 11.33 (average) / 1133 (total) ---
QPS: 3.03 requests/sec, TPS: 34.35 tokens/sec
2nd time (with vit cache):
Total time: 8.76 sec
input token lens: 2783.60 (average) / 278360 (total) ---
output token lens: 9.37 (average) / 937 (total) ---
QPS: 11.42 requests/sec, TPS: 106.97 tokens/sec
start command (ViT use TRT + FP8 dynamic quant)
dashinfer_vlm_serve --model /cfs_cloud_code/asherszhang/model/Qwen2-VL-2B-Instruct/ --host 127.0.0.1 --vision_engine tensorrt --quant-type fp8
result
1st time :
Total time: 29.88 sec
input token lens: 2783.60 (average) / 278360 (total) ---
output token lens: 11.52 (average) / 1152 (total) ---
QPS: 3.35 requests/sec, TPS: 38.55 tokens/sec
2nd time (with vit cache):
Total time: 6.84 sec
input token lens: 2783.60 (average) / 278360 (total) ---
output token lens: 9.30 (average) / 930 (total) ---
QPS: 14.61 requests/sec, TPS: 135.88 tokens/sec
start command (ViT use TRT + FP8 dynamic quant + prefix cache)
dashinfer_vlm_serve --model /cfs_cloud_code/asherszhang/model/Qwen2-VL-2B-Instruct/ --host 127.0.0.1 --vision_engine tensorrt --enable-prefix-cache --quant-type fp8
result
1st time :
Total time: 29.53 sec
input token lens: 2783.60 (average) / 278360 (total) ---
output token lens: 10.95 (average) / 1095 (total) ---
QPS: 3.39 requests/sec, TPS: 37.09 tokens/sec
2nd time (with vit cache):
Total time: 2.81 sec
input token lens: 2783.60 (average) / 278360 (total) ---
output token lens: 11.25 (average) / 1125 (total) ---
QPS: 35.64 requests/sec, TPS: 400.96 tokens/sec
vLLM
version: v0.9.2rc2 + (82b8027be6e8f15603cea823e044069cd10c9c62)
start command:
uv run vllm serve /model/Qwen2-VL-2B-Instruct/ --limit-mm-per-prompt '{"image":4}' --allowed-local-media-path `my_image_paths`
1st time :
Total time: 33.14 sec
input token lens: 2794.60 (average) / 279460 (total) ---
output token lens: 18.24 (average) / 1824 (total) ---
QPS: 3.02 requests/sec, TPS: 55.03 tokens/sec
2nd time (with vit cache):
Total time: 6.10 sec
input token lens: 2794.60 (average) / 279460 (total) ---
output token lens: 18.80 (average) / 1880 (total) ---
QPS: 16.38 requests/sec, TPS: 307.96 tokens/sec
# Qwen2.5-VL-3B-Instruct
dashinfer_vlm_serve --model /cfs_cloud_code/asherszhang/model/Qwen2.5-VL-3B-Instruct/ --host 127.0.0.1 --vision_engine transformers
dashinfer (ViT uses transformers)
1st time :
Total time: 93.34 sec
input token lens: 2783.60 (average) / 278360 (total) ---
output token lens: 19.42 (average) / 1942 (total) ---
QPS: 1.07 requests/sec, TPS: 20.81 tokens/sec
2nd time (with vit cache):
Total time: 16.74 sec
input token lens: 2783.60 (average) / 278360 (total) ---
output token lens: 14.86 (average) / 1486 (total) ---
QPS: 5.97 requests/sec, TPS: 88.75 tokens/sec
vllm start with mm cache
1st time:
Total time: 36.99 sec
input token lens: 2794.60 (average) / 279460 (total) ---
output token lens: 29.55 (average) / 2955 (total) ---
QPS: 2.70 requests/sec, TPS: 79.89 tokens/sec
2nd time (with vit cache)
Total time: 5.64 sec
input token lens: 2794.60 (average) / 279460 (total) ---
output token lens: 28.81 (average) / 2881 (total) ---
QPS: 17.73 requests/sec, TPS: 510.89 tokens/sec
510 tokens/s generation must involve some kind of cache: it would require about 3 TB/s of memory bandwidth, which already exceeds the H20's hardware limit.
So I reran the test with the cache disabled.
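For reference, the ~3 TB/s figure can be reproduced with a back-of-envelope calculation (assuming ~3B bf16 parameters read once per generated token, ignoring KV-cache traffic):

```python
params = 3e9           # approximate parameter count of Qwen2.5-VL-3B
bytes_per_param = 2    # bf16 weights
tokens_per_sec = 510   # observed generation TPS
bandwidth_tb_s = params * bytes_per_param * tokens_per_sec / 1e12
print(f"{bandwidth_tb_s:.2f} TB/s")  # 3.06 TB/s needed just for weight reads
```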
vllm start without mm cache
cmd:
uv run vllm serve /model/Qwen2.5-VL-3B-Instruct/ --limit-mm-per-prompt '{"image":4}' --allowed-local-media-path `path-to-data` --no-enable-prefix-caching --disable-mm-preprocessor-cache
1st time:
Total time: 37.25 sec
input token lens: 2794.60 (average) / 279460 (total) ---
output token lens: 29.09 (average) / 2909 (total) ---
QPS: 2.68 requests/sec, TPS: 78.09 tokens/sec
2nd time:
Total time: 31.88 sec
input token lens: 2794.60 (average) / 279460 (total) ---
output token lens: 29.03 (average) / 2903 (total) ---
QPS: 3.14 requests/sec, TPS: 91.05 tokens/sec
After disabling the cache, the numbers go back to normal.
2.70 QPS vs 2.68 QPS on the first-time request: there is not much difference with the current vLLM version.
@coder4nlp, for your test, I think the issue is that you're using the same prompt + same image, which causes a lot of caching; in particular, the mm cache takes effect.
What's the concurrency in your test?
@kzjeef. Thank you for your test results. Please check the previous reply. Concurrency is 10. I have already provided the complete test code.
@kzjeef. Could you please tell me how to set up "with vit cache"?
When I make a request and it's a concurrent operation, dashinfer takes 10 seconds.
It's the cache for the image embeddings; it saves a lot of compute for the images.
In vLLM, you can disable it with --disable-mm-preprocessor-cache.
It affects the test results a lot if you test with the same image.
In my test I used 100 images, and it still had a large effect.
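To illustrate why repeated images skew a benchmark so much, here is a minimal sketch of such a preprocessor cache keyed by image bytes (hypothetical code, not vLLM's or DashInfer's actual implementation): on a hit, the expensive ViT forward pass is skipped entirely.

```python
import hashlib

_embedding_cache = {}

def embed_image(image_bytes, vit_forward):
    """Return the image embedding, running the ViT only on a cache miss."""
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = vit_forward(image_bytes)  # expensive forward pass
    return _embedding_cache[key]

# With identical images, only the first request pays the ViT cost:
calls = []
fake_vit = lambda b: (calls.append(b), len(b))[1]
embed_image(b"same-image", fake_vit)
embed_image(b"same-image", fake_vit)
print(len(calls))  # 1 -- the second request was served from cache
```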
@kzjeef . Without considering multiple requests, using a single sample, dashinfer was also extremely slow in my tests. I have no idea what the reason is.
That's not normal. What's your OS and CUDA version?
can you provide a full startup log for dashinfer_vlm ? maybe some error in start up log.
[StopRequest] Request ID: 00000000000000000000000000000192, Context time(ms): 46, Generate time(ms): 8121, Context Length: 383, Generated Length: 147, Context TPS: 8308.03, Generate TPS: 18.101, Prefix Cache Len: 0
Here is log from my side:
[StopRequest] Request ID: 00000000000000000000000000000230, Context time(ms): 142, Generate time(ms): 184, Context Length: 2912, Generated Length: 6, Context TPS: 20492.6, Generate TPS: 32.591, Prefix Cache Len: 0
[StopRequest] Request ID: 00000000000000000000000000000228, Context time(ms): 153, Generate time(ms): 847, Context Length: 2980, Generated Length: 11, Context TPS: 19464.4, Generate TPS: 12.9855, Prefix Cache Len: 0
I think I found some differences between your deployment and mine:
This value's default changed to 128 in a later release, but your release seems to still use 32, which has some effect on generation speed.
But I think the biggest factor is that prefill may be lagging the generation.
Can you capture the log with this variable set?
You can enable the decoder log with this env var:
ALLSPARK_TIME_LOG=1
Here is log in my side:
One factor that may matter: when a prefill is running, the decoder becomes slower, e.g.:
Decoder Loop Time [TPOT] (ms): 128 running: 5 alloc: 0.206 forward_time: 15.792 reshape: 0.118 gen_frd: 112.087 post_gen: 0.004
But if there is no prefill:
Decoder Loop Time [TPOT] (ms): 13 running: 5 alloc: 0.173 forward_time: 10.565 reshape: 0.125 gen_frd: 2.91 post_gen: 0.001
If running with a single request, the time will be:
I20250724 17:56:54.189312 2678454 model.cpp:1411] Decoder Loop Time [TPOT] (ms): 4 running: 1 alloc: 0.104 forward_time: 3.585 reshape: 0.027 gen_frd: 0.859 post_gen: 0
I20250724 17:56:54.194442 2678454 model.cpp:974] Stop request with request id: 00000000000000000000000000000232
I20250724 17:56:54.194453 2678454 model.cpp:999] [StopRequest] Request ID: 00000000000000000000000000000232, Context time(ms): 14, Generate time(ms): 238, Context Length: 382, Generated Length: 51, Context TPS: 27092.2, Generate TPS: 214.196, Prefix Cache Len: 0
The decoder only takes 4 ms, which supports my guess that the major factor is prefill running concurrently with the decoder.
@kzjeef When I updated the version of dashinfer from 2.0.0 to 2.1.0, the running time of a single request decreased from 10 seconds to 1 second. However, vllm only took 0.46 seconds.
in vllm Prefix cache hit rate: 99.5%
[loggers.py:111] Engine 000: Avg prompt throughput: 2166.9 tokens/s, Avg generation throughput: 470.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 99.5%
@kzjeef This is my test result. Based on the experimental results, dashinfer still has a gap in QPS compared to vllm. Could you please tell me what I should do to make the QPS of dashinfer higher than that of vllm? qwen2-vl-2b-instruct QPS:
| Framework (QPS at concurrency) | 1 | 2 | 4 | 8 | 10 | 20 | 40 |
|---|---|---|---|---|---|---|---|
| dashinfer | 1.28 | 2.49 | 4.92 | 8.75 | 10.16 | 12.37 | 22.31 |
| vllm | 2.58 | 4.40 | 7.54 | 12.28 | 14.11 | 20.94 | 24.60 |
| vllm(--no-enable-prefix-caching --disable-mm-preprocessor-cache) | 2.45 | 4.27 | 7.29 | 11.76 | 13.81 | 18.78 | 21.75 |
Thanks for the test.
I found two methods:
- enable prefix cache, as is vLLM's default;
- dynamic FP8 per-tensor quant for the LLM model.
Here are two results:
Test on Qwen2-VL-2B-Instruct:
start command (ViT use TRT + FP8 dynamic quant)
dashinfer_vlm_serve --model /cfs_cloud_code/asherszhang/model/Qwen2-VL-2B-Instruct/ --host 127.0.0.1 --vision_engine tensorrt --quant-type fp8
NOTE: You may need to check the accuracy after FP8 quantization.
result
1st time :
Total time: 29.88 sec
input token lens: 2783.60 (average) / 278360 (total) ---
output token lens: 11.52 (average) / 1152 (total) ---
QPS: 3.35 requests/sec, TPS: 38.55 tokens/sec
2nd time (with vit cache):
Total time: 6.84 sec
input token lens: 2783.60 (average) / 278360 (total) ---
output token lens: 9.30 (average) / 930 (total) ---
QPS: 14.61 requests/sec, TPS: 135.88 tokens/sec
start command (ViT use TRT + FP8 dynamic quant + prefix cache)
dashinfer_vlm_serve --model /cfs_cloud_code/asherszhang/model/Qwen2-VL-2B-Instruct/ --host 127.0.0.1 --vision_engine tensorrt --enable-prefix-cache --quant-type fp8
result
1st time :
Total time: 29.53 sec
input token lens: 2783.60 (average) / 278360 (total) ---
output token lens: 10.95 (average) / 1095 (total) ---
QPS: 3.39 requests/sec, TPS: 37.09 tokens/sec
2nd time (with vit cache + prefix cache):
Total time: 2.81 sec
input token lens: 2783.60 (average) / 278360 (total) ---
output token lens: 11.25 (average) / 1125 (total) ---
QPS: 35.64 requests/sec, TPS: 400.96 tokens/sec
@kzjeef Strangely enough, dashinfer seems to be unstable. Dashinfer is performing much more slowly today than yesterday. I have no idea what the reason is.
Any change in these items?
- Hardware utility (maybe shared with other guys)
- Input Data
- Free GPU VRAM size
DashInfer is written mostly in C++; it doesn't have sources of run-to-run randomness like GC, etc.
@kzjeef When using Qwen2.5-VL-3B-Instruct with --enable-prefix-cache, an error occurs.
File "dash-infer/multimodal/dashinfer_vlm/api_server/server.py", line 684, in main
init()
File "dashinfer_vlm/api_server/server.py", line 143, in init
vl_engine = QwenVl(
File "dash-infer/multimodal/dashinfer_vlm/vl_inference/runtime/qwen_vl.py", line 231, in __init__
self.as_worker = HieAllsparkWorker(as_config)
File "dash-infer/multimodal/dashinfer_vlm/vl_inference/runtime/hie_allspark_worker.py", line 17, in __init__
self.model = AllSparkM6Model(as_model_config)
File "dash-infer/multimodal/dashinfer_vlm/vl_inference/utils/hie_allspark/model_hie_allspark.py", line 63, in __init__
assert status == AsStatus.ALLSPARK_SUCCESS
AssertionError