
The inference time of qwen2.5-vl is very slow.

Open coder4nlp opened this issue 7 months ago • 36 comments

Dashinfer is 10 times slower than VLLM. How can this issue be resolved?

coder4nlp avatar Jul 23 '25 05:07 coder4nlp

What version are you using?

And are you aware of this issue on the qwen2-vl models as well?

kzjeef avatar Jul 23 '25 07:07 kzjeef

Also, can you provide your test command?

kzjeef avatar Jul 23 '25 07:07 kzjeef

@kzjeef
dashinfer==2.0.0rc3, dashinfer-vlm==2.3.0, transformers==4.51.3. When using qwen2-vl, startup failed:

dashinfer_vlm_serve --model /models/Qwen/Qwen2-VL-2B --host 127.0.0.1
Start converting ONNX model!
Loading safetensors checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.12it/s]
DFN_vit.py:459: TracerWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
  for t, h, w in grid_thw:
torch.Size([1, 3])
batch:  tensor(1)
Export to ONNX file successfully! The ONNX file stays in /root/.cache/as_model/Qwen2-VL-2B/model.onnx
Start converting TRT engine!
[07/23/2025-14:35:04] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 853, GPU 8385 (MiB)
[07/23/2025-14:35:12] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +2973, GPU +752, now: CPU 3903, GPU 9137 (MiB)
[07/23/2025-14:35:13] [TRT] [W] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[07/23/2025-14:35:13] [TRT] [W] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
Succeeded parsing /root/.cache/as_model/Qwen2-VL-2B/model.onnx
[07/23/2025-14:35:13] [TRT] [W] /vision_model/Reshape_2: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/23/2025-14:35:13] [TRT] [W] /vision_model/Reshape_2: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 2 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/23/2025-14:35:13] [TRT] [W] /vision_model/Reshape_4: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/23/2025-14:35:13] [TRT] [W] /vision_model/Reshape_4: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 2 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/23/2025-14:35:15] [TRT] [I] Graph optimization time: 1.81676 seconds.
[07/23/2025-14:35:15] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[07/23/2025-14:35:15] [TRT] [W] Was not able to infer a kOPT value for tensor /vision_model/Squeeze_output_0. Using one(s).
[07/23/2025-14:35:15] [TRT] [W] Was not able to infer a kOPT value for tensor /vision_model/ReduceMax_output_0. Using one(s).
[07/23/2025-14:35:18] [TRT] [I] Detected 2 inputs and 1 output network tensors.
[07/23/2025-14:36:17] [TRT] [E] 1: autotuning: CUDA error 2 allocating 687194768222-byte buffer: out of memory
[07/23/2025-14:36:17] [TRT] [E] 1: [codeGenerator.cpp::compileGraph::895] Error Code 1: Myelin (autotuning: CUDA error 2 allocating 687194768222-byte buffer: out of memory)
Traceback (most recent call last):
  File "/usr/local/bin/dashinfer_vlm_serve", line 33, in <module>
    sys.exit(load_entry_point('dashinfer-vlm', 'console_scripts', 'dashinfer_vlm_serve')())
  File "dashinfer_vlm/api_server/server.py", line 685, in main
    init()
  File "dashinfer_vlm/api_server/server.py", line 94, in init
    model_loader.load_model(direct_load=False, load_format="auto")
  File "dashinfer_vlm/vl_inference/utils/model_loader.py", line 165, in serialize
    onnx_trt_obj.generate_trt_engine(onnxFile, self.vision_model_path)
  File "dashinfer_vlm/vl_inference/utils/trt/onnx_to_plan.py", line 195, in generate_trt_engine
    raise RuntimeError("Failed building %s" % planFile)
RuntimeError: Failed building /root/.cache/as_model/Qwen2-VL-2B/model.plan
I20250723 14:36:18.483180 222505 as_engine.cpp:330] ~AsEngine called
I20250723 14:36:18.483215 222505 weight_manager.cpp:721] ~WeightManager
I20250723 14:36:18.483223 222505 as_engine.cpp:348] ~AsEngineImpl finished.
I20250723 14:36:18.483337 222627 thread_pool_with_id.h:91] dummy message for wake up.
I20250723 14:36:18.483361 222627 thread_pool_with_id.h:45] Thread Pool with id: 0 Exit!!!

coder4nlp avatar Jul 23 '25 07:07 coder4nlp

Hi @kzjeef, thanks for your response. When using qwen2.5-vl, the service starts normally, but it is extremely slow. Here are my startup command and test code. Also, when the input contains multiple images, an error occurs.

dashinfer_vlm_serve --model /models/Qwen/Qwen2.5-VL-3B-Instruct  --port 8000 --host 127.0.0.1 --vision_engine transformers
import concurrent.futures
import time

from openai import OpenAI

# Point the client at the dashinfer_vlm_serve endpoint started above.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the content of the picture"},
        {
            "type": "image_url",
            "image_url": {
                "url": "https://farm4.staticflickr.com/3075/3168662394_7d7103de7d_z_d.jpg",
            },
        },
        # {
        #     "type": "image_url",
        #     "image_url": {
        #         "url": "https://farm2.staticflickr.com/1533/26541536141_41abe98db3_z_d.jpg",
        #     },
        # },
    ],
}]

def send_request():
    start_time = time.time()
    client.chat.completions.create(
        model="qwen/Qwen2.5-VL-3B-Instruct",
        messages=messages,
        stream=False,
        max_completion_tokens=1024,
        temperature=0.1,
    )
    return time.time() - start_time

def benchmark(num_requests, num_workers):
    latencies = []
    start_time = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(send_request) for _ in range(num_requests)]
        for future in concurrent.futures.as_completed(futures):
            latencies.append(future.result())
    total_time = time.time() - start_time
    qps = num_requests / total_time
    average_latency = sum(latencies) / len(latencies)
    print(f"Total Time: {total_time:.2f} seconds")
    print(f"QPS: {qps:.2f}")
    print(f"Average Latency: {average_latency:.2f} seconds")

if __name__ == "__main__":
    benchmark(num_requests=100, num_workers=10)

coder4nlp avatar Jul 23 '25 08:07 coder4nlp

@kzjeef. I have fixed the qwen2-vl issue, but the inference time is still very slow.

coder4nlp avatar Jul 23 '25 09:07 coder4nlp

@kzjeef. I have fixed the qwen2-vl issue, but the inference time is still very slow.

what's vllm's version?

kzjeef avatar Jul 23 '25 09:07 kzjeef

@kzjeef. I have fixed the qwen2-vl issue, but the inference time is still very slow.

what's your --vision_engine parameter in qwen2-vl test?

kzjeef avatar Jul 23 '25 09:07 kzjeef

@kzjeef. I have fixed the qwen2-vl issue, but the inference time is still very slow.

what's your --vision_engine parameter in qwen2-vl test?

I didn't pass any --vision_engine parameter (used the default). vllm==0.8.5.post1

coder4nlp avatar Jul 23 '25 10:07 coder4nlp

@kzjeef Would you be able to assist in resolving these matters? Thanks

coder4nlp avatar Jul 24 '25 02:07 coder4nlp

Sure, I will test this locally.

What model size did you use in your test, and what's the GPU type?

kzjeef avatar Jul 24 '25 03:07 kzjeef

Qwen/Qwen2-VL-2B

Sure, I will test this locally.

What model size did you use in your test, and what's the GPU type?

Hello, the models I used are Qwen/Qwen2-VL-2B and Qwen/Qwen2.5-VL-3B-Instruct, and the GPU type is H100.

dashinfer_vlm_serve --model /models/Qwen/Qwen2.5-VL-3B-Instruct  --port 8000 --host 127.0.0.1 --vision_engine transformers
dashinfer_vlm_serve --model /models/Qwen/Qwen2-VL-2B --host 127.0.0.1

coder4nlp avatar Jul 24 '25 03:07 coder4nlp

Hi @coder4nlp, here is my test result.

Test Env:

Hardware:

H20 single card

DashInfer

the latest source code.

Dashinfer_vlm

from latest source code.

Benchmark data:

Basically, download the data listed in dash-infer/multimodal/README.md and put it in the tests/data folder:

wget https://huggingface.co/datasets/OpenGVLab/InternVL-Chat-V1-2-SFT-Data/resolve/main/opensource/docvqa_train_10k.jsonl
wget https://huggingface.co/datasets/OpenGVLab/InternVL-Chat-V1-2-SFT-Data/resolve/main/data/share_textvqa.zip
unzip share_textvqa.zip

Benchmark command:

After starting the server, run:

python tests/benchmark_openai_api.py --prompt-file tests/data/docvqa_train_10k.jsonl --image-folder tests/data/share_textvqa/images/ --req-nums 100 \
        --batch-size 32 \
        --image-nums-mean 3 \
        --image-nums-range 1  \
        --response-mean 120 \
        --response-len-range 64 

This runs a single-image test with batch size 32 (i.e., concurrency 32).

Server command:

Run the server under the dash-infer/multimodal folder.

Because the data is in local files, the related paths need to be accessible. My file layout, for your reference: `dash-infer/multimodal# tree -L 3` (directory-tree screenshot omitted)

# Qwen2-VL-2B-Instruct

Start command (ViT uses transformers):

dashinfer_vlm_serve --model /model/Qwen2-VL-2B-Instruct/ --host 127.0.0.1 --vision_engine transformers

Result:

1st time :

Total time: 38.10 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 12.78 (average) / 1278 (total) --- 
QPS:  2.62 requests/sec, TPS: 33.54 tokens/sec

2nd time (with vit cache):

Total time: 8.72 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 10.72 (average) / 1072 (total) --- 
QPS:  11.47 requests/sec, TPS: 122.97 tokens/sec

Start command (ViT uses TRT):

dashinfer_vlm_serve --model /model/Qwen2-VL-2B-Instruct/ --host 127.0.0.1 --vision_engine tensorrt

Result:

1st time:

Total time: 32.99 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 11.33 (average) / 1133 (total) --- 
QPS:  3.03 requests/sec, TPS: 34.35 tokens/sec

2nd time (with vit cache):

Total time: 8.76 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 9.37 (average) / 937 (total) --- 
QPS:  11.42 requests/sec, TPS: 106.97 tokens/sec

Start command (ViT uses TRT + FP8 dynamic quant):

dashinfer_vlm_serve --model /cfs_cloud_code/asherszhang/model/Qwen2-VL-2B-Instruct/ --host 127.0.0.1 --vision_engine tensorrt --quant-type fp8

Result:

1st time:

Total time: 29.88 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 11.52 (average) / 1152 (total) --- 
QPS:  3.35 requests/sec, TPS: 38.55 tokens/sec

2nd time (with vit cache):

Total time: 6.84 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 9.30 (average) / 930 (total) --- 
QPS:  14.61 requests/sec, TPS: 135.88 tokens/sec

Start command (ViT uses TRT + FP8 dynamic quant + prefix cache):

dashinfer_vlm_serve --model /cfs_cloud_code/asherszhang/model/Qwen2-VL-2B-Instruct/ --host 127.0.0.1 --vision_engine tensorrt --enable-prefix-cache --quant-type fp8

Result:

1st time:

Total time: 29.53 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 10.95 (average) / 1095 (total) --- 
QPS:  3.39 requests/sec, TPS: 37.09 tokens/sec

2nd time (with vit cache):

Total time: 2.81 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 11.25 (average) / 1125 (total) --- 
QPS:  35.64 requests/sec, TPS: 400.96 tokens/sec

Vllm

version: v0.9.2rc2 + commit 82b8027be6e8f15603cea823e044069cd10c9c62

start command:

uv run vllm serve /model/Qwen2-VL-2B-Instruct/ --limit-mm-per-prompt '{"image":4}' --allowed-local-media-path `my_image_paths`

1st time :

Total time: 33.14 sec
input token lens: 2794.60 (average) / 279460 (total) --- 
output token lens: 18.24 (average) / 1824 (total) --- 
QPS:  3.02 requests/sec, TPS: 55.03 tokens/sec

2nd time (with vit cache):

Total time: 6.10 sec
input token lens: 2794.60 (average) / 279460 (total) --- 
output token lens: 18.80 (average) / 1880 (total) --- 
QPS:  16.38 requests/sec, TPS: 307.96 tokens/sec

# Qwen2.5-VL-3B-Instruct

dashinfer_vlm_serve --model /cfs_cloud_code/asherszhang/model/Qwen2.5-VL-3B-Instruct/  --host 127.0.0.1 --vision_engine transformers

dashinfer (ViT uses transformers)

1st time :

Total time: 93.34 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 19.42 (average) / 1942 (total) --- 
QPS:  1.07 requests/sec, TPS: 20.81 tokens/sec

2nd time (with vit cache):

Total time: 16.74 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 14.86 (average) / 1486 (total) --- 
QPS:  5.97 requests/sec, TPS: 88.75 tokens/sec

vllm start with mm cache

1st time:

Total time: 36.99 sec
input token lens: 2794.60 (average) / 279460 (total) --- 
output token lens: 29.55 (average) / 2955 (total) --- 
QPS:  2.70 requests/sec, TPS: 79.89 tokens/sec

2nd time (with vit cache)

Total time: 5.64 sec
input token lens: 2794.60 (average) / 279460 (total) --- 
output token lens: 28.81 (average) / 2881 (total) --- 
QPS:  17.73 requests/sec, TPS: 510.89 tokens/sec

510 tokens/sec generation must involve some kind of cache: it would require ~3 TB/s of memory bandwidth, which is already greater than the H20's hardware limit.
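
As a rough sanity check of that estimate (my arithmetic, assuming a 3B-parameter model in bf16; not from the thread): decode has to stream roughly the whole weight set once per generated token, so required bandwidth ≈ tokens/sec × model bytes.

```python
def required_bandwidth_gbs(tokens_per_sec: float, n_params: float,
                           bytes_per_param: int = 2) -> float:
    """Back-of-envelope: each decoded token reads ~all weights once."""
    return tokens_per_sec * n_params * bytes_per_param / 1e9

# 510 tok/s on a 3B bf16 model needs about 3 TB/s of weight traffic alone,
# which is why a number that high points to caching rather than real decode.
print(required_bandwidth_gbs(510, 3e9))  # 3060.0 GB/s
```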

So I re-ran the test with caching disabled.

vllm start without mm cache

cmd:

uv run vllm serve /model/Qwen2.5-VL-3B-Instruct/ --limit-mm-per-prompt '{"image":4}' --allowed-local-media-path `path-to-data` --no-enable-prefix-caching --disable-mm-preprocessor-cache

1st time:

Total time: 37.25 sec
input token lens: 2794.60 (average) / 279460 (total) --- 
output token lens: 29.09 (average) / 2909 (total) --- 
QPS:  2.68 requests/sec, TPS: 78.09 tokens/sec

2nd time:

Total time: 31.88 sec
input token lens: 2794.60 (average) / 279460 (total) --- 
output token lens: 29.03 (average) / 2903 (total) --- 
QPS:  3.14 requests/sec, TPS: 91.05 tokens/sec

After disabling the cache, the numbers return to normal.

2.70 QPS vs 2.68 QPS for the 1st-time request: there is not much difference with the current version of vllm.

@coder4nlp for your test, I think it's because you're using the same prompt + the same image, which triggers a lot of caching; the mm cache in particular takes effect.
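
One way to check how much caching inflates the client-side numbers is to make every request unique. A sketch (my suggestion, adapting the benchmark script posted earlier; `make_unique_messages` is a hypothetical helper, not part of dashinfer or vllm):

```python
import copy
import uuid

def make_unique_messages(base_messages):
    """Copy the messages and append a per-request nonce to the text part,
    so prefix caching cannot reuse KV state from an earlier request."""
    msgs = copy.deepcopy(base_messages)
    for part in msgs[0]["content"]:
        if part.get("type") == "text":
            part["text"] += f" [req {uuid.uuid4().hex}]"
    return msgs
```

Note this only defeats text-prefix caching; to defeat the image-embedding cache, each request would also need a distinct image.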

kzjeef avatar Jul 24 '25 06:07 kzjeef

Qwen/Qwen2-VL-2B

Sure, I will test this locally. What model size did you use in your test, and what's the GPU type?

Hello,the models I used are Qwen/Qwen2-VL-2B and Qwen/Qwen2.5-VL-3B-Instruct, and the GPU type is H100.

dashinfer_vlm_serve --model /models/Qwen/Qwen2.5-VL-3B-Instruct  --port 8000 --host 127.0.0.1 --vision_engine transformers
dashinfer_vlm_serve --model /models/Qwen/Qwen2-VL-2B --host 127.0.0.1

What's the concurrency in your test?

kzjeef avatar Jul 24 '25 06:07 kzjeef

@kzjeef. Thank you for your test results. Please check the previous reply. Concurrency is 10. I have already provided the complete test code.

coder4nlp avatar Jul 24 '25 06:07 coder4nlp

@kzjeef. Could you please tell me how to set up "with vit cache"?

coder4nlp avatar Jul 24 '25 08:07 coder4nlp

When I send concurrent requests, dashinfer takes 10 seconds.

coder4nlp avatar Jul 24 '25 08:07 coder4nlp

@kzjeef. Could you please tell me how to set up "with vit cache"?

It's a cache for the image embeddings; it saves a lot of compute for each image. In vllm, you can disable it with --disable-mm-preprocessor-cache.

It affects the test result a lot if you're using the same image in every request.

In my test I used 100 images, and it still had a large effect.
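
Conceptually, the mm-preprocessor cache is just a map from image bytes to the computed ViT embedding. This toy sketch (not the actual vllm or dashinfer implementation) shows why a benchmark that repeats one image becomes almost free:

```python
import hashlib

class ImageEmbeddingCache:
    """Toy cache of ViT outputs keyed by a hash of the raw image bytes."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn  # the expensive ViT forward pass
        self._cache = {}
        self.hits = 0

    def get(self, image_bytes: bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.encode_fn(image_bytes)
        else:
            self.hits += 1
        return self._cache[key]

# 100 requests with the same image run the encoder once and hit the cache 99 times.
cache = ImageEmbeddingCache(encode_fn=lambda b: [len(b)])  # stand-in encoder
for _ in range(100):
    cache.get(b"same-image-bytes")
print(cache.hits)  # 99
```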

kzjeef avatar Jul 24 '25 08:07 kzjeef

@kzjeef . Without considering multiple requests, using a single sample, dashinfer was also extremely slow in my tests. I have no idea what the reason is.

coder4nlp avatar Jul 24 '25 08:07 coder4nlp

@kzjeef . Without considering multiple requests, using a single sample, dashinfer was also extremely slow in my tests. I have no idea what the reason is.

That's not normal. What's your OS and CUDA version?

kzjeef avatar Jul 24 '25 08:07 kzjeef

When I send concurrent requests, dashinfer takes 10 seconds.

Can you provide a full startup log for dashinfer_vlm? Maybe there's an error in the startup log.

kzjeef avatar Jul 24 '25 08:07 kzjeef

@kzjeef As the log is too long, I have placed it in the attachment.

server.txt

coder4nlp avatar Jul 24 '25 09:07 coder4nlp

 [StopRequest] Request ID: 00000000000000000000000000000192, Context time(ms): 46, Generate time(ms): 8121, Context Length: 383, Generated Length: 147, Context TPS: 8308.03, Generate TPS: 18.101, Prefix Cache Len: 0

coder4nlp avatar Jul 24 '25 09:07 coder4nlp

 [StopRequest] Request ID: 00000000000000000000000000000192, Context time(ms): 46, Generate time(ms): 8121, Context Length: 383, Generated Length: 147, Context TPS: 8308.03, Generate TPS: 18.101, Prefix Cache Len: 0

Here is the log from my side:

[StopRequest] Request ID: 00000000000000000000000000000230, Context time(ms): 142, Generate time(ms): 184, Context Length: 2912, Generated Length: 6, Context TPS: 20492.6, Generate TPS: 32.591, Prefix Cache Len: 0 
 [StopRequest] Request ID: 00000000000000000000000000000228, Context time(ms): 153, Generate time(ms): 847, Context Length: 2980, Generated Length: 11, Context TPS: 19464.4, Generate TPS: 12.9855, Prefix Cache Len: 0

I think I found a difference between your deployment and mine:

(configuration screenshot)

This value's default changed to 128 in a later release, but your release seems to still use 32, which has some effect on generation speed.

But I think the biggest factor is that prefill may be lagging the generation.

Can you capture the log with this variable set?

You can enable the decoder log with this env var: ALLSPARK_TIME_LOG=1

Here is the log on my side.

One factor that may matter: when there is a prefill running, the decoder becomes slower, like:

Decoder Loop Time [TPOT] (ms): 128 running: 5 alloc: 0.206 forward_time: 15.792 reshape: 0.118 gen_frd: 112.087 post_gen: 0.004

but if there is no prefill:

Decoder Loop Time [TPOT] (ms): 13 running: 5 alloc: 0.173 forward_time: 10.565 reshape: 0.125 gen_frd: 2.91 post_gen: 0.001

If running with a single request, the timing is:

I20250724 17:56:54.189312 2678454 model.cpp:1411] Decoder Loop Time [TPOT] (ms): 4 running: 1 alloc: 0.104 forward_time: 3.585 reshape: 0.027 gen_frd: 0.859 post_gen: 0
I20250724 17:56:54.194442 2678454 model.cpp:974] Stop request with request id: 00000000000000000000000000000232                                                                                                                                                            I20250724 17:56:54.194453 2678454 model.cpp:999] [StopRequest] Request ID: 00000000000000000000000000000232, Context time(ms): 14, Generate time(ms): 238, Context Length: 382, Generated Length: 51, Context TPS: 27092.2, Generate TPS: 214.196, Prefix Cache Len: 0

The decoder only takes 4 ms here,

which supports my guess that the main slowdown is prefill running concurrently with decode.
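
To quantify that interference, the ALLSPARK_TIME_LOG lines can be parsed and grouped by concurrency. A sketch (the regex matches the log format shown in this thread; the grouping is my own aid, not a dashinfer tool):

```python
import re

TPOT_RE = re.compile(r"Decoder Loop Time \[TPOT\] \(ms\): (\d+) running: (\d+)")

def tpot_by_concurrency(log_lines):
    """Average TPOT (ms), grouped by the number of running requests."""
    buckets = {}
    for line in log_lines:
        m = TPOT_RE.search(line)
        if m:
            tpot, running = int(m.group(1)), int(m.group(2))
            buckets.setdefault(running, []).append(tpot)
    return {r: sum(v) / len(v) for r, v in buckets.items()}

# The three sample lines quoted in this thread:
logs = [
    "Decoder Loop Time [TPOT] (ms): 128 running: 5 alloc: 0.206 forward_time: 15.792",
    "Decoder Loop Time [TPOT] (ms): 13 running: 5 alloc: 0.173 forward_time: 10.565",
    "Decoder Loop Time [TPOT] (ms): 4 running: 1 alloc: 0.104 forward_time: 3.585",
]
print(tpot_by_concurrency(logs))  # {5: 70.5, 1: 4.0}
```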

kzjeef avatar Jul 24 '25 09:07 kzjeef

@kzjeef When I updated the version of dashinfer from 2.0.0 to 2.1.0, the running time of a single request decreased from 10 seconds to 1 second. However, vllm only took 0.46 seconds.

coder4nlp avatar Jul 24 '25 11:07 coder4nlp

In vllm, the prefix cache hit rate is 99.5%:

[loggers.py:111] Engine 000: Avg prompt throughput: 2166.9 tokens/s, Avg generation throughput: 470.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 99.5%

coder4nlp avatar Jul 24 '25 12:07 coder4nlp

@kzjeef This is my test result. Based on the experimental results, dashinfer still has a QPS gap compared to vllm. Could you please tell me what I should do to make dashinfer's QPS higher than vllm's? qwen2-vl-2b-instruct QPS (by concurrency):

| Framework | 1 | 2 | 4 | 8 | 10 | 20 | 40 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| dashinfer | 1.28 | 2.49 | 4.92 | 8.75 | 10.16 | 12.37 | 22.31 |
| vllm | 2.58 | 4.40 | 7.54 | 12.28 | 14.11 | 20.94 | 24.60 |
| vllm (--no-enable-prefix-caching --disable-mm-preprocessor-cache) | 2.45 | 4.27 | 7.29 | 11.76 | 13.81 | 18.78 | 21.75 |
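
As a reading aid (computed from the numbers above; not additional measurements), the gap can be expressed as a per-concurrency QPS ratio:

```python
concurrency = [1, 2, 4, 8, 10, 20, 40]
dashinfer_qps = [1.28, 2.49, 4.92, 8.75, 10.16, 12.37, 22.31]
vllm_qps = [2.58, 4.40, 7.54, 12.28, 14.11, 20.94, 24.60]

for c, d, v in zip(concurrency, dashinfer_qps, vllm_qps):
    print(f"concurrency {c:>2}: dashinfer/vllm QPS ratio = {d / v:.2f}")
```

The ratio is smallest at concurrency 1 and approaches parity at concurrency 40.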

coder4nlp avatar Jul 24 '25 12:07 coder4nlp

@kzjeef This is my test result. Based on the experimental results, dashinfer still has a QPS gap compared to vllm. Could you please tell me what I should do to make dashinfer's QPS higher than vllm's? qwen2-vl-2b-instruct QPS (by concurrency):

| Framework | 1 | 2 | 4 | 8 | 10 | 20 | 40 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| dashinfer | 1.28 | 2.49 | 4.92 | 8.75 | 10.16 | 12.37 | 22.31 |
| vllm | 2.58 | 4.40 | 7.54 | 12.28 | 14.11 | 20.94 | 24.60 |
| vllm (--no-enable-prefix-caching --disable-mm-preprocessor-cache) | 2.45 | 4.27 | 7.29 | 11.76 | 13.81 | 18.78 | 21.75 |

Thanks for the test.

I found two methods:

  1. enable prefix cache, which is vllm's default option;
  2. dynamic FP8 per-tensor quantization for the LLM model.

Here are two results:

Test on Qwen2-VL-2B-Instruct:

Start command (ViT uses TRT + FP8 dynamic quant):

dashinfer_vlm_serve --model /cfs_cloud_code/asherszhang/model/Qwen2-VL-2B-Instruct/ --host 127.0.0.1 --vision_engine tensorrt --quant-type fp8

NOTE: You may need to check the accuracy after FP8 quantization.

Result:

1st time:

Total time: 29.88 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 11.52 (average) / 1152 (total) --- 
QPS:  3.35 requests/sec, TPS: 38.55 tokens/sec

2nd time (with vit cache):

Total time: 6.84 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 9.30 (average) / 930 (total) --- 
QPS:  14.61 requests/sec, TPS: 135.88 tokens/sec

Start command (ViT uses TRT + FP8 dynamic quant + prefix cache):

dashinfer_vlm_serve --model /cfs_cloud_code/asherszhang/model/Qwen2-VL-2B-Instruct/ --host 127.0.0.1 --vision_engine tensorrt --enable-prefix-cache --quant-type fp8

Result:

1st time:

Total time: 29.53 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 10.95 (average) / 1095 (total) --- 
QPS:  3.39 requests/sec, TPS: 37.09 tokens/sec

2nd time (with vit cache + prefix cache):

Total time: 2.81 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 11.25 (average) / 1125 (total) --- 
QPS:  35.64 requests/sec, TPS: 400.96 tokens/sec

kzjeef avatar Jul 24 '25 14:07 kzjeef

@kzjeef Strangely enough, dashinfer seems to be unstable. It is performing much more slowly today than yesterday. I have no idea what the reason is.

coder4nlp avatar Jul 25 '25 02:07 coder4nlp

@kzjeef Strangely enough, dashinfer seems to be unstable. It is performing much more slowly today than yesterday. I have no idea what the reason is.

Did any of these change?

  1. Hardware utilization (maybe shared with other users)
  2. Input data
  3. Free GPU VRAM size

dashinfer is written mostly in C++; it doesn't have factors that introduce randomness, like GC, etc.

kzjeef avatar Jul 25 '25 03:07 kzjeef

@kzjeef When using Qwen2.5-VL-3B-Instruct with --enable-prefix-cache, an error occurs.

 File "dash-infer/multimodal/dashinfer_vlm/api_server/server.py", line 684, in main
    init()
  File "dashinfer_vlm/api_server/server.py", line 143, in init
    vl_engine = QwenVl(
  File "dash-infer/multimodal/dashinfer_vlm/vl_inference/runtime/qwen_vl.py", line 231, in __init__
    self.as_worker = HieAllsparkWorker(as_config)
  File "dash-infer/multimodal/dashinfer_vlm/vl_inference/runtime/hie_allspark_worker.py", line 17, in __init__
    self.model = AllSparkM6Model(as_model_config)
  File "dash-infer/multimodal/dashinfer_vlm/vl_inference/utils/hie_allspark/model_hie_allspark.py", line 63, in __init__
    assert status == AsStatus.ALLSPARK_SUCCESS
AssertionError

coder4nlp avatar Jul 28 '25 03:07 coder4nlp