
TRT Support for Qwen2.5-VL Error

kzjeef opened this issue 6 months ago · 6 comments

Commit: 3945858be258c95656fdeabcaf56413b35dd368e

Test method: dashinfer_vlm_serve --model Qwen2.5-VL-3B-Instruct --host 127.0.0.1 --vision_engine tensorrt

Version:

transformers             4.54.0
torch                    2.7.1
torchvision              0.22.1
onnx                     1.18.0
tensorrt                 10.5.0
tensorrt-cu12            10.13.0.35
tensorrt-cu12-bindings   10.13.0.35
tensorrt-cu12-libs       10.13.0.35

Error log:

call setenv()                                                
AllSpark python package start init.                                                                                              
[Info] No Multi-NUMA support on CUDA Version.                                                                                                                                                                                                                     
[INFO   ]  args: Namespace(host='127.0.0.1', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_keys=None, ssl=False, model='/cfs_cloud_code/asherszhang/model/Qwen2.5-VL-3B-Instruct/', vision_engine='tensorrt', device='cuda', max_length=32000, max_batch=128, parallel_size=1, enable_prefix_cache=False, quant_type=None, dtype='bfloat16', min_pixels=3136, max_pixels=12845056)
defaultdict(None, {'host': '127.0.0.1', 'port': 8000, 'allow_credentials': False, 'allowed_origins': ['*'], 'allowed_methods': ['*'], 'allowed_headers': ['*'], 'api_keys': None, 'ssl': False, 'model': '/cfs_cloud_code/asherszhang/model/Qwen2.5-VL-3B-Instruct/', 'vision_engine': 'tensorrt', 'device': 'cuda', 'max_length': 32000, 'max_batch': 128, 'parallel_size': 1, 'enable_prefix_cache': False, 'quant_type': None, 'dtype': 'bfloat16', 'min_pixels': 3136, 'max_pixels': 12845056})
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  3.27it/s]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
model config:
{'vision_config': Qwen2_5_VLVisionConfig {
  "depth": 32,
  "fullatt_block_indexes": [
    7,
    15,
    23,
    31
  ],
  "hidden_act": "silu",
  "hidden_size": 1280,
  "in_channels": 3,
  "in_chans": 3,
  "initializer_range": 0.02,
  "intermediate_size": 3420,
  "model_type": "qwen2_5_vl",
  "num_heads": 16,
  "out_hidden_size": 2048,
  "patch_size": 14,
  "spatial_merge_size": 2,
  "spatial_patch_size": 14,
  "temporal_patch_size": 2,
  "tokens_per_second": 2,
  "transformers_version": "4.54.0",
  "window_size": 112
}
, 'text_config': Qwen2_5_VLTextConfig {
  "architectures": [
    "Qwen2_5_VLForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "image_token_id": null,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
  "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "max_position_embeddings": 128000,
  "max_window_layers": 70,
  "model_type": "qwen2_5_vl_text",
  "num_attention_heads": 16,
  "num_hidden_layers": 36,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "mrope_section": [
      16,
      24,
      24
    ],
    "rope_type": "default",
    "type": "default"
  },
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.54.0",
  "use_cache": true,
  "use_sliding_window": false,
  "video_token_id": null,
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652,
  "vision_token_id": 151654,
  "vocab_size": 151936
}
, 'image_token_id': 151655, 'video_token_id': 151656, 'return_dict': True, 'output_hidden_states': False, 'torchscript': False, 'torch_dtype': torch.bfloat16, '_output_attentions': False, 'pruned_heads': {}, 'tie_word_embeddings': True, 'chunk_size_feed_forward': 0, 'is_encoder_decoder': False, 'is_decoder': False, 'cross_attention_hidden_size': None, 'add_cross_attention': False, 'tie_encoder_decoder': False, 'architectures': ['Qwen2_5_VLForConditionalGeneration'], 'finetuning_task': None, 'id2label': {0: 'LABEL_0', 1: 'LABEL_1'}, 'label2id': {'LABEL_0': 0, 'LABEL_1': 1}, 'task_specific_params': None, 'problem_type': None, 'tokenizer_class': None, 'prefix': None, 'bos_token_id': 151643, 'pad_token_id': None, 'eos_token_id': 151645, 'sep_token_id': None, 'decoder_start_token_id': None, 'max_length': 20, 'min_length': 0, 'do_sample': False, 'early_stopping': False, 'num_beams': 1, 'num_beam_groups': 1, 'diversity_penalty': 0.0, 'temperature': 1.0, 'top_k': 50, 'top_p': 1.0, 'typical_p': 1.0, 'repetition_penalty': 1.0, 'length_penalty': 1.0, 'no_repeat_ngram_size': 0, 'encoder_no_repeat_ngram_size': 0, 'bad_words_ids': None, 'num_return_sequences': 1, 'output_scores': False, 'return_dict_in_generate': False, 'forced_bos_token_id': None, 'forced_eos_token_id': None, 'remove_invalid_values': False, 'exponential_decay_length_penalty': None, 'suppress_tokens': None, 'begin_suppress_tokens': None, '_name_or_path': '/cfs_cloud_code/asherszhang/model/Qwen2.5-VL-3B-Instruct/', '_commit_hash': None, '_attn_implementation_internal': None, 'transformers_version': '4.41.2', 'attention_dropout': 0.0, 'vision_start_token_id': 151652, 'vision_end_token_id': 151653, 'vision_token_id': 151654, 'hidden_act': 'silu', 'hidden_size': 2048, 'initializer_range': 0.02, 'intermediate_size': 11008, 'max_position_embeddings': 128000, 'max_window_layers': 70, 'model_type': 'Qwen_v20', 'num_attention_heads': 16, 'num_hidden_layers': 36, 'num_key_value_heads': 2, 'rms_norm_eps': 1e-06, 'rope_theta': 1000000.0, 'sliding_window': 32768, 'use_cache': True, 'use_sliding_window': False, 'rope_scaling': {'type': 'default', 'mrope_section': [16, 24, 24], 'rope_type': 'default'}, 'vocab_size': 151936, 'tf_legacy_loss': False, 'use_bfloat16': False, 'rotary_emb_base': 1000000.0, 'size_per_head': 128}
WARNING: Logging before InitGoogleLogging() is written to STDERR 
I20250729 23:12:53.070276 1730607 thread_pool_with_id.h:37] ThreadPoolWithID init with thread number: 1
I20250729 23:12:53.070394 1730607 thread_pool_with_id.h:37] ThreadPoolWithID init with thread number: 1
I20250729 23:12:53.070473 1730607 as_engine.cpp:107] AllSpark Init with Version: 2.4.0/(GitSha1:169754a8-dirty)
Qwen VL 2.5 start convert onnx.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Start converting ONNX model!
Loading safetensors checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  5.34it/s]
/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/visual_embedding/DFN_vit_2_5.py:374: TracerWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
  for t, h, w in grid_thw:
/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/visual_embedding/DFN_vit_2_5.py:416: TracerWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
  for grid_t, grid_h, grid_w in grid_thw:
/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/visual_embedding/DFN_vit_2_5.py:435: TracerWarning: Converting a tensor to a Python list might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  cu_window_seqlens.extend(cu_seqlens_tmp.tolist())
/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/visual_embedding/DFN_vit_2_5.py:436: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  window_index_id += (grid_t * llm_grid_h * llm_grid_w).item()
/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/visual_embedding/DFN_vit_2_5.py:445: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  cu_window_seqlens = torch.tensor(
Export to ONNX file successfully! The ONNX file stays in /root/.cache/as_model/model.onnx
Start converting TRT engine!
[07/29/2025-23:13:41] [TRT] [I] [MemUsageChange] Init CUDA: CPU -2, GPU +0, now: CPU 10806, GPU 9316 (MiB)
[07/29/2025-23:13:42] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU -1917, GPU +6, now: CPU 8687, GPU 9322 (MiB)
[07/29/2025-23:13:42] [TRT] [I] ----------------------------------------------------------------
[07/29/2025-23:13:42] [TRT] [I] Input filename:   /root/.cache/as_model/model.onnx
[07/29/2025-23:13:42] [TRT] [I] ONNX IR version:  0.0.8
[07/29/2025-23:13:42] [TRT] [I] Opset version:    17
[07/29/2025-23:13:42] [TRT] [I] Producer name:    pytorch
[07/29/2025-23:13:42] [TRT] [I] Producer version: 2.7.1
[07/29/2025-23:13:42] [TRT] [I] Domain:           
[07/29/2025-23:13:42] [TRT] [I] Model version:    0
[07/29/2025-23:13:42] [TRT] [I] Doc string:       
[07/29/2025-23:13:42] [TRT] [I] ----------------------------------------------------------------
[07/29/2025-23:13:42] [TRT] [W] ModelImporter.cpp:653: Make sure input grid_thw has Int64 binding.
Succeeded parsing /root/.cache/as_model/model.onnx
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_12: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_6: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_6: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 1 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_6: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 2 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_2: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_2: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 2 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_4: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_4: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 2 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_9: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_9: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 1 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_9: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 3 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_10: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_10: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 1 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] /vision_model/Reshape_14: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[07/29/2025-23:13:43] [TRT] [W] Detected layernorm nodes in FP16.
[07/29/2025-23:13:43] [TRT] [W] Running layernorm after self-attention with FP16 Reduce or Pow may cause overflow. Forcing Reduce or Pow Layers in FP32 precision, or exporting the model to use INormalizationLayer (available with ONNX opset >= 17) can help preserving accuracy.
[07/29/2025-23:13:43] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[07/29/2025-23:13:43] [TRT] [W] Was not able to infer kOPT value(s) for tensor /vision_model/ReduceMax_output_0. Using one(s).
[07/29/2025-23:13:43] [TRT] [W] Was not able to infer kOPT value(s) for tensor /vision_model/ReduceMax_output_0. Using one(s).
[07/29/2025-23:13:44] [TRT] [I] Compiler backend is used during engine build.
[07/29/2025-23:13:55] [TRT] [I] Detected 2 inputs and 1 output network tensors.
[07/29/2025-23:14:00] [TRT] [E] IBuilder::buildSerializedNetwork: Error Code 1: Myelin ([shape.cpp:verify_output_type:1583] Mismatched type for tensor ONNXTRT_squeezeTensor_6846_output, i32 vs. expected type:i64. In compileGraph at optimizer/myelin/codeGenerator.cpp:1346)
Traceback (most recent call last):
  File "/dockerdata/dash-infer/dash-infer/python/.venv/bin/dashinfer_vlm_serve", line 10, in <module>
    sys.exit(main())
  File "/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/api_server/server.py", line 685, in main
    init()
  File "/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/api_server/server.py", line 94, in init
    model_loader.load_model(direct_load=False, load_format="auto")
  File "/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/vl_inference/utils/model_loader.py", line 170, in serialize
    onnx_trt_obj.generate_trt_engine(onnxFile, self.vision_model_path)
  File "/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/vl_inference/utils/trt/onnx_to_plan.py", line 203, in generate_trt_engine
    raise RuntimeError("Failed building %s" % planFile)
RuntimeError: Failed building /root/.cache/as_model/model.plan
I20250729 23:14:01.364504 1730607 as_engine.cpp:113] ~AsEngine called
I20250729 23:14:01.364549 1730607 as_engine.cpp:119] model_state_map_ size:0
I20250729 23:14:01.364559 1730607 weight_manager.cpp:721] ~WeightManager
I20250729 23:14:01.364566 1730607 as_engine.cpp:143] ~AsEngineImpl finished.
I20250729 23:14:01.364686 1730787 thread_pool_with_id.h:93] dummy message for wake up.
I20250729 23:14:01.364728 1730787 thread_pool_with_id.h:47] Thread Pool with id: 0 Exit!!!
I20250729 23:14:01.364852 1730786 thread_pool_with_id.h:93] dummy message for wake up.
I20250729 23:14:01.364892 1730786 thread_pool_with_id.h:47] Thread Pool with id: 0 Exit!!!
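
As a side note for anyone hitting this: a quick way to check which TensorRT build the Python bindings actually load (a minimal sketch, not dash-infer code). With tensorrt==10.5.0 and the tensorrt-cu12* 10.13.0.35 packages installed side by side, as in the version list above, the bindings and the shared libraries can disagree, which is one plausible source of the i32-vs-i64 Myelin error:

# Minimal sanity check (sketch): print the TensorRT version the Python
# bindings report and where the module was loaded from. A mismatch against
# the installed tensorrt-cu12-libs version points at mixed installs.
import tensorrt as trt

print("tensorrt bindings version:", trt.__version__)
print("loaded from:", trt.__file__)

# Creating a trivial builder confirms the bindings can talk to the libs.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
print("builder OK, fast fp16:", builder.platform_has_fast_fp16)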

@x574chen Which model and transformers lib version are you using?

kzjeef commented on Jul 29 '25, 14:07

@kzjeef dashinfer_vlm_serve --model /mnt/ssd/xchen/workspace/Qwen2.5-VL-3B-Instruct/ --vision_engine tensorrt 2>&1 | tee /tmp/vl2_5_trt.log

vl2_5_trt.log

transformers==4.52.3 TensorRT-10.5.0.18

x574chen commented on Jul 30 '25, 02:07

Maybe it's caused by my having some newer versions of TRT installed:

tensorrt                 10.5.0
tensorrt-cu12            10.13.0.35
tensorrt-cu12-bindings   10.13.0.35
tensorrt-cu12-libs       10.13.0.35

After uninstalling the tensorrt-cu12* packages, it reports this error:

_per_head': 128}
Qwen VL 2.5 start convert onnx.
Traceback (most recent call last):
  File "/dockerdata/dash-infer/dash-infer/python/.venv/bin/dashinfer_vlm_serve", line 10, in <module>
    sys.exit(main())
  File "/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/api_server/server.py", line 685, in main
    init()
  File "/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/api_server/server.py", line 94, in init
    model_loader.load_model(direct_load=False, load_format="auto")
  File "/dockerdata/dash-infer/dash-infer/multimodal/dashinfer_vlm/vl_inference/utils/model_loader.py", line 168, in serialize
    onnx_trt_obj = ONNX_TRT(self.hf_model_path, is_qwen_2_5=is_qwen_2_5)
NameError: name 'ONNX_TRT' is not defined
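
That NameError is consistent with ONNX_TRT being imported under a try/except that swallows a failed tensorrt import, roughly like this (a hypothetical sketch of the pattern, not the actual model_loader.py code):

# Hypothetical sketch of the failure mode (not the real dash-infer code):
# if the TRT helper import fails because tensorrt is absent or broken, the
# name ONNX_TRT is never bound, and the later call surfaces as a confusing
# NameError instead of a clear ImportError.
try:
    from dashinfer_vlm.vl_inference.utils.trt.onnx_to_plan import ONNX_TRT
except ImportError:
    pass  # swallowed: ONNX_TRT stays undefined

def serialize(hf_model_path: str, is_qwen_2_5: bool):
    # With --vision_engine tensorrt this line raises
    # NameError: name 'ONNX_TRT' is not defined
    return ONNX_TRT(hf_model_path, is_qwen_2_5=is_qwen_2_5)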

I'll reinstall my TRT and try again.

kzjeef commented on Jul 30 '25, 03:07

After reinstalling the env, it works again; the failure was maybe caused by the flash-attn or onnx version.

kzjeef commented on Jul 31 '25, 03:07

I found an error when running the video example:

service start cmd:

export DS_LLM_MAX_TOKENS=128000
export DS_LLM_MAX_IN_TOKENS=64000
uv run dashinfer_vlm_serve --model path-to-model/Qwen2.5-VL-3B-Instruct/  --host 127.0.0.1 --vision_engine tensorrt --max_length 64000

client cmd:

tests/test_openai_chat_completion.py --host=127.0.0.1

There are two issues:

  1. the single image request returns:
I'm sorry, but I can't assist with that.
  2. the video request returns nothing.

and I found an error in the server log:

[07/31/2025-11:49:12] [TRT] [E] IExecutionContext::enqueueV3: Error Code 7: Internal Error (/vision_model/TopK: K exceeds the maximum value allowed (3840). Condition '<' violated: 4452 >= 3841. Instruction: CHECK_LESS 4452 3841.)                            
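
If that 3840/3841 bound is a dynamic-shape maximum baked into the TRT optimization profile at build time, a long enough video would exceed it only at enqueue time. A rough back-of-envelope check (a sketch with assumed inputs; the real profile bounds are set in onnx_to_plan.py):

# Rough sketch (assumed numbers, not dash-infer's actual profile logic):
# the ViT sequence length for one request is grid_t * grid_h * grid_w
# patches, where grid_h/grid_w come from 14x14 spatial patches and grid_t
# from temporal_patch_size=2 frame grouping. If the engine was built with
# a kMAX of 3840 tokens, a request producing 4452 fails exactly as above.
def vit_seq_len(n_frames: int, height: int, width: int,
                patch: int = 14, temporal_patch: int = 2) -> int:
    grid_t = max(n_frames // temporal_patch, 1)
    return grid_t * (height // patch) * (width // patch)

print(vit_seq_len(8, 308, 532))   # 4 * 22 * 38 = 3344 -> under 3840, OK
print(vit_seq_len(12, 308, 532))  # 6 * 22 * 38 = 5016 -> over 3840, fails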

kzjeef commented on Jul 31 '25, 03:07

It seems that after commit 4ebf6177, starting with the transformers vision engine breaks?

command: dashinfer_vlm_serve --model model/Qwen2.5-VL-3B-Instruct/ --parallel_size 1 --host 127.0.0.1 --vision_engine transformers --max_length 64000

It reports:

[INFO   ]  init hie process, workers: 1                                                                                                                       
Exception in thread Thread-2:                                                                                                     
Traceback (most recent call last):                                                                                                                                                                                                                                
  File "/data/home/asherszhang/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()                                                                                                          
  File "/data/home/asherszhang/code/dash-infer/multimodal/dashinfer_vlm/vl_inference/runtime/hie_worker.py", line 103, in run    
    self.model(image.to(self.device), grid_thw=grid_thw.to(self.device))                                                                                                                                                                                          
  File "/data/home/asherszhang/code/dash-infer/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl                                                                                                             
    return self._call_impl(*args, **kwargs)                                                                        
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                            
  File "/data/home/asherszhang/code/dash-infer/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl                                                                                                                           
    return forward_call(*args, **kwargs)                                                                                                                                           
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                          
  File "/data/home/asherszhang/code/dash-infer/.venv/lib/python3.12/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 480, in forward                                                                       
    hidden_states = blk(                                                                                                                                                                                                                                          
                    ^^^^                                                                                                                                                                                                                                          
  File "/data/home/asherszhang/code/dash-infer/.venv/lib/python3.12/site-packages/transformers/modeling_layers.py", line 94, in __call__                                                                                                                          
    return super().__call__(*args, **kwargs)                                                                                                                                                                                                                      
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                      
  File "/data/home/asherszhang/code/dash-infer/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl                                                                                                                   
    return self._call_impl(*args, **kwargs)                                                                                                                                                                                                                       
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                       
  File "/data/home/asherszhang/code/dash-infer/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl                                                                                                           
    return forward_call(*args, **kwargs)                                                                                                                      
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                        
  File "/data/home/asherszhang/code/dash-infer/.venv/lib/python3.12/site-packages/accelerate/hooks.py", line 175, in new_forward
    output = module._old_forward(*args, **kwargs)                                                                                                             
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                             
  File "/data/home/asherszhang/code/dash-infer/.venv/lib/python3.12/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 309, in forward
    self.norm1(hidden_states),                                                                                                          
    ^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                  
  File "/data/home/asherszhang/code/dash-infer/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl                                                                                                                   
    return self._call_impl(*args, **kwargs)                                                                                                                                                                                                                       
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                       
  File "/data/home/asherszhang/code/dash-infer/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl                                                                                                                           
    return forward_call(*args, **kwargs)                                                                                                                                                                                                                          
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                          
  File "/data/home/asherszhang/code/dash-infer/.venv/lib/python3.12/site-packages/accelerate/hooks.py", line 175, in new_forward                                                                                                                                  
    output = module._old_forward(*args, **kwargs)                                                                                                                                                                                                                 
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                 
  File "/data/home/asherszhang/code/dash-infer/.venv/lib/python3.12/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 117, in forward                                                                              
    return self.weight * hidden_states.to(input_dtype)                                                                                                        
           ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                                                                                                                                                                      
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

Tested on an H20x8 machine.
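
A possible workaround sketch (untested; it assumes accelerate's device_map sharded the ViT across the 8 GPUs, so activations on cuda:0 hit weights on cuda:1): pin the whole vision model to a single device when loading it with transformers:

# Untested workaround sketch (assumes the device mismatch comes from
# accelerate sharding the model across GPUs): force every module onto one
# device so inputs moved with .to(self.device) always match the weights.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map={"": "cuda:0"},  # pin the whole model to cuda:0
)

Restricting visibility with CUDA_VISIBLE_DEVICES=0 before launching should have the same effect.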

kzjeef commented on Jul 31 '25, 04:07

@kzjeef no update on transformers for the ViT:

dashinfer_vlm_serve --model /mnt/ssd/xchen/workspace/Qwen2.5-VL-3B-Instruct/ --vision_engine transformers --parallel_size 1 2>&1 | tee /tmp/vl2_5_transformers.log

vl2_5_transformers.log

transformers==4.52.3

x574chen commented on Jul 31 '25, 05:07