
Tensor-parallel error when running multi-GPU inference with Qwen2.5-VL-7B

Status: Open · FontMLLM opened this issue 8 months ago · 5 comments

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc-per-node=8 run.py --data MMBench_TEST_EN MMBench_TEST_CN MMStar MME MMMU_TEST Q-Bench1_TEST --model Qwen2.5-VL-7B-Instruct --verbose

Error:

  File "/home/zhoupc/anaconda3/envs/qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zhoupc/anaconda3/envs/qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1845, in _call_impl
    return inner()
  File "/home/zhoupc/anaconda3/envs/qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1793, in inner
    result = forward_call(*args, **kwargs)
  File "/home/zhoupc/anaconda3/envs/qwen/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 725, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/zhoupc/anaconda3/envs/qwen/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 720, in _conv_forward
    return F.conv3d(
  File "/home/zhoupc/anaconda3/envs/qwen/lib/python3.11/site-packages/torch/_compile.py", line 32, in inner
    return disable_fn(*args, **kwargs)
  File "/home/zhoupc/anaconda3/envs/qwen/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
    return fn(*args, **kwargs)
  File "/home/zhoupc/anaconda3/envs/qwen/lib/python3.11/site-packages/torch/distributed/tensor/_api.py", line 346, in __torch_dispatch__
    return DTensor._op_dispatcher.dispatch(
  File "/home/zhoupc/anaconda3/envs/qwen/lib/python3.11/site-packages/torch/distributed/tensor/_dispatch.py", line 164, in dispatch
    return self._custom_op_handlers[op_call](op_call, args, kwargs)  # type: ignore[operator]
  File "/home/zhoupc/anaconda3/envs/qwen/lib/python3.11/site-packages/torch/distributed/tensor/_tp_conv.py", line 238, in convolution_handler
    dtensor.DTensor._op_dispatcher.sharding_propagator.propagate(op_info)
  File "/home/zhoupc/anaconda3/envs/qwen/lib/python3.11/site-packages/torch/distributed/tensor/_sharding_prop.py", line 206, in propagate
    OutputSharding, self.propagate_op_sharding(op_info.schema)
  File "/home/zhoupc/anaconda3/envs/qwen/lib/python3.11/site-packages/torch/distributed/tensor/_sharding_prop.py", line 46, in __call__
    return self.cache(*args, **kwargs)
  File "/home/zhoupc/anaconda3/envs/qwen/lib/python3.11/site-packages/torch/distributed/tensor/_sharding_prop.py", line 422, in propagate_op_sharding_non_cached
    raise RuntimeError(
RuntimeError: Sharding propagation failed on op Op(op=aten.convolution.default, args_schema=Spec(R on (5208, 3, 2, 14, 14)), Spec(R on (1280, 3, 2, 14, 14)), None, [2, 14, 14], [0, 0, 0], [1, 1, 1], False, [0, 0, 0], 1 @ mesh: (8,)). Error:

I can confirm the GPUs themselves are fine: llava-1.5-7B runs on the same setup without any problems.

FontMLLM commented May 23 '25 12:05

[2025-05-23 20:43:21] ERROR - RUN - run.py: main - 483: Model Qwen2.5-VL-7B-Instruct x Dataset Q-Bench1_TEST combination failed: CUDA error: invalid device ordinal
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
, skipping this combination.
Traceback (most recent call last):
File "/home/zhoupc/safe_alignment/VLMEvalKit/run.py", line 355, in main
model = infer_data_job(
^^^^^^^^^^^^^^^
File "/home/zhoupc/safe_alignment/VLMEvalKit/vlmeval/inference.py", line 185, in infer_data_job
model = infer_data(
^^^^^^^^^^^
File "/home/zhoupc/safe_alignment/VLMEvalKit/vlmeval/inference.py", line 117, in infer_data
model = supported_VLM[model_name]() if isinstance(model, str) else model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zhoupc/safe_alignment/VLMEvalKit/vlmeval/vlm/qwen2_vl/model.py", line 303, in __init__
self.model = MODEL_CLS.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zhoupc/anaconda3/envs/qwen/lib/python3.11/site-packages/transformers/modeling_utils.py", line 309, in _wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/zhoupc/anaconda3/envs/qwen/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4182, in from_pretrained
tp_plan, device_map, device_mesh = initialize_tensor_parallelism(tp_plan, tp_size=None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zhoupc/anaconda3/envs/qwen/lib/python3.11/site-packages/transformers/integrations/tensor_parallel.py", line 79, in initialize_tensor_parallelism
current_device.set_device(int(os.environ["LOCAL_RANK"]))
File "/home/zhoupc/anaconda3/envs/qwen/lib/python3.11/site-packages/torch/cuda/__init__.py", line 476, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 2.95it/s]
Infer Qwen2.5-VL-7B-Instruct/Q-Bench1_TEST, Rank 0/8: 0%| | 0/187 [00:00<?, ?it/s]
[2025-05-23 20:43:23] ERROR - RUN - run.py: main - 483: Model Qwen2.5-VL-7B-Instruct x Dataset Q-Bench1_TEST combination failed: Sharding propagation failed on op Op(op=aten.convolution.default, args_schema=Spec(R on (5208, 3, 2, 14, 14)), Spec(R on (1280, 3, 2, 14, 14)), None, [2, 14, 14], [0, 0, 0], [1, 1, 1], False, [0, 0, 0], 1 @ mesh: (8,)).
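Both failures are consistent with transformers' tensor-parallel path being activated inside `from_pretrained`: the traceback shows `initialize_tensor_parallelism` calling `torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))`, and "invalid device ordinal" is what that call raises when LOCAL_RANK does not index a device left visible by CUDA_VISIBLE_DEVICES (the exact trigger in this run, where all 8 GPUs are listed, is unclear and may come from the device-mesh setup itself). Separately, DTensor's sharding propagator has no rule for `aten.convolution.default`, the vision tower's Conv3d patch embedding, so tensor parallelism fails even when device setup succeeds. A pure-Python sketch of the ordinal rule (no GPU needed; `valid_local_rank` is a hypothetical helper for illustration, not a torch API):

```python
def valid_local_rank(local_rank: int, cuda_visible_devices: str) -> bool:
    """Hypothetical helper: after CUDA_VISIBLE_DEVICES remaps GPUs,
    valid ordinals are 0 .. n_visible - 1; anything outside that range
    makes torch.cuda.set_device raise "invalid device ordinal"."""
    visible = [d for d in cuda_visible_devices.split(",") if d.strip()]
    return 0 <= local_rank < len(visible)

if __name__ == "__main__":
    # 8 torchrun workers with all 8 GPUs visible: every rank is fine.
    assert all(valid_local_rank(r, "0,1,2,3,4,5,6,7") for r in range(8))
    # 8 workers but only 4 GPUs visible: ranks 4..7 would fail.
    assert not valid_local_rank(4, "0,1,2,3")
```

This is why restricting CUDA_VISIBLE_DEVICES while keeping `--nproc-per-node=8` would also reproduce the ordinal error, independently of the sharding bug.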

FontMLLM commented May 23 '25 12:05

same problem👀

XuZhengzhuo commented May 28 '25 08:05

Could this be a transformers version issue? The transformers release matching Qwen2.5-VL is transformers==4.51.3.

MaoSong2022 commented May 30 '25 03:05

But the matching transformers==4.51.3 cannot be used either; it leads to this error:

Traceback (most recent call last):
  File "/share/project/cwm/ziyang.yan/yanziyang_cwm/ziyang.yan/zhou/VLM_fast/v2_pruning.py", line 3, in <module>
    from model.qwen2_5_vl.modeling_qwen2_5_vl import Qwen2_5_VLForConditionalGeneration
  File "/share/project/cwm/ziyang.yan/yanziyang_cwm/ziyang.yan/zhou/VLM_fast/model/qwen2_5_vl/modeling_qwen2_5_vl.py", line 17, in <module>
    from transformers.utils import auto_docstring, can_return_tuple, is_torch_flex_attn_available, is_torchdynamo_compiling, logging
ImportError: cannot import name 'auto_docstring' from 'transformers.utils' (/share/project/cwm/ziyang.yan/yanziyang_cwm/ziyang.yan/zhou/VLM_fast/FastVLAD/lib/python3.11/site-packages/transformers/utils/__init__.py)
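This ImportError suggests the installed transformers predates `auto_docstring`, which exists only in more recent releases of `transformers.utils`, while the local copy of modeling_qwen2_5_vl.py was taken from a newer release. If changing versions is not an option, one hedged workaround is to edit the import in the local modeling file into a guarded form with a no-op fallback (a sketch only; the real decorator attaches generated docstrings, so the shim affects documentation, not model behavior — verify that assumption against your transformers version):

```python
try:
    # Present only in newer transformers releases.
    from transformers.utils import auto_docstring
except ImportError:
    # Fallback shim: also covers transformers not being installed at all
    # (ModuleNotFoundError is a subclass of ImportError).
    def auto_docstring(obj=None, **kwargs):
        """No-op stand-in returning the decorated object unchanged.

        Supports both @auto_docstring and @auto_docstring(...) usage.
        """
        if obj is None:
            return lambda o: o
        return obj
```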

Junzhou-Chen commented Jun 24 '25 10:06

My solution was to unify the transformers versions of both the training framework and VLMEvalKit to 4.49.0.
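Whichever release you settle on, it is worth verifying that the training environment and the VLMEvalKit environment actually report the same transformers version before comparing results. A small standard-library sketch (the 4.49.0 pin follows this thread; it is not claimed to be a universal fix, and the helper names are illustrative):

```python
from importlib import metadata

PINNED = "4.49.0"  # the release this thread converged on

def installed_transformers_version() -> "str | None":
    """Return the installed transformers version, or None if absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None

def matches_pin(version, pinned: str = PINNED) -> bool:
    """True only if an installed version exactly matches the pin."""
    return version is not None and version.strip() == pinned

if __name__ == "__main__":
    print("transformers:", installed_transformers_version())
```

Run the same script in both environments; if the printed versions differ, reinstall with `pip install "transformers==4.49.0"` in each before re-running the evaluation.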

XuZhengzhuo commented Jun 25 '25 02:06