VLMEvalKit: speeding up inference for 78B models

Hello, I have eight H100s and want to evaluate internvl2_5-78B. I currently run inference with python, using AUTO_SPLIT to split the model across three cards. Can this be sped up further by running it with DDP?

Feb 25 '25 02:02 zfr00

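Hello,

The DDP (Distributed Data Parallel) strategy is normally used to parallelize model training. For InternVL2.5-78B, if you want parallel inference, you can try launching with torchrun --nproc_per_node=2. This command runs two model instances at the same time. With the split you described, each instance is placed on three GPU cards, so six cards are used in total. You can of course adjust the layout to fit the hardware limits of your machine.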
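The card arithmetic above can be sketched as follows. This is only an illustration, assuming each torchrun process takes an equal, contiguous slice of the visible GPUs; `gpus_for_rank` is a hypothetical helper, not a VLMEvalKit function, and the toolkit's real assignment logic may differ.

```python
def gpus_for_rank(rank: int, world_size: int, visible_gpus: list[int]) -> list[int]:
    """Hypothetical helper: the contiguous slice of GPUs given to one process.

    Assumes an even split; leftover cards (if any) stay unused.
    """
    per_rank = len(visible_gpus) // world_size
    if per_rank == 0:
        raise ValueError("more processes than visible GPUs")
    start = rank * per_rank
    return visible_gpus[start:start + per_rank]


# Six visible cards and two processes: three cards per model instance.
visible = [0, 1, 2, 3, 4, 5]
layout = {rank: gpus_for_rank(rank, 2, visible) for rank in range(2)}
print(layout)  # {0: [0, 1, 2], 1: [3, 4, 5]}
```

Under the same assumption, eight visible cards with --nproc_per_node=4 would leave each instance only two cards.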

Feb 25 '25 03:02 PhoenixZ810

Then how should the AUTO_SPLIT parameter be set? When I replace the original python launch with torchrun --nproc_per_node=2, I get OOM.

Feb 25 '25 06:02 zfr00

My sh script looks like this:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export AUTO_SPLIT=1
torchrun --nproc_per_node=4 run.py --data "$data" --model Qwen2.5-VL-72B-Instruct --verbose --mode infer --reuse
It appears to use only the first four cards and does not split the model the way the plain python launch does.

Feb 25 '25 06:02 zfr00

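For the Qwen2.5-72B model, its own split_model defines how the model is split across cards. For a 72B model, splitting across only two cards easily leads to OOM. We generally recommend at least four cards for inference.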
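A rough back-of-the-envelope check shows why two cards are tight. The numbers below are assumptions, not measurements: 80 GB of memory per H100, weights stored in 16-bit precision (2 bytes per parameter), and a nominal 72e9 parameters. Activations, the KV cache, and the vision encoder all need memory on top of the weights.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


GPU_MEMORY_GB = 80  # assumed per-card capacity
weights_gb = weight_memory_gb(72e9)  # assumed 16-bit weights

for num_gpus in (2, 4):
    total_gb = num_gpus * GPU_MEMORY_GB
    headroom_gb = total_gb - weights_gb
    print(f"{num_gpus} cards: {total_gb} GB total, "
          f"{weights_gb:.0f} GB weights, {headroom_gb:.0f} GB left for everything else")
```

With these assumptions two cards leave about 16 GB in total for everything besides the weights, while four cards leave about 176 GB.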

Feb 25 '25 06:02 PhoenixZ810

Sorry, this script also fails:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export AUTO_SPLIT=1
torchrun --nproc_per_node=2 run.py --data "$data" --model Qwen2.5-VL-72B-Instruct --verbose --mode infer --reuse
Can AUTO_SPLIT only be set to 1?

Feb 25 '25 06:02 zfr00

Is the error still OOM? If so, I suggest trying --nproc_per_node=1.

Feb 25 '25 06:02 PhoenixZ810

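Quoting zfr00:
Sorry, this script also fails:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export AUTO_SPLIT=1
torchrun --nproc_per_node=2 run.py --data "$data" --model Qwen2.5-VL-72B-Instruct --verbose --mode infer --reuse
Can AUTO_SPLIT only be set to 1?

I tried this as well. It appears to use only the first two cards and does not split the model the way the plain python launch does.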

Aug 12 '25 11:08 DQYZHWK