310P appears not to support dist.broadcast; how can multi-card inference be supported?
Motivation
On Ascend 310P we set up the following environment: torch==2.3.1+cpu and torch-npu==2.3.1.post4.
Using the LMDeploy code from DeepLink-org:support_310P and the dlinfer code from yao-fengchen:support_310P, we ran a multi-card inference task on the Ascend 310P with tp=2:
from lmdeploy import pipeline
from lmdeploy import PytorchEngineConfig, GenerationConfig

if __name__ == "__main__":
    pipe = pipeline("/mnt/data/llm/Qwen1.5-7B-Chat/",
                    backend_config=PytorchEngineConfig(
                        tp=2,
                        device_type="ascend",
                        dtype='float16',
                        eager_mode=True,
                        cache_max_entry_count=0.5))
    # question = ["Shanghai is"]
    question = ["Shanghai is the largest city in China. Please introduce it."]
    response = pipe(question, gen_config=GenerationConfig(max_new_tokens=10))
    print(response)
The following errors occur:
/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch_npu/utils/collect_env.py:59: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current owner.
warnings.warn(f"Warning: The {path} owner does not match the current owner.")
/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch_npu/utils/collect_env.py:59: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.0.0.alpha001/x86_64-linux/ascend_toolkit_install.info owner does not match the current owner.
warnings.warn(f"Warning: The {path} owner does not match the current owner.")
/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
2025-02-07 20:14:14,095 - lmdeploy - WARNING - transformers.py:22 - LMDeploy requires transformers version: [4.33.0 ~ 4.46.1], but found version: 4.48.0
/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch_npu/utils/collect_env.py:59: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current owner.
warnings.warn(f"Warning: The {path} owner does not match the current owner.")
/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch_npu/utils/collect_env.py:59: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.0.0.alpha001/x86_64-linux/ascend_toolkit_install.info owner does not match the current owner.
warnings.warn(f"Warning: The {path} owner does not match the current owner.")
/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
Loading weights from safetensors: 0%| | 0/4 [00:00<?, ?it/s]Loading weights from safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00, 1.10it/s]
./home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch_npu/distributed/distributed_c10d.py:97: UserWarning: HCCL doesn't support gather at the moment. Implemented with allgather instead.
warnings.warn("HCCL doesn't support gather at the moment. Implemented with allgather instead.")
2025-02-07 20:14:52,637 - lmdeploy - ERROR - model_agent.py:442 - Rank[1] failed.
Traceback (most recent call last):
File "/home/yzk/lmdeploy_support_310P/lmdeploy/pytorch/engine/model_agent.py", line 439, in _start_tp_process
func(rank, *args, **kwargs)
File "/home/yzk/lmdeploy_support_310P/lmdeploy/pytorch/engine/model_agent.py", line 385, in _tp_model_loop
patched_model, cache_engine, _ = _tp_build_model(rank,
^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/lmdeploy_support_310P/lmdeploy/pytorch/engine/model_agent.py", line 334, in _tp_build_model
raise e
File "/home/yzk/lmdeploy_support_310P/lmdeploy/pytorch/engine/model_agent.py", line 330, in _tp_build_model
cache_config = _broadcast_config(cache_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/lmdeploy_support_310P/lmdeploy/pytorch/engine/model_agent.py", line 298, in _broadcast_config
dist.gather_object(cache_config, gathered_configs)
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch_npu/distributed/distributed_c10d.py", line 177, in _gather_object
_gather(
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch_npu/distributed/distributed_c10d.py", line 99, in _gather
dist.broadcast_object_list(recv_size_list, dst, group)
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2661, in broadcast_object_list
object_tensor = torch.empty( # type: ignore[call-overload]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py", line 153, in decorated
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
RuntimeError: Trying to create tensor with negative dimension -138608640: [-138608640]
[ERROR] 2025-02-07-20:14:52 (PID:400524, Device:1, RankID:-1) ERR01003 OPS invalid value
2025-02-07 20:15:01,109 - lmdeploy - ERROR - model_agent.py:467 - TP process 0 failed with exitcode 1.
(yzk-lmdeploy) [yzk@devserver-2efb lmdeploy_support_310P]$ [ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 34 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
After we replaced gather_object with all_gather_object, _broadcast_config synchronized correctly; however, the subsequent _broadcast_inputs still fails (the gather_object path also performs a broadcast internally). The modification is shown below:
def _broadcast_config_old(cache_config):
    """broadcast cache config, use minimum cache."""
    if rank == 0:
        gathered_configs = [None] * world_size
        dist.gather_object(cache_config, gathered_configs)
        num_gpu_blocks_list = [config.num_gpu_blocks for config in gathered_configs]
        num_cpu_blocks_list = [config.num_cpu_blocks for config in gathered_configs]
        min_num_gpu_blocks = min(num_gpu_blocks_list)
        min_num_cpu_blocks = min(num_cpu_blocks_list)
        cache_config.num_cpu_blocks = min_num_cpu_blocks
        cache_config.num_gpu_blocks = min_num_gpu_blocks
        config_list = [cache_config]
    else:
        gathered_configs = None
        dist.gather_object(cache_config, gathered_configs)
        config_list = [None]
    dist.broadcast_object_list(config_list)
    return config_list[0]
def _broadcast_config(cache_config):
    """broadcast cache config, use minimum cache."""
    print("===yzk debug config", cache_config)
    gathered_configs = [None] * world_size
    dist.all_gather_object(gathered_configs, cache_config)
    num_gpu_blocks_list = [
        config.num_gpu_blocks for config in gathered_configs
    ]
    num_cpu_blocks_list = [
        config.num_cpu_blocks for config in gathered_configs
    ]
    min_num_gpu_blocks = min(num_gpu_blocks_list)
    min_num_cpu_blocks = min(num_cpu_blocks_list)
    cache_config.num_cpu_blocks = min_num_cpu_blocks
    cache_config.num_gpu_blocks = min_num_gpu_blocks
    config_list = [cache_config]
    return config_list[0]
The output is as follows:
2025-02-07 20:21:37,551 - lmdeploy - WARNING - async_engine.py:625 - GenerationConfig: GenerationConfig(n=1, max_new_tokens=50, do_sample=False, top_p=1.0, top_k=50, min_p=0.0, temperature=0.8, repetition_penalty=1.0, ignore_eos=False, random_seed=None, stop_words=None, bad_words=None, stop_token_ids=[151645], bad_token_ids=None, min_new_tokens=None, skip_special_tokens=True, spaces_between_special_tokens=True, logprobs=None, response_format=None, logits_processors=None, output_logits=None, output_last_hidden_state=None)
2025-02-07 20:21:37,551 - lmdeploy - WARNING - async_engine.py:626 - Since v0.6.0, lmdeploy add `do_sample` in GenerationConfig. It defaults to False, meaning greedy decoding. Please set `do_sample=True` if sampling decoding is needed
/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch_npu/utils/storage.py:38: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
if self.device.type != 'cpu':
[rank1]:[W AddKernelNpu.cpp:82] Warning: The oprator of add is executed, Currently High Accuracy but Low Performance OP with 64-bit has been used, Please Do Some Cast at Python Functions with 32-bit for Better Performance! (function operator())
.2025-02-07 20:21:38,452 - lmdeploy - ERROR - model_agent.py:458 - Rank[1] failed.
Traceback (most recent call last):
File "/home/yzk/lmdeploy_support_310P/lmdeploy/pytorch/engine/model_agent.py", line 455, in _start_tp_process
func(rank, *args, **kwargs)
File "/home/yzk/lmdeploy_support_310P/lmdeploy/pytorch/engine/model_agent.py", line 416, in _tp_model_loop
model_forward(
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/lmdeploy_support_310P/lmdeploy/pytorch/engine/model_agent.py", line 151, in model_forward
output = model(**input_dict)
^^^^^^^^^^^^^^^^^^^
File "/home/yzk/lmdeploy_support_310P/lmdeploy/pytorch/backends/graph_runner.py", line 24, in __call__
return self.model(**kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/lmdeploy_support_310P/lmdeploy/pytorch/models/qwen2.py", line 324, in forward
hidden_states = self.model(
^^^^^^^^^^^
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/lmdeploy_support_310P/lmdeploy/pytorch/models/qwen2.py", line 264, in forward
hidden_states, residual = decoder_layer(
^^^^^^^^^^^^^^
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/lmdeploy_support_310P/lmdeploy/pytorch/models/qwen2.py", line 188, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/lmdeploy_support_310P/lmdeploy/pytorch/models/qwen2.py", line 77, in forward
query_states, key_states = self.apply_rotary_pos_emb(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/lmdeploy_support_310P/lmdeploy/pytorch/nn/rotary_embedding.py", line 112, in forward
return self.impl.forward(query, key, cos, sin, inplace)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/lmdeploy_support_310P/lmdeploy/pytorch/backends/dlinfer/apply_rotary_emb.py", line 21, in forward
return apply_rotary_pos_emb(query, key, cos, sin, q_embed, k_embed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/lmdeploy_support_310P/lmdeploy/pytorch/kernels/dlinfer/apply_rotary_pos_emb.py", line 21, in apply_rotary_pos_emb
ext_ops.apply_rotary_pos_emb(query_states_reshaped,
File "/home/yzk/dlinfer_support_310P/dlinfer/graph/custom_op.py", line 70, in patched_func
return func_with_default(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/dlinfer_support_310P/dlinfer/ops/llm.py", line 84, in apply_rotary_pos_emb
return vendor_ops_registry["apply_rotary_pos_emb"](
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/dlinfer_support_310P/dlinfer/vendor/ascend/torch_npu_ops.py", line 80, in apply_rotary_pos_emb
return torch.ops.npu.npu_apply_rotary_pos_emb(query, key, cos, sin, "BSND")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/site-packages/torch/_ops.py", line 854, in __call__
return self_._op(*args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: call aclnnApplyRotaryPosEmb failed, detail:E89999: Inner Error!
E89999: [PID: 408156] 2025-02-07-20:21:38.448.775 op[ApplyRotaryPosEmb], input is empty[FUNC:GetInputParams][FILE:apply_rotary_pos_emb_tiling.cc][LINE:163]
TraceBack (most recent call last):
op[ApplyRotaryPosEmb], GetInputParams failed[FUNC:CheckParams][FILE:apply_rotary_pos_emb_tiling.cc][LINE:170]
op[ApplyRotaryPosEmb], CheckParams return failed.[FUNC:Tiling4ApplyRotaryPosEmb][FILE:apply_rotary_pos_emb_tiling.cc][LINE:447]
ApplyRotaryPosEmb do tiling failed, ret is -1.
Check NnopbaseExecutorDoTiling(executor) failed
Check NnopbaseExecutorTilingAndUpdateBinInfo(executor) failed
Check NnopbaseRunForWorkspace(*executor, workspaceSize) failed
[ERROR] 2025-02-07-20:21:38 (PID:408156, Device:1, RankID:-1) ERR01100 OPS call acl api failed
[W OpParamMaker.cpp:387] Warning: E90000: [PID: 408156] 2025-02-07-20:21:38.563.236 Compile operator failed, cause: invalid literal for int() with base 10: '' File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/interface.py", line 83, in cann_kb_search_get
return public_interact_get(key)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 400, in public_interact_get
status, msg_mgr = get_msg_obj()
^^^^^^^^^^^^^
File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 322, in get_msg_obj
return client_get_msg_obj(msg_file_path, file_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/multiprocess_util.py", line 191, in client_get_msg_obj
secrets_size = int(fp.read(MARK_DIGEST_SIZE).decode())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TraceBack (most recent call last):
Fail to get tune knowledge for op[GatherV224, GatherV2].[FUNC:UpdateNodeCompileParams][FILE:op_compiler.cc][LINE:668]
Call OptimizeFusedGraph failed, ret:4294967295, engine_name:AIcoreEngine, graph_name:partition0_rank1_new_sub_graph3[FUNC:OptimizeSubGraph][FILE:graph_optimize.cc][LINE:119]
subgraph 0 optimize failed[FUNC:OptimizeSubGraphWithMultiThreads][FILE:graph_manager.cc][LINE:814]
build graph failed, graph id:23, ret:4294967295[FUNC:BuildModelWithGraphId][FILE:ge_generator.cc][LINE:1615]
[Build][SingleOpModel]call ge interface generator.BuildSingleOpModel failed. ge result = 4294967295[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
[Build][Op]Fail to build op model[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
build op model failed, result = 500002[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
(function ExecFunc)
2025-02-07 20:21:46,365 - lmdeploy - ERROR - model_agent.py:483 - TP process 0 failed with exitcode 1.
(yzk-lmdeploy) [yzk@devserver-2efb lmdeploy_support_310P]$ [ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
/home/yzk/.conda/envs/yzk-lmdeploy/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 34 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
We therefore suspected that the 310P does not support broadcast and similar dist operations. Testing torch.distributed's all_reduce and broadcast with standalone Python scripts shows that all_reduce runs correctly while broadcast does not (test code below). Finally, we asked Huawei's Ascend team, who confirmed that the 310P indeed does not support broadcast.
Does LMDeploy have any plans to support multi-card inference on the 310P? And if the 310P does not support dist.broadcast, how should the code be modified to avoid broadcast operations?
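If dist.broadcast is indeed unavailable on the 310P, one direction we can imagine (only a rough sketch under that assumption, not code from LMDeploy or torch_npu; the helper name is made up) is to emulate object broadcasts the same way we patched _broadcast_config: let every rank contribute via all_gather_object, which only relies on allgather, and keep the entry from the source rank.

import torch.distributed as dist

def broadcast_object_via_all_gather(obj, src, world_size):
    # Emulate dist.broadcast_object_list with all_gather_object:
    # every rank contributes its own object, then all ranks keep
    # the object that originated from rank `src`. This avoids the
    # HCCL broadcast primitive entirely.
    gathered = [None] * world_size
    dist.all_gather_object(gathered, obj)
    return gathered[src]

This trades extra traffic (every rank ships its object instead of only the source) for not depending on broadcast.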
Testing HCCL with PyTorch
Testing all_reduce
import os
import torch
import torch_npu
import torch.distributed as dist

# get world_size and rank_id from the environment
rank_id = int(os.getenv("LOCAL_RANK", 0))
world_size = int(os.getenv("WORLD_SIZE", 1))
init_method = 'tcp://127.0.0.1:23456'

torch.npu.set_device(rank_id)
dist.init_process_group(backend='hccl',
                        init_method=init_method,
                        world_size=world_size,
                        rank=rank_id)  # initialize the process group

tensor = torch.ones((8, ), dtype=torch.float16).npu()
print(f"rank_id: {rank_id}, world_size: {world_size}")
print(f"before allreduce: {tensor}")
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)  # all_reduce operation
print(f"after allreduce: {tensor}")
LOCAL_RANK=0 WORLD_SIZE=2 python /tmp/allreduce_dist_test.py
LOCAL_RANK=1 WORLD_SIZE=2 python /tmp/allreduce_dist_test.py # in another terminal
Ranks 0 and 1 both print the correct result:
before allreduce: tensor([1., 1., 1., 1., 1., 1., 1., 1.], device='npu:0', dtype=torch.float16)
after allreduce: tensor([2., 2., 2., 2., 2., 2., 2., 2.], device='npu:0', dtype=torch.float16)
Testing broadcast
import os
import torch
import torch_npu
import torch.distributed as dist

# get world_size and rank_id from the environment
rank_id = int(os.getenv("LOCAL_RANK", 0))
world_size = int(os.getenv("WORLD_SIZE", 1))
init_method = 'tcp://127.0.0.1:23456'

torch.npu.set_device(rank_id)
dist.init_process_group(backend='hccl',
                        init_method=init_method,
                        world_size=world_size,
                        rank=rank_id)  # initialize the process group

if rank_id == 0:
    tensor = torch.randn((2, 5), dtype=torch.float16).npu()
else:
    tensor = torch.zeros((2, 5), dtype=torch.float16).npu()

print(f"rank_id: {rank_id}, world_size: {world_size}")
print(f"before broadcast: {tensor}")
dist.broadcast(tensor, src=0)  # broadcast operation
print(f"after broadcast: {tensor}")
LOCAL_RANK=0 WORLD_SIZE=2 python /tmp/broadcast_dist_test.py
LOCAL_RANK=1 WORLD_SIZE=2 python /tmp/broadcast_dist_test.py # in another terminal
Rank 1 prints an incorrect result:
.before broadcast: tensor([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]], device='npu:1', dtype=torch.float16)
after broadcast: tensor([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]], device='npu:1', dtype=torch.float16)
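Since all_reduce works while broadcast does not, another possible workaround for tensor broadcasts (again just a sketch under the assumption that only broadcast is broken, not a verified fix; the helper is hypothetical) is to zero the tensor on every non-source rank and then sum-reduce, which reproduces the source tensor on all ranks:

import torch.distributed as dist

def broadcast_via_all_reduce(tensor, src, rank):
    # Emulate dist.broadcast with all_reduce(SUM): non-source ranks
    # zero their local copy first, so the element-wise sum across
    # ranks equals the source tensor everywhere.
    if rank != src:
        tensor.zero_()
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return tensor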
Related resources
No response
Additional context
No response
We have just got single-card inference working on the 310P; we will look into the multi-card issue.
We have just got single-card inference working on the 310P; we will look into the multi-card issue.
OK, thank you very much. We look forward to LMDeploy being compatible with a wider range of platforms in the future.
Looking forward to it as well (+1).
We have just got single-card inference working on the 310P; we will look into the multi-card issue.
Is there an image that is known to work? I keep getting the following errors with the officially provided crpi-4crprmm5baj1v8iv.cn-hangzhou.personal.cr.aliyuncs.com/lmdeploy_dlinfer/ascend:latest image:
2025-04-10 11:40:39,181 - lmdeploy - INFO - async_engine.py:259 - input backend=pytorch, backend_config=PytorchEngineConfig(dtype='auto', tp=1, dp=1, dp_rank=0, session_len=None, max_batch_size=256, cache_max_entry_count=0.8, prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=8192, thread_safe=False, enable_prefix_caching=True, device_type='ascend', eager_mode=True, custom_module_map=None, download_dir=None, revision=None, quant_policy=0, distributed_executor_backend=None)
2025-04-10 11:40:39,181 - lmdeploy - INFO - async_engine.py:260 - input chat_template_config=None
[W compiler_depend.ts:615] Warning: expandable_segments currently defaults to false. You can enable this feature by `export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True`. (function operator())
2025-04-10 11:40:45,406 - lmdeploy - INFO - async_engine.py:269 - updated chat_template_onfig=ChatTemplateConfig(model_name='qwen2d5', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability=None, stop_words=None)
/usr/local/python3.10.5/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:301: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
/usr/local/python3.10.5/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:260: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
2025-04-10 11:40:49,063 - lmdeploy - ERROR - base.py:53 - RuntimeError: call aclnnAdd failed, detail:EZ9999: Inner Error!
EZ9999: [PID: 338] 2025-04-10-11:40:49.056.766 Parse dynamic kernel config fail.
TraceBack (most recent call last):
AclOpKernelInit failed opType
AddAiCore ADD_TO_LAUNCHER_LIST_AICORE failed.
[ERROR] 2025-04-10-11:40:49 (PID:338, Device:0, RankID:-1) ERR01005 OPS internal error
2025-04-10 11:40:49,063 - lmdeploy - ERROR - base.py:54 - <PyTorch> check failed!
PyTorch is not available.
/usr/local/python3.10.5/lib/python3.10/tempfile.py:837: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp6lu__ztv'>
_warnings.warn(warn_message, ResourceWarning)
We have just got single-card inference working on the 310P; we will look into the multi-card issue.
Is there an image that is known to work? I keep getting errors with the officially provided crpi-4crprmm5baj1v8iv.cn-hangzhou.personal.cr.aliyuncs.com/lmdeploy_dlinfer/ascend:latest image (error log quoted above).
That image only supports the 910B. The CANN packages for the 910B and the 310P are different, and the lmdeploy + dlinfer versions in that image do not yet support the 310P. We are currently cleaning up the 310P code and will merge 310P support later.
The main branch has now merged 310P graph mode and supports multi-card inference.