
[QUESTION] torch broadcast error

Open sallyjunjun opened this issue 1 year ago • 1 comment

When I run run_open_llama_w_vescale.py with torch version 2.5.1+cu124, I hit the following error:

```
[rank4]: Traceback (most recent call last):
[rank4]:   File "/code/veScale/examples/open_llama_4D_benchmark/run_open_llama_w_vescale-ljx.py", line 104, in <module>
[rank4]:     vescale_model = parallelize_module(model, device_mesh["TP"], sharding_plan)
[rank4]:   File "/code/veScale/vescale/dmodule/api.py", line 276, in parallelize_module
[rank4]:     DModule.init_parameters(module, is_model_sharded)
[rank4]:   File "/code/veScale/vescale/dmodule/_dmodule.py", line 302, in init_parameters
[rank4]:     buffer = DModule._distribute_parameter(buffer, module._device_mesh, buffer_pi, is_sharded)
[rank4]:   File "/miniconda3-new/envs/llm-cuda12.4-vescale/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank4]:     return func(*args, **kwargs)
[rank4]:   File "/code/veScale/vescale/dmodule/_dmodule.py", line 266, in _distribute_parameter
[rank4]:     dt = distribute_tensor(t, device_mesh, pi.placements)
[rank4]:   File "/code/veScale/vescale/dtensor/api.py", line 252, in distribute_tensor
[rank4]:     local_tensor = _replicate_tensor(local_tensor, device_mesh, idx)
[rank4]:   File "/code/veScale/vescale/dtensor/redistribute.py", line 191, in _replicate_tensor
[rank4]:     tensor = mesh_broadcast(tensor, mesh, mesh_dim=mesh_dim)
[rank4]:   File "/code/veScale/vescale/dtensor/_collective_utils.py", line 273, in mesh_broadcast
[rank4]:     aysnc_tensor = funcol.broadcast(tensor, src=src_for_dim, group=dim_group)
[rank4]:   File "/miniconda3-new/envs/llm-cuda12.4-vescale/lib/python3.10/site-packages/torch/distributed/_functional_collectives.py", line 153, in broadcast
[rank4]:     tensor = torch.ops._c10d_functional.broadcast(self, src, group_name)
[rank4]:   File "/miniconda3-new/envs/llm-cuda12.4-vescale/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
[rank4]:     return self._op(*args, **(kwargs or {}))
[rank4]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2777, invalid argument (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5
[rank4]: ncclInvalidArgument: Invalid value for an argument.
[rank4]: Last error:
[rank4]: Broadcast : invalid root 4 (root should be in the 0..4 range)
```

Is this because the torch version is not compatible?
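The final NCCL line ("invalid root 4") suggests that `funcol.broadcast` is being given a global rank as `src`, while the backend expects the root expressed as a rank *within* the process group. This is a plausible mismatch between veScale's `mesh_broadcast` and newer torch functional collectives, not confirmed by the source. A minimal sketch of the distinction, using hypothetical names (`global_to_group_rank` and `tp_group` are illustrations, not veScale or torch APIs):

```python
# Hypothetical illustration of global-rank vs. group-relative-rank.
# A tensor-parallel (TP) group may contain global ranks 4..7 on an
# 8-GPU job; inside that group, global rank 4 is group rank 0.

def global_to_group_rank(global_rank: int, group_ranks: list[int]) -> int:
    """Map a global rank to its position inside a process group."""
    return group_ranks.index(global_rank)

tp_group = [4, 5, 6, 7]          # global ranks belonging to one TP group
src_global = 4                   # broadcast source as a global rank

# Passing 4 as the root to a collective over this 4-member group is
# out of range; the group-relative root is 0.
print(global_to_group_rank(src_global, tp_group))
```

In real code, `torch.distributed.get_group_rank(group, global_rank)` performs this translation for an initialized process group.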

sallyjunjun avatar Feb 28 '25 11:02 sallyjunjun

Same problem here. Have you solved it?

mhqmhy avatar Apr 09 '25 06:04 mhqmhy