Yizhou Wang
Hi, while enabling TensorParallel=2 and ZeroStage3 for multi-node training with Megatron-DeepSpeed, I encountered an error on this bcast: RuntimeError: Global rank 0 is not part of group, raised by raise RuntimeError(f"Global rank {global_rank}...
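A minimal sketch of the membership check behind this error (hypothetical, not torch internals): a broadcast over a subgroup must translate the caller's global rank into a group-local rank, and it raises when the rank is absent from the group's rank list.

```python
def get_group_rank(group_ranks, global_rank):
    """Map a global rank to its rank inside the subgroup, mirroring the
    check that raises the error above. `group_ranks` is the ordered list
    of global ranks belonging to the subgroup."""
    if global_rank not in group_ranks:
        raise RuntimeError(f"Global rank {global_rank} is not part of group")
    return group_ranks.index(global_rank)

# With TP=2 over several ranks, a tensor-parallel subgroup might hold
# global ranks [1, 3]; rank 0 then hits exactly this error on the bcast.
tp_group = [1, 3]
print(get_group_rank(tp_group, 3))  # 1: second entry of the subgroup
try:
    get_group_rank(tp_group, 0)
except RuntimeError as e:
    print(e)
```

The fix is usually to make sure the process group passed to the bcast actually contains every rank that participates in the call.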
launcher/multinode_runner.py: map env variables in the launch command for the mpich runner. Previously, launching DeepSpeed with mpich could not properly set env variables such as "RANK", "LOCAL_RANK", "WORLD_SIZE", and "LOCAL_SIZE", which deepspeed would...
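A hedged sketch of the kind of mapping this change describes: translate the variables an MPICH-style launcher exports (the source names here, e.g. PMI_RANK/PMI_SIZE, are assumptions) into the RANK/LOCAL_RANK/WORLD_SIZE/LOCAL_SIZE variables DeepSpeed reads. This is not the actual launcher code, just an illustration of the idea.

```python
# Assumed MPI-side variable names mapped to the names DeepSpeed expects.
MPI_TO_DEEPSPEED = {
    "PMI_RANK": "RANK",
    "PMI_SIZE": "WORLD_SIZE",
    "MPI_LOCALRANKID": "LOCAL_RANK",
    "MPI_LOCALNRANKS": "LOCAL_SIZE",
}

def map_mpi_env(env):
    """Return a copy of `env` with DeepSpeed's expected variables filled in
    from their MPI counterparts, leaving already-set values untouched."""
    out = dict(env)
    for mpi_name, ds_name in MPI_TO_DEEPSPEED.items():
        if mpi_name in out and ds_name not in out:
            out[ds_name] = out[mpi_name]
    return out

env = map_mpi_env({"PMI_RANK": "3", "PMI_SIZE": "8",
                   "MPI_LOCALRANKID": "1", "MPI_LOCALNRANKS": "4"})
print(env["RANK"], env["WORLD_SIZE"], env["LOCAL_RANK"], env["LOCAL_SIZE"])
```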
Hi, we are looking at deepspeed.ops.sparse_attention and found that the current SA is based on triton==1.0.0, which is an old version. The current triton release is 2.x, and the version we support is also 2.x....
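A small hedged sketch of the version mismatch being reported: a gate that only accepts the triton==1.0.0 pin the sparse-attention kernels were written against, so a 2.x install is rejected. The pin comes from the report; the function names are illustrative.

```python
def parse_version(v):
    """Turn a version string like '1.0.0' into a (1, 0, 0) tuple,
    ignoring any local suffix after '+'."""
    return tuple(int(p) for p in v.split("+")[0].split(".")[:3])

REQUIRED = "1.0.0"  # pin stated in the report

def sparse_attention_compatible(installed):
    """True only when the installed triton matches the 1.0 API the
    current sparse-attention kernels were written against."""
    return parse_version(installed)[:2] == parse_version(REQUIRED)[:2]

print(sparse_attention_compatible("1.0.0"))  # True
print(sparse_attention_compatible("2.1.0"))  # False: triton 2.x API differs
```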
File Changes: multinode_runner.py: modify the mpich runner to use launcher_helper; launcher_helper.py: new init script that maps env variables. Description: the previous mpich runner could hit the Linux command-line size limit when the rank...
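A hedged illustration of the size problem this PR addresses: if every env variable is inlined into the launch command (shown here via `-genv KEY VALUE` pairs, flag name assumed), the command grows with the variable count and can exceed the kernel's argument-size limit, while a helper script keeps the command length constant. Paths and names below are illustrative.

```python
def inline_cmd(envs):
    """Launch command with every variable passed on the command line;
    its length grows linearly with the number of variables."""
    parts = ["mpirun"]
    for k, v in envs.items():
        parts += ["-genv", k, v]
    parts.append("python train.py")
    return " ".join(parts)

def helper_cmd():
    """Launch command when a helper script maps the variables instead:
    constant length regardless of how many variables there are."""
    return "mpirun bash launcher_helper.sh python train.py"

many_envs = {f"VAR_{i}": "x" * 64 for i in range(200)}
# The inlined command dwarfs the helper-based one.
print(len(inline_cmd(many_envs)), len(helper_cmd()))
```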
# Motivation: Starting from our next release, XPU-related DeepSpeed kernels will be moved into intel_extension_for_pytorch. This PR adds new op builders and uses the kernel path from intel_extension_for_pytorch. More ops...
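A hedged sketch of the builder change being described: prefer the kernel source location shipped inside intel_extension_for_pytorch when that package is installed, otherwise fall back to an in-tree path. The class, method, and fallback path below are illustrative, not DeepSpeed's actual OpBuilder API.

```python
import importlib.util

class XPUOpBuilderSketch:
    """Illustrative builder that resolves where XPU kernel sources live."""

    def kernel_root(self):
        # Prefer kernels shipped inside intel_extension_for_pytorch.
        spec = importlib.util.find_spec("intel_extension_for_pytorch")
        if spec is not None and spec.submodule_search_locations:
            return list(spec.submodule_search_locations)[0]
        # Fallback: in-tree kernel sources (path assumed for illustration).
        return "csrc/xpu"

print(XPUOpBuilderSketch().kernel_root())
```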