Megatron-LM

[BUG] pipeline_parallel is not available when pp_size > 2

Open qby10 opened this issue 1 year ago • 2 comments

This function is wrong: the program hangs because of the `group` variable.

```python
def _batched_p2p_ops(
    *,
    tensor_send_prev: Optional[torch.Tensor],
    tensor_recv_prev: Optional[torch.Tensor],
    tensor_send_next: Optional[torch.Tensor],
    tensor_recv_next: Optional[torch.Tensor],
    group: torch.distributed.ProcessGroup,
)
```

After the fix:

```python
def _batched_p2p_ops(
    *,
    tensor_send_prev: Optional[torch.Tensor],
    tensor_recv_prev: Optional[torch.Tensor],
    tensor_send_next: Optional[torch.Tensor],
    tensor_recv_next: Optional[torch.Tensor],
)
```

qby10 · Jun 17 '24 13:06

Please add instructions on how to reproduce.

elliottnv · Aug 07 '24 18:08

Marking as stale. No activity in 60 days.

github-actions[bot] · Oct 07 '24 18:10

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] · Aug 03 '25 02:08