Question about the global synchronization/barrier in internode notify_dispatch
Hi, DeepEP team.
There are three global synchronizations/barriers in internode notify_dispatch:
- At the beginning of notify_dispatch, SM0 performs a global barrier (the first warp does the intra-node sync, the second warp does the internode sync).
- After calling nvshmem_int_put_nbi() to send the numbers of tokens per rank/expert to the RDMA ranks, SM0 performs a global barrier again.
- At the end of notify_dispatch, SM0 performs the last global barrier.
I roughly understand that the purpose of the second global barrier is to ensure that all the RDMA operations have completed and the data has been written to the memory of the destination RDMA ranks. But what is the purpose of the other two global barriers, one at the beginning of the function and the other at the end?
As far as I know, a global barrier in NVSHMEM is a collective communication and is time-consuming. The profile you provided also shows that notify_dispatch takes about 1 ms.
So my question is: is there a chance to remove the global barriers at the beginning and end of notify_dispatch?
@LyricZhao @sphish friendly ping
The first barrier ensures the last dispatch/combine has finished, because we are about to clean the head/tail/other metadata; otherwise, ranks that have not finished would read illegal data.
The last barrier ensures the head/tail/other metadata are cleaned before actually running the dispatch/combine; otherwise, the faster ranks would read uncleaned metadata.
There is some room to optimize these parts, but it may require some refactoring; we plan to do it, perhaps several months from now.
Thanks for your prompt response!
Can I understand it as: notify_dispatch and dispatch share the same RDMA buffer (void* rdma_buffer_ptr) and NVLink buffer (void** buffer_ptrs), and these two global barriers are used to ensure that the RDMA buffer and NVLink buffer are cleaned up for the next operation?
Yes, the data buffer is reused as you said. But another reason is that one of notify's responsibilities is to clean the head/tail/metadata for the later dispatch.
Hi team:
After notify_dispatch, there is a CPU synchronization using a while loop, which could cause problems when overlapping DeepEP with other compute kernels. What's more, notify_dispatch sometimes takes several milliseconds, which is much longer than the normal case (~100 us), so I have some related questions:
- Is the CPU sync necessary here? Is there any possibility of removing it?
- Why does notify_dispatch sometimes take more time than the normal case?
Thanks~
Same question, have you got an answer?