Question about the global synchronization/barrier in internode notify_dispatch
Hi, DeepEP team.
There are three global synchronizations/barriers in internode notify_dispatch:
- At the beginning of notify_dispatch, SM0 performs a global barrier (the first warp does the intra-node sync, the second warp does the internode sync).
- After calling nvshmem_int_put_nbi() to send the numbers of tokens per rank/expert to the RDMA ranks, SM0 performs a global barrier again.
- At the end of notify_dispatch, SM0 performs the last global barrier.
I roughly understand that the purpose of the second global barrier is to ensure that all the RDMA operations have completed and the data has been written to the memory of the destination RDMA ranks. But what is the purpose of the other two global barriers, one at the beginning of the function and the other at the end?
As far as I know, a global barrier in NVSHMEM is a collective communication and is time-consuming. The profile you provided also shows that notify_dispatch takes about 1 ms.
So my question is: is there a chance to remove the global barriers at the beginning and end of notify_dispatch?
@LyricZhao @sphish friendly ping
The first barrier ensures the last dispatch/combine has finished, because we are about to clean the head/tail/other metadata; otherwise, ranks that have not finished would read illegal data.
The last barrier ensures the head/tail/other metadata are cleaned before actually running the dispatch/combine; otherwise, the faster ranks would read uncleaned metadata.
There is some room to optimize these parts, but it may require some refactoring; we plan to do it, perhaps several months from now.
Thanks for your prompt response!
Can I understand it as: notify_dispatch and dispatch share the same RDMA buffer (void* rdma_buffer_ptr) and NVLink buffer (void** buffer_ptrs), and these two global barriers are used to ensure that the RDMA buffer and NVLink buffer are cleaned up for the next operation?
Yes, the data buffer is reused as you said. But another reason is that one of notify's responsibilities is to clean the head/tail/metadata for the later dispatch.
Hi team:
After notify_dispatch, there is a CPU synchronization using a while loop, which could cause problems when overlapping DeepEP with other compute kernels. What's more, notify_dispatch sometimes takes several milliseconds, which is much longer than the normal case (~100 us), so I have some related questions:
- Is the CPU sync necessary here? Is there any possibility of removing it?
- Why does notify_dispatch sometimes take more time than the normal case?
Thanks~
Same question, have you got an answer?