bdf

16 comments by bdf

> [@defei-coder](https://github.com/defei-coder) The latest code has changed the transport of normal kernel from IBRC to IBGDA, and these logs show that IBGDA is not functioning properly in your environment >...

Hi @LyricZhao, I used `tests/test_internode.py`. The IB bandwidth measured with the original code under EP64 is 45 GB/s, which is close to the performance reported on GitHub, while the bandwidth tested with the...

> If the tokens are evenly distributed, I guess it should be `45 * 7/8 = 39.375`? Anyway, there is still some room for optimization; we will refactor the code...
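
To make the arithmetic in the quoted estimate concrete, here is a minimal sketch. It assumes (this is not stated in the thread) that EP64 means 8 nodes of 8 GPUs each and that, with evenly distributed tokens, only the 7/8 of traffic bound for other nodes crosses IB. The function name `expected_ib_bandwidth` and its parameters are hypothetical and are not part of DeepEP.

```python
def expected_ib_bandwidth(measured_gbps: float, num_nodes: int) -> float:
    """Scale a measured bandwidth by the fraction of traffic that leaves the node.

    Assumption (not from the thread): tokens are spread evenly over all nodes,
    so 1/num_nodes of the traffic stays local and never touches IB.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    remote_fraction = (num_nodes - 1) / num_nodes
    return measured_gbps * remote_fraction


# 45 GB/s measured; 8 nodes is an assumed layout for EP64 (8 GPUs per node).
print(expected_ib_bandwidth(45.0, 8))  # 39.375
```

Under a different node count the remote fraction changes, so the 7/8 factor only holds for the assumed 8-node layout.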

> DeepEP’s low-latency kernels use cooperative launch to attempt launching a large number of SMs simultaneously. If NCCL occupies some of the SMs, it may prevent DeepEP’s kernels from being...

A question about “If NCCL occupies some of the SMs, it may prevent DeepEP’s kernels from being launched”: why can't DeepEP wait for NCCL to finish?

Thanks for your reply, @xiaofanl-nvidia. In my view, `low_latency_dispatch` differs from `all_gather` in that `low_latency_dispatch` waits for communication through hook functions, eliminating the need for all ranks to execute...