Autumn1998

Results 4 issues of Autumn1998

After successfully building the Docker image, running the deep-100M benchmark produces a stat error: ``` E0218 09:37:10.278347 1 hierarchical_cluster_index.cpp:641] model file data/deep-100M.C3000_F3000_FN16_Flat.puckindex/index.dat stat error ``` It is followed by a file-not-found error: ``` E0218 09:37:10.278366 1 py_api_wrapper.cpp:97] load index Faild Traceback (most recent call last):...

# Description Make grouped linear accept blockwise FP8 input. Relies on https://github.com/NVIDIA/TransformerEngine/pull/1707 for the compact scaling factors. TODO: - [ ] Change the kernel to use compact scaling...

# What does this PR do? Makes use of the latest EP. In this version, the 2 synchronizations are grouped into 1, and num_dispatched_tokens is removed. ## Contribution process...

1. Add topology detection for RDMA 2. Optimize performance for small hidden/large EP 3. Use ~/.deepep/hybrid_ep/jit as the JIT path 4. Group 2 syncs into 1, and move sync point...