feat: allreduce and fusion kernel development
This MR is still a draft. Completed so far:
- Fixed the sync error in the twoshot sync kernel.
- Removed the poorly performing oneshot sync kernel.
- Added support for the FP32 data type in the existing kernel (FP4Quant fusion is not supported with FP32).
- Added support for FP8Quant.
- Added support for the non-fusion case.
- Added support for pre-Hopper architectures.
Todo:
- Add new test cases to the C++ unit tests and adapt the old test cases to the latest changes.
- Adapt the corresponding Torch OP to the latest changes.
cc @Kefeng-Duan @zongfeijing for visibility on this MR.
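
For the C++ unit-test item in the Todo list, one option is to compare kernel output against a simple CPU reference. The sketch below is only an illustration of the two-shot pattern (reduce-scatter followed by all-gather) on host memory. The function name `twoShotAllReduceRef` and the slice layout are assumptions made for this example; they are not taken from the kernel code in this MR, and quantization and fusion are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for a two-shot allreduce (sum).
// inputs[r] is the buffer held by rank r; all buffers must have equal length.
// Returns one output buffer per rank, each holding the element-wise sum.
std::vector<std::vector<float>> twoShotAllReduceRef(std::vector<std::vector<float>> const& inputs)
{
    std::size_t const worldSize = inputs.size();
    assert(worldSize > 0);
    std::size_t const n = inputs[0].size();
    for (auto const& buf : inputs)
    {
        assert(buf.size() == n);
    }

    // Assumed layout: rank r owns the slice [sliceBegin(r), sliceBegin(r + 1)).
    auto sliceBegin = [&](std::size_t r) { return r * n / worldSize; };

    // Shot 1 (reduce-scatter): each owner sums its slice across all ranks.
    std::vector<std::vector<float>> partial(worldSize);
    for (std::size_t owner = 0; owner < worldSize; ++owner)
    {
        std::size_t const b = sliceBegin(owner);
        std::size_t const e = sliceBegin(owner + 1);
        partial[owner].assign(e - b, 0.0f);
        for (std::size_t src = 0; src < worldSize; ++src)
        {
            for (std::size_t i = b; i < e; ++i)
            {
                partial[owner][i - b] += inputs[src][i];
            }
        }
    }

    // Shot 2 (all-gather): every rank copies each owner's reduced slice.
    std::vector<std::vector<float>> outputs(worldSize, std::vector<float>(n, 0.0f));
    for (std::size_t dst = 0; dst < worldSize; ++dst)
    {
        for (std::size_t owner = 0; owner < worldSize; ++owner)
        {
            std::size_t const b = sliceBegin(owner);
            for (std::size_t i = 0; i < partial[owner].size(); ++i)
            {
                outputs[dst][b + i] = partial[owner][i];
            }
        }
    }
    return outputs;
}
```

A test could fill per-rank inputs, run the kernel, and check each rank's result against the matching row returned by this function, using a tolerance suited to the data type under test.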
/bot run --disable-fail-fast
PR_Github #882 [ run ] triggered by Bot
PR_Github #882 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #698 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #900 [ run ] triggered by Bot
PR_Github #900 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #708 completed with status: 'SUCCESS'
/bot run
PR_Github #965 [ run ] triggered by Bot
PR_Github #965 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #752 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #988 [ run ] triggered by Bot
PR_Github #988 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #767 completed with status: 'FAILURE'
/bot run
PR_Github #1050 [ run ] triggered by Bot
PR_Github #1050 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #807 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #1384 [ run ] triggered by Bot
PR_Github #1384 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #1040 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #1390 [ run ] triggered by Bot
PR_Github #1390 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1044 completed with status: 'SUCCESS'
/bot reuse-pipeline
/bot reuse-pipeline
PR_Github #1451 [ reuse-pipeline ] triggered by Bot
/bot reuse-pipeline
PR_Github #1452 [ reuse-pipeline ] triggered by Bot
PR_Github #1452 [ reuse-pipeline ] completed with state ABORTED
Can't reuse PR_Github #0 with status: UNKNOWN
/bot reuse-pipeline
PR_Github #1454 [ reuse-pipeline ] triggered by Bot