[FEA] FP8 GEMM implementation
Is your feature request related to a problem? Please describe. NVIDIA recently published more details about its latest H100 GPU at https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ . The Tensor Cores will support the FP8 E4M3 and E5M2 formats. We wonder whether CUTLASS is going to provide an FP8 GEMM implementation soon.
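For reference, the bit layouts described in the blog post are: E4M3 has 1 sign / 4 exponent / 3 mantissa bits (bias 7), and E5M2 has 1 sign / 5 exponent / 2 mantissa bits (bias 15). Below is a minimal host-side decoder sketch that illustrates those layouts; it is plain reference C++, not a CUTLASS or CUDA API, and the `fp8_to_float` helper is hypothetical:

```cpp
// Sketch: decode the two FP8 formats from the Hopper blog post.
// E4M3: 1 sign / 4 exponent / 3 mantissa bits, bias 7 (no infinities).
// E5M2: 1 sign / 5 exponent / 2 mantissa bits, bias 15 (IEEE-like).
#include <cmath>
#include <cstdint>
#include <cstdio>

// Decode one FP8 byte to float. exp_bits/man_bits select E4M3 (4,3) or E5M2 (5,2).
float fp8_to_float(uint8_t v, int exp_bits, int man_bits) {
    int bias = (1 << (exp_bits - 1)) - 1;            // 7 for E4M3, 15 for E5M2
    int sign = v >> 7;
    int exp  = (v >> man_bits) & ((1 << exp_bits) - 1);
    int man  = v & ((1 << man_bits) - 1);
    float s  = sign ? -1.0f : 1.0f;
    if (exp == 0)                                    // subnormal: no implicit leading 1
        return s * std::ldexp((float)man, 1 - bias - man_bits);
    // NaN/inf encodings differ between the two formats and are omitted here:
    // E4M3 reserves only exp==15, man==7 for NaN; E5M2 follows IEEE conventions.
    return s * std::ldexp(1.0f + man / (float)(1 << man_bits), exp - bias);
}

int main() {
    printf("E4M3 max normal: %g\n", fp8_to_float(0x7E, 4, 3));  // 448
    printf("E5M2 max normal: %g\n", fp8_to_float(0x7B, 5, 2));  // 57344
}
```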
Describe the solution you'd like Ideally we would like an FP8 GEMM implementation with FP16/FP32 accumulation for both the E4M3 and E5M2 formats.
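To make the requested numerics concrete, here is a hedged host-side reference of an FP8 GEMM with FP32 accumulation. It reuses the `fp8_to_float` decoder sketched above; `gemm_e4m3_ref` is a hypothetical helper for illustration, not CUTLASS's actual interface:

```cpp
// Sketch of the requested numerics: C (m x n) = A (m x k) * B (k x n),
// row-major, E4M3 storage for the inputs, FP32 accumulation.
#include <cstdint>

float fp8_to_float(uint8_t v, int exp_bits, int man_bits);  // decoder from the sketch above

void gemm_e4m3_ref(int m, int n, int k,
                   const uint8_t* A, const uint8_t* B, float* C) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;                        // FP32 accumulator
            for (int p = 0; p < k; ++p)
                acc += fp8_to_float(A[i * k + p], 4, 3) *
                       fp8_to_float(B[p * n + j], 4, 3);
            C[i * n + j] = acc;
        }
}
```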
Describe alternatives you've considered This is a new feature on new Nvidia hardware, so there is no existing alternative.
Additional context None.
Yes, once the CUDA Toolkit supports it.
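For context on what toolkit support looks like: CUDA 11.8 added `<cuda_fp8.h>` with host/device FP8 types. A minimal sketch, assuming CUDA 11.8 or newer:

```cpp
// Sketch: the FP8 types shipped in CUDA 11.8's <cuda_fp8.h>.
// Compile with nvcc; earlier toolkits do not provide this header.
#include <cstdio>
#include <cuda_fp8.h>

int main() {
    float x = 0.1234f;
    __nv_fp8_e4m3 a(x);               // round-to-nearest conversion to E4M3
    __nv_fp8_e5m2 b(x);               // same value in E5M2 (fewer mantissa bits)
    printf("fp32 %f -> e4m3 %f, e5m2 %f\n",
           x, (float)a, (float)b);    // explicit conversions back to float
    return 0;
}
```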
@jianyuh is your interest in dense, sparse, or both?
@mnicely Thanks for checking. We are mostly interested in dense as a first step.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.