Wenbo Yang
Inspired by https://github.com/NVIDIA/cutlass/pull/1932 and https://github.com/NVIDIA/cutlass/pull/2037, implement blockwise-scaling kernels for platforms before SM90.

* FP8 blockwise/groupwise scaling kernel for Ada (L20, L40S, 4090); requires the accumulator type to be `float`
* INT8 blockwise/groupwise...
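For readers unfamiliar with blockwise scaling, the arithmetic can be sketched on the host as follows. This is a minimal reference under stated assumptions, not code from the PR: the function name `blockwise_scaled_gemm`, the scale layout (one scale per row block and K block for A, one per column block and K block for B), and the use of `float` to stand in for the quantized FP8/INT8 inputs are all hypothetical. Each K block's partial dot product is accumulated in `float` and multiplied by the two matching scales before being added to the result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not the kernel from the PR.
// A is row-major (M x K); B is stored as N x K (column-major K x N);
// D is row-major (M x N). Quantized inputs are modeled as float.
// scale_a has (M / block_m) x k_blocks entries, scale_b has (N / block_n) x k_blocks.
std::vector<float> blockwise_scaled_gemm(
    const std::vector<float>& a, const std::vector<float>& b,
    const std::vector<float>& scale_a, const std::vector<float>& scale_b,
    std::size_t m, std::size_t n, std::size_t k,
    std::size_t block_m, std::size_t block_n, std::size_t block_k) {
  assert(block_m > 0 && block_n > 0 && block_k > 0);
  const std::size_t k_blocks = (k + block_k - 1) / block_k;
  std::vector<float> d(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.0f;  // float accumulator
      for (std::size_t kb = 0; kb < k_blocks; ++kb) {
        // Unscaled partial dot product over one K block.
        float partial = 0.0f;
        const std::size_t k_end = std::min(k, (kb + 1) * block_k);
        for (std::size_t p = kb * block_k; p < k_end; ++p) {
          partial += a[i * k + p] * b[j * k + p];
        }
        // Apply the scales that belong to this (row block, column block, K block).
        const float sa = scale_a[(i / block_m) * k_blocks + kb];
        const float sb = scale_b[(j / block_n) * k_blocks + kb];
        acc += partial * sa * sb;
      }
      d[i * n + j] = acc;
    }
  }
  return d;
}
```

In this sketch, setting `block_m` to 1 gives one scale per row and K block, which is one way to model a finer, groupwise granularity; whether the PR uses this exact layout is not stated in the description.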
Add the interface types and functions required by `DefaultEpilogue` to the template class `epilogue::thread::Convert`.

Testing code:

```c++
using CollectiveEpilogue = cutlass::epilogue::collective::DefaultEpilogue<
    ElementD,
    cutlass::detail::TagToStrideC_t,
    cutlass::detail::TagToStrideC_t,
    cutlass::epilogue::thread::Convert,
    cutlass::gemm::EpilogueDefault>;
```
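To illustrate what "interface types and functions" can mean for a thread-level functor, here is a self-contained sketch that does not depend on CUTLASS. Everything in the `sketch` namespace is hypothetical: `ConvertLike` and `run_epilogue` are stand-ins, and the member set shown (element type aliases, `kCount`, `Params`, `is_source_needed()`, `set_k_partition()`, and two call operators) is an assumption about what a collective epilogue typically queries. The exact list `DefaultEpilogue` needs depends on the CUTLASS version.

```cpp
#include <array>
#include <cstddef>

namespace sketch {

// Hypothetical stand-in for a thread-level conversion functor; not the CUTLASS class.
template <typename ElementOutput_, int Count, typename ElementAccumulator_>
struct ConvertLike {
  // Type aliases a collective epilogue might look up on the functor (assumed set).
  using ElementOutput = ElementOutput_;
  using ElementAccumulator = ElementAccumulator_;
  using ElementCompute = ElementOutput_;
  static constexpr int kCount = Count;
  using FragmentOutput = std::array<ElementOutput, kCount>;
  using FragmentAccumulator = std::array<ElementAccumulator, kCount>;

  struct Params {};  // a pure conversion carries no runtime parameters

  ConvertLike() = default;
  explicit ConvertLike(const Params&) {}

  // A pure conversion never reads the source tensor C.
  bool is_source_needed() const { return false; }

  // Accepted for interface compatibility; split-K information is ignored here.
  void set_k_partition(int /*k_partition*/, int /*k_partition_count*/) {}

  // Convert each accumulator element to the output type.
  FragmentOutput operator()(const FragmentAccumulator& acc) const {
    FragmentOutput out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
      out[i] = static_cast<ElementOutput>(acc[i]);
    }
    return out;
  }

  // Overload taking a source fragment; the source is ignored.
  FragmentOutput operator()(const FragmentAccumulator& acc,
                            const FragmentOutput& /*source*/) const {
    return (*this)(acc);
  }
};

// Hypothetical consumer that relies only on the interface above.
template <typename Op>
typename Op::FragmentOutput run_epilogue(
    const Op& op,
    const typename Op::FragmentAccumulator& acc,
    const typename Op::FragmentOutput& source) {
  return op.is_source_needed() ? op(acc, source) : op(acc);
}

}  // namespace sketch
```

A consumer written against these members compiles with any functor that provides them, which is presumably why adding them lets `Convert` be plugged into `DefaultEpilogue`.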