TransformerEngine
[Feature Request] Any roadmap for supporting FP8 attention calculation?
Currently only FP16/BF16 are supported in the FusedAttention class.
I think this commit added fp8 fused attention: https://github.com/NVIDIA/TransformerEngine/commit/989a53a06478a4223ffb2fc2fc92b5febcf9d8c1#diff-236e240f7f5506de96cfd5f61c77c7142905dabada33f6f0c68094724dbfb9b4
FP8 attention is supported with the Delayed Scaling recipe. See the fp8_dpa and fp8_mha arguments in the recipe docs.
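For illustration, here is a minimal sketch of how those recipe arguments can be used, assuming the DelayedScaling signature and te.fp8_autocast/te.TransformerLayer APIs from recent TransformerEngine releases; exact argument names and FP8-attention hardware/head-dim constraints should be checked against your installed version and the recipe docs.

```python
# Sketch only: enable FP8 dot-product attention / MHA via the Delayed Scaling recipe.
# fp8_dpa and fp8_mha are the arguments referenced above; other values are
# illustrative defaults — verify against your installed TransformerEngine version.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

recipe = DelayedScaling(
    fp8_format=Format.HYBRID,  # E4M3 forward, E5M2 backward
    fp8_dpa=True,              # run dot-product attention in FP8
    fp8_mha=True,              # keep MHA projections in FP8 as well
)

layer = te.TransformerLayer(
    hidden_size=1024,
    ffn_hidden_size=4096,
    num_attention_heads=16,
    params_dtype=torch.bfloat16,
).cuda()

# Default input layout is (seq_len, batch, hidden).
x = torch.randn(128, 2, 1024, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)
```

Note that FP8 attention requires a GPU and cuDNN/backend combination that supports it (e.g. Hopper-class hardware); on unsupported setups the flags may fall back or error out.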