TransformerEngine
TE 1.4 fused attn caused a NaN when backward
I hit a NaN when using TE 1.4. After enabling PyTorch's autograd.detect_anomaly, it captured the NaN traceback:
Error detected in FusedAttnFuncBackward.
...
[rank5]: RuntimeError: Function 'FusedAttnFuncBackward' returned nan values in its 2th output.
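For context, the traceback above was captured by wrapping the training step in PyTorch's anomaly-detection context manager, which makes autograd report the first backward op that produces NaN. A minimal sketch of the pattern (the model and tensors here are placeholders, not the actual TE setup):

```python
import torch

def train_step(x, w):
    # Placeholder forward/backward; in the real case this calls
    # a TransformerEngine layer that uses the fused attention backend.
    loss = (x @ w).sum()
    loss.backward()
    return w.grad

x = torch.randn(4, 8)
w = torch.randn(8, 2, requires_grad=True)

# detect_anomaly re-runs backward with extra checks and raises a
# RuntimeError naming the offending Function (e.g. FusedAttnFuncBackward)
# as soon as any backward output contains NaN.
with torch.autograd.detect_anomaly():
    grad = train_step(x, w)
```

Note that detect_anomaly slows training considerably, so it is only suitable for debugging runs like this one.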
After installing TE 1.7, the NaN disappeared.
@cyanguwa Was there a bug fix in the fused attention backend's backward pass between these two releases that could explain this?