
TE 1.4 fused attention produces NaN in the backward pass

Open Jack47 opened this issue 1 year ago • 1 comment

I hit a NaN while using TE 1.4. After I enabled PyTorch's torch.autograd.detect_anomaly, it captured the following NaN traceback:

Error detected in FusedAttnFuncBackward.
...
[rank5]: RuntimeError: Function 'FusedAttnFuncBackward' returned nan values in its 2th output.
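For anyone who wants to reproduce this kind of report, here is a minimal sketch of how torch.autograd.detect_anomaly surfaces a NaN produced in a backward function. It does not use TransformerEngine at all: NaNGrad is a hypothetical stand-in op whose backward deliberately returns NaN, so only the shape of the error message is comparable to the FusedAttnFuncBackward one above.

```python
import torch


class NaNGrad(torch.autograd.Function):
    """Hypothetical op (not part of TE): identity forward, NaN gradient."""

    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        # Deliberately poison the gradient so anomaly detection fires.
        return grad_out * float("nan")


def find_nan_in_backward():
    """Run backward under anomaly detection; return the error text, if any."""
    x = torch.ones(4, requires_grad=True)
    try:
        with torch.autograd.detect_anomaly():
            NaNGrad.apply(x).sum().backward()
    except RuntimeError as err:
        return str(err)
    return None


message = find_nan_in_backward()
print(message)
```

Anomaly detection slows training noticeably, so it is usually enabled only while hunting for the failing op and then switched off again.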

After I installed TE 1.7, the NaN disappeared.

Jack47, Aug 13 '24 05:08

@cyanguwa Was there a bug fix in the fused attention backend's backward pass between these two releases that could explain this?

ksivaman, Aug 15 '24 18:08