TransformerEngine

change softmax_lse correction of CP to FP32

Open · xrennvidia opened this pull request 11 months ago • 1 comment

Description

  • The softmax_lse correction currently runs in FP64; we can lower it to FP32.
  • Use log1p so the correction is consistent with PR1401 (see the sketch below).
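
For context, here is a minimal PyTorch sketch of this kind of correction (an editor's illustration with a hypothetical helper name, not the PR's actual implementation): merging two per-rank softmax log-sum-exp tensors entirely in FP32, using log1p.

```python
import torch

def combine_softmax_lse(lse_a: torch.Tensor, lse_b: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: compute log(exp(lse_a) + exp(lse_b)) in FP32."""
    lse_max = torch.maximum(lse_a, lse_b)
    lse_min = torch.minimum(lse_a, lse_b)
    # exp(lse_min - lse_max) <= 1, so the exp() cannot overflow in FP32;
    # log1p stays accurate when that term is near 0 (one rank dominates).
    return lse_max + torch.log1p(torch.exp(lse_min - lse_max))
```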

Type of change

  • [ ] Documentation change (change only to the documentation, either a fix or new content)
  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Infra/Build change
  • [x] Code refactoring

Checklist:

  • [x] I have read and followed the contributing guidelines
  • [x] The functionality is complete
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [x] New and existing unit tests pass locally with my changes

xrennvidia · Mar 07 '25 03:03

/te-ci pytorch L1

xrennvidia · Mar 07 '25 21:03

/te-ci pytorch L1

xrennvidia · Apr 28 '25 20:04

/te-ci pytorch L1

xrennvidia · Apr 29 '25 02:04

CI failures are not related to this PR.

log1p(x) provides more accuracy than log(1 + x) when x is close to 0.
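
As a quick illustration (editor's sketch; printed values are approximate):

```python
import torch

x = torch.tensor(1e-6, dtype=torch.float32)
print(torch.log(1 + x))  # ~9.5367e-07: 1 + x rounds in FP32 before the log
print(torch.log1p(x))    # ~1.0000e-06: full accuracy for small x
```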

softmax_lse is downcast from double to float because cuDNN produces softmax_lse in FP32, and combining multiple copies of it across CP ranks should still stay within float limits.

cyanguwa · Apr 29 '25 22:04
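
As a rough check on the float-limits argument above (editor's note, with illustrative numbers): merging partial softmax sums across CP ranks can raise the log-sum-exp by at most log(cp_size).

```python
import math

# log(sum_i exp(lse_i)) <= max_i lse_i + log(cp_size), so combining
# FP32 per-rank lse values adds at most log(cp_size) on top of the max.
cp_size = 16
print(math.log(cp_size))  # ~2.77 -- negligible next to FP32's max (~3.4e38)
```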