Handle len(cached_x.grad_fn.next_functions) == 1 in cached_cast

Open jiafatom opened this issue 4 years ago • 8 comments

While using apex for mixed precision training, I found a case where the check if x.requires_grad and cached_x.requires_grad: passes but the tuple cached_x.grad_fn.next_functions contains only one element. Indexing it with [1] then raises the following error:

    if cached_x.grad_fn.next_functions[1][0].variable is not x:
    IndexError: tuple index out of range

This PR deals with this corner case.
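For illustration only, here is a minimal sketch of one way such a guard could look. It is not the diff in this PR. The helper name is_autograd_parent is hypothetical, and it assumes that next_functions is a tuple of (node, index) pairs whose leaf nodes expose a .variable attribute, as the line in the traceback implies. Note that it scans every entry instead of looking only at index 1, which is a looser check than the original. The demo uses plain stand-in objects instead of real tensors so that it runs without PyTorch.

```python
from types import SimpleNamespace


def is_autograd_parent(cached_x, x):
    """Return True if x is a leaf variable among cached_x's direct autograd inputs.

    Hypothetical helper: it iterates over next_functions instead of indexing [1],
    so a one-element tuple no longer raises IndexError.
    """
    grad_fn = getattr(cached_x, "grad_fn", None)
    if grad_fn is None:
        return False
    for node, _ in grad_fn.next_functions:
        # Entries may be None for inputs that do not require grad.
        if node is not None and getattr(node, "variable", None) is x:
            return True
    return False


# Stand-ins that mimic the shape of tensor.grad_fn.next_functions (assumption).
x = object()
leaf = SimpleNamespace(variable=x)
one_entry = SimpleNamespace(
    grad_fn=SimpleNamespace(next_functions=((leaf, 0),))
)
two_entries = SimpleNamespace(
    grad_fn=SimpleNamespace(next_functions=((None, 0), (leaf, 0)))
)

found_in_one = is_autograd_parent(one_entry, x)
found_in_two = is_autograd_parent(two_entries, x)
found_other = is_autograd_parent(one_entry, object())
```

In cached_cast, a check like this could replace the direct [1][0] lookup, with the existing error raised only when it returns False. Whether scanning all entries is acceptable there is a judgment call for the maintainers.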

jiafatom avatar Jan 27 '22 22:01 jiafatom

@crcrpar @eqy @definitelynotmcarilli @carlc-nv could you please review? I also see a similar issue, https://github.com/NVIDIA/apex/issues/1227, and a hotfix posted at https://github.com/NVIDIA/apex/issues/694#issuecomment-918833904. Which one is the proper fix? Can we proceed? Thanks.

jiafatom avatar Jan 27 '22 22:01 jiafatom

friendly ping @crcrpar @eqy

jiafatom avatar Feb 02 '22 01:02 jiafatom

Nice fix! Looking forward to seeing it merged!

xvjiarui avatar Feb 16 '22 05:02 xvjiarui

After applying this fix, multi-node multi-GPU training is very slow. Could anyone help?

shizhediao avatar Feb 24 '22 06:02 shizhediao

After applying this fix, multi-node multi-GPU training is very slow. Could anyone help?

Just wondering whether you found out why it is slow.

tianda-cerebras avatar May 13 '22 16:05 tianda-cerebras

After applying this fix, multi-node multi-GPU training is very slow. Could anyone help?

Just wondering whether you found out why it is slow.

Are you using an --opt_level of some kind that splits the computations? If so, you are probably spending time waiting for results from another process.

nachi9211 avatar Aug 14 '22 04:08 nachi9211

Hi, is there any progress on this PR?

mindest avatar Aug 24 '22 03:08 mindest

Hi, it seems this fix has not yet been merged into NVIDIA/apex. Is there any other known workaround? I am facing the same error as the OP.

pschydlo avatar Oct 27 '22 12:10 pschydlo