Handle len(cached_x.grad_fn.next_functions) == 1 in cached_cast
I am using apex for mixed-precision training and found a case where, inside the branch if x.requires_grad and cached_x.requires_grad:, the tuple cached_x.grad_fn.next_functions contains only one element. Indexing element [1] of that tuple then fails with:
if cached_x.grad_fn.next_functions[1][0].variable is not x:
IndexError: tuple index out of range
This PR deals with this corner case.
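For readers following along, here is a minimal sketch of the kind of guard that avoids the IndexError. It is not necessarily the exact change in this PR: the helper names autograd_parent and is_confirmed_mismatch are hypothetical, and treating a too-short tuple as "cannot verify, so do not raise" is an assumption. The stand-in objects only mimic the shape of grad_fn.next_functions (a tuple of (node, index) pairs), so the snippet runs without torch.

```python
from types import SimpleNamespace


def autograd_parent(next_functions, index=1):
    """Return next_functions[index][0].variable, or None if it cannot be read.

    Hypothetical helper: `next_functions` is expected to look like
    grad_fn.next_functions, i.e. a tuple of (node, input_index) pairs.
    """
    # Corner case from this PR: the tuple may hold a single element,
    # so indexing [1] directly would raise IndexError.
    if len(next_functions) <= index:
        return None
    node = next_functions[index][0]
    # The node may be None or may lack a .variable attribute.
    return getattr(node, "variable", None)


def is_confirmed_mismatch(x, next_functions):
    """True only when a parent is found and it is not x (assumed policy)."""
    parent = autograd_parent(next_functions)
    return parent is not None and parent is not x


# Stand-ins for tensors and autograd nodes, for illustration only.
x = object()
accumulate_grad = SimpleNamespace(variable=x)
two_inputs = ((None, 0), (accumulate_grad, 0))
one_input = ((None, 0),)
```

With these stand-ins, is_confirmed_mismatch(x, one_input) returns False instead of raising, while a different object checked against two_inputs is still reported as a mismatch.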
@crcrpar @eqy @definitelynotmcarilli @carlc-nv could you please review? I also see a similar issue, https://github.com/NVIDIA/apex/issues/1227, and a hotfix proposed at https://github.com/NVIDIA/apex/issues/694#issuecomment-918833904. Which one is the proper fix? Can we proceed? Thanks.
Friendly ping @crcrpar @eqy
Nice fix! Looking forward to seeing it merged!
After applying this fix, multi-node, multi-GPU training is very slow. Could anyone help?
Just wondering whether you found out why it is slow.
Are you using an --opt_level of some kind that splits the computations? If so, you are probably spending time waiting for results from another process.
Hi, is there any progress on this PR?
Hi, it seems this fix has not yet been merged into NVIDIA/apex. Is there any other known workaround? I am facing the same error as the OP.