Jaemin Choi
Tests pass on OLCF Summit with `jsrun -n2 -a1 -c2 -g1 --smpiargs="-disable_gpu_hooks" ./bandwidth +ppn 1 +pemap L0 +commap L1`. Documentation still needs to be added.
Still hangs on OLCF Summit.
Still hangs on Summit.
Once `[iter]` (reference-number matching) is removed from `recv()`, the code works fine.
@stwhite91 Thanks for pointing that out. I'm not sure yet what exactly is different, but that test passes and this one hangs.
Reproducible on OLCF Summit.
Could we replace `self._lock` itself with the timeout-enabled one? The parent Apex distributed optimizer class also uses `self._lock` (e.g., [here](https://github.com/NVIDIA/apex/blob/master/apex/contrib/optimizers/distributed_fused_adam.py#L832)) and we want to catch those as well if they...
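A minimal sketch of what a drop-in timeout-enabled replacement for `self._lock` could look like, assuming the parent class acquires the lock via `with self._lock:` or `self._lock.acquire()` (the class name `TimeoutLock` and the timeout value are hypothetical, not part of Apex):

```python
import threading

class TimeoutLock:
    """Hypothetical drop-in replacement for threading.Lock that raises
    if acquisition takes longer than `timeout` seconds, so hangs in any
    code path using self._lock (including the parent class) surface as
    errors instead of deadlocks."""

    def __init__(self, timeout=60.0):
        self._inner = threading.Lock()
        self._timeout = timeout

    def acquire(self, blocking=True, timeout=-1):
        if not blocking:
            # threading.Lock forbids a timeout with blocking=False
            return self._inner.acquire(False)
        if timeout == -1:
            timeout = self._timeout
        if not self._inner.acquire(True, timeout):
            raise RuntimeError(
                f"failed to acquire lock within {timeout} s (possible deadlock)"
            )
        return True

    def release(self):
        self._inner.release()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc):
        self.release()

# Assigning self._lock = TimeoutLock(timeout=30.0) in __init__ would
# route the parent class's `with self._lock:` blocks through the
# timeout-enabled acquire as well.
```

Since the wrapper implements `acquire`/`release` and the context-manager protocol, the parent class's existing lock usage should not need to change.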
Testing a modified version in this draft PR: https://github.com/NVIDIA/NeMo/pull/9087
@jbaczek Could you add the changes in [this NeMo PR](https://github.com/NVIDIA/NeMo/pull/8290/files) to the `AutocastTransformerLayer` here as well? We would need this to comply with the changes to TP knobs in [this...
LGTM; at the time of the original PR we only looked at adding master weights for FP16 AMP. @crcrpar Could you review this as well?