Jaemin Choi

Results: 10 comments by Jaemin Choi

Tests pass on OLCF Summit: `jsrun -n2 -a1 -c2 -g1 --smpiargs="-disable_gpu_hooks" ./bandwidth +ppn 1 +pemap L0 +commap L1`. Documentation still needs to be added.

Still hangs on OLCF Summit.

Once `[iter]` (reference number matching) is removed from `recv()`, the code works fine.

@stwhite91 Thanks for pointing that out. I'm not sure yet what exactly is different, but that test passes and this one hangs.

Could we replace `self._lock` itself with the timeout-enabled one? The parent Apex distributed optimizer class also uses `self._lock` (e.g., [here](https://github.com/NVIDIA/apex/blob/master/apex/contrib/optimizers/distributed_fused_adam.py#L832)) and we want to catch those as well if they...
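For illustration, a minimal sketch of what a timeout-enabled replacement for `self._lock` might look like. It assumes `self._lock` is a plain `threading.Lock` used through `acquire()`/`release()` or a `with` block; the class name `TimeoutLock` and its `timeout` parameter are hypothetical and not part of Apex or NeMo.

```python
import threading


class TimeoutLock:
    """Hypothetical stand-in for threading.Lock that raises instead of hanging."""

    def __init__(self, timeout=5.0):
        self._lock = threading.Lock()
        self._timeout = timeout  # default wait in seconds (assumed value)

    def acquire(self, blocking=True, timeout=None):
        # Non-blocking attempts keep the usual True/False behaviour.
        if not blocking:
            return self._lock.acquire(False)
        limit = self._timeout if timeout is None else timeout
        # Lock.acquire returns False when the timeout expires.
        if not self._lock.acquire(True, limit):
            raise TimeoutError(f"lock not acquired within {limit} s")
        return True

    def release(self):
        self._lock.release()

    def locked(self):
        return self._lock.locked()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()
        return False  # do not swallow exceptions from the with-body
```

Because the wrapper keeps the same `acquire`/`release`/context-manager surface, assigning it to `self._lock` would also cover code paths in a parent class that use the lock, provided they rely only on that surface.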

Testing a modified version in this draft PR: https://github.com/NVIDIA/NeMo/pull/9087

@jbaczek Could you add the changes in [this NeMo PR](https://github.com/NVIDIA/NeMo/pull/8290/files) to the `AutocastTransformerLayer` here as well? We would need this to comply with the changes to TP knobs in [this...

LGTM. We only looked at adding master weights for FP16 AMP at the time of the original PR. @crcrpar Could you review this as well?