Jaemin Choi
Tests pass on OLCF Summit with `jsrun -n2 -a1 -c2 -g1 --smpiargs="-disable_gpu_hooks" ./bandwidth +ppn 1 +pemap L0 +commap L1`. Documentation still needs to be added.
Still hangs on OLCF Summit.
Still hangs on Summit.
Once `[iter]` (reference-number matching) is removed from `recv()`, the code works fine.
@stwhite91 Thanks for pointing that out. I'm not sure yet what exactly is different, but that test passes and this one hangs.
Reproducible on OLCF Summit.
Could we replace `self._lock` itself with the timeout-enabled one? The parent Apex distributed optimizer class also uses `self._lock` (e.g., [here](https://github.com/NVIDIA/apex/blob/master/apex/contrib/optimizers/distributed_fused_adam.py#L832)) and we want to catch those as well if they...
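A minimal sketch of what a drop-in timeout-enabled replacement for `self._lock` could look like, assuming the parent class acquires the lock via `with self._lock:` or `self._lock.acquire()` (the class name `TimeoutLock` and the timeout value are hypothetical, not part of Apex):

```python
import threading

class TimeoutLock:
    """Hypothetical drop-in replacement for threading.Lock that raises
    if acquisition takes longer than `timeout` seconds, so hangs in any
    code path using self._lock (including the parent class) surface as
    errors instead of deadlocks."""

    def __init__(self, timeout=60.0):
        self._inner = threading.Lock()
        self._timeout = timeout

    def acquire(self, blocking=True, timeout=-1):
        if not blocking:
            # threading.Lock forbids a timeout with blocking=False
            return self._inner.acquire(False)
        if timeout == -1:
            timeout = self._timeout
        if not self._inner.acquire(True, timeout):
            raise RuntimeError(
                f"failed to acquire lock within {timeout} s (possible deadlock)"
            )
        return True

    def release(self):
        self._inner.release()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc):
        self.release()

# Assigning self._lock = TimeoutLock(timeout=30.0) in __init__ would
# route the parent class's `with self._lock:` blocks through the
# timeout-enabled acquire as well.
```

Since the wrapper implements `acquire`/`release` and the context-manager protocol, the parent class's existing lock usage should not need to change.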
Testing a modified version in this draft PR: https://github.com/NVIDIA/NeMo/pull/9087
@jbaczek Could you add the changes in [this NeMo PR](https://github.com/NVIDIA/NeMo/pull/8290/files) to the `AutocastTransformerLayer` here as well? We would need this to comply with the changes to TP knobs in [this...
LGTM; at the time of the original PR we only looked at adding master weights for FP16 AMP. @crcrpar Could you review this as well?