Kazuki Fujii

2 issues by Kazuki Fujii

**Describe the bug** When the data parallel size is odd and the distributed optimizer is enabled, training stops with the following error. ```bash [Proxy Service 0] Failed to execute operation Connect...

## Issue When training with TransformerEngine and Megatron-LM, I found that the loss curve changed significantly after loading a checkpoint. This problem did not occur when...
