Results: 2 issues by Kazuki Fujii
**Describe the bug** When the data parallel size is odd and the distributed optimizer is enabled, training stops with the following error: `[Proxy Service 0] Failed to execute operation Connect`...
**Issue** When using TransformerEngine with Megatron-LM for training, I found that the loss curve changed significantly after loading a checkpoint. This problem did not occur when...
stale