Max Kovalenko
Introduce a use_secondary_tensor boolean variable to shorten notation and improve readability.
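As a rough illustration of this kind of refactor (a hypothetical sketch, not the actual DeepSpeed diff: the FakeParam class and select_shard helper are made up, only the ds_tensor / ds_secondary_tensor attribute names mirror ZeRO):

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class FakeParam:
    # Stand-in for a ZeRO parameter; everything except the attribute names
    # ds_tensor / ds_secondary_tensor is purely illustrative.
    ds_tensor: torch.Tensor
    ds_secondary_tensor: Optional[torch.Tensor] = None

def select_shard(param: FakeParam, forward_pass: bool) -> torch.Tensor:
    # Naming the compound condition once replaces several inline repetitions
    # of it, which is the readability improvement the entry refers to.
    use_secondary_tensor = param.ds_secondary_tensor is not None and not forward_pass
    return param.ds_secondary_tensor if use_secondary_tensor else param.ds_tensor

print(select_shard(FakeParam(ds_tensor=torch.zeros(4), ds_secondary_tensor=torch.ones(2)),
                   forward_pass=False))
```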
Enabled gradient accumulation in the bf16 optimizer so that fp32 gradients are updated as soon as each gradient becomes available. This improves device utilization on some back-ends by parallelizing the underlying workload across hardware engines...
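A minimal sketch of the idea, assuming a persistent fp32 accumulator per bf16 parameter (the fp32_grad_acc dict and hook wiring below are illustrative, not the bf16 optimizer implementation):

```python
import torch

model = torch.nn.Linear(8, 8).to(torch.bfloat16)
# One fp32 accumulator per bf16 parameter (assumed layout for this sketch).
fp32_grad_acc = {p: torch.zeros_like(p, dtype=torch.float32) for p in model.parameters()}

def make_hook(param):
    def hook(grad):
        # Fires as soon as this parameter's bf16 gradient is produced, so the
        # fp32 accumulation can start without waiting for the whole backward.
        fp32_grad_acc[param].add_(grad.float())
    return hook

for p in model.parameters():
    p.register_hook(make_hook(p))

for _ in range(4):  # four micro-batches of gradient accumulation
    model(torch.randn(2, 8, dtype=torch.bfloat16)).float().sum().backward()
    model.zero_grad(set_to_none=True)  # the fp32 accumulators hold the running sum
```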
The new "timers" section describes configuration for different timers. Specifically, in the "throughput" section, it is possible to disable the throughput timer (enabled by default). This allows to avoid the...
* Use all_reduce instead of all_gather to fetch module parameters (see the sketch after this list). This improves performance by reducing the overhead of concatenation and slicing, which are no longer required.
* Instead, all tensors...
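Schematically, the trick could look like the following standalone sketch (not the DeepSpeed code; the function name and single-process demo are assumptions): each rank writes its shard into its own slice of a zero-filled full-size buffer, and a single all_reduce(SUM) reconstructs the parameter without any concatenation or slicing step.

```python
import os
import torch
import torch.distributed as dist

def fetch_param_via_all_reduce(shard: torch.Tensor, full_numel: int) -> torch.Tensor:
    # Each rank owns a contiguous shard of the flattened parameter. Summing
    # zero-padded buffers across ranks reconstructs the full tensor in place,
    # with no torch.cat or slicing as an all_gather-based fetch would need.
    rank = dist.get_rank()
    full = torch.zeros(full_numel, dtype=shard.dtype)
    offset = rank * shard.numel()
    full[offset:offset + shard.numel()] = shard
    dist.all_reduce(full, op=dist.ReduceOp.SUM)
    return full

if __name__ == "__main__":
    # Single-process demo; in real use this runs under a multi-rank launcher.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)
    print(fetch_param_via_all_reduce(torch.arange(4.0), full_numel=4))
```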
**Describe the bug**
The pre-backward and post-backward hook mechanism works by attaching a custom autograd function to tensors that are either inputs to the module (for [post-backward](https://github.com/microsoft/DeepSpeed/blob/3dd7ccff8103be60c31d963dd2278d43abb68fd1/deepspeed/runtime/zero/parameter_offload.py#L387)) or outputs...
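A hypothetical, stripped-down version of that mechanism (the PreBackwardTrigger class and the print callback are illustrative, not the DeepSpeed hooks): an identity autograd Function is wrapped around a module's output so that its backward() fires when the autograd engine reaches that point in the graph.

```python
import torch

class PreBackwardTrigger(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, callback):
        ctx.callback = callback
        return x.clone()  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        ctx.callback()            # e.g. re-gather partitioned parameters
        return grad_output, None  # gradient is passed through unchanged

module = torch.nn.Linear(4, 4)

def forward_hook(mod, inputs, output):
    # Wrapping the module *output* means the trigger fires just before the
    # backward pass re-enters this module (a "pre-backward" hook); wrapping
    # the inputs instead would fire after the module's backward completes.
    return PreBackwardTrigger.apply(output, lambda: print("pre-backward for", mod))

module.register_forward_hook(forward_hook)
module(torch.randn(2, 4)).sum().backward()
```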
Compiled Autograd is an extension to torch.compile that enhances the autograd engine by capturing a larger backward computation graph at runtime. This allows more comprehensive optimization of the backward...
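As a rough sketch of how it is enabled in a training step (the torch._dynamo.config.compiled_autograd flag is a private, version-dependent PyTorch interface, so treat the exact knob as an assumption and check the docs for your release):

```python
import torch

# Ask dynamo to also capture and compile the backward graph at runtime
# (assumed flag; see the Compiled Autograd tutorial for your PyTorch version).
torch._dynamo.config.compiled_autograd = True

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))

@torch.compile
def train_step(x):
    loss = model(x).sum()
    # With Compiled Autograd enabled, this backward is captured as a larger
    # graph and compiled, instead of being executed eagerly node by node.
    loss.backward()
    return loss

train_step(torch.randn(4, 8))
```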