Max Kovalenko
We've discovered the following issues in the current implementation of the Throughput timer:

- The timer invokes synchronize() twice on each step, at [start](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/utils/timer.py#L240) and [stop](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/utils/timer.py#L252).
- Calling synchronize() ensures...
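For illustration, here is a minimal sketch of the pattern in question (the class and flag names below are hypothetical, not DeepSpeed's actual `timer.py` code): both `start()` and `stop()` trigger a device sync each step, and gating the sync behind an opt-in flag is one way to avoid blocking the host twice per iteration.

```
import time
import torch

class ThroughputTimerSketch:
    """Hypothetical sketch, not the real DeepSpeed ThroughputTimer."""

    def __init__(self, synchronize_each_step=False):
        # When False, start()/stop() no longer block the host on the device
        # queue, avoiding the two synchronize() calls per training step.
        self.synchronize_each_step = synchronize_each_step
        self._start_time = None
        self.elapsed = 0.0

    def _maybe_synchronize(self):
        if self.synchronize_each_step and torch.cuda.is_available():
            torch.cuda.synchronize()

    def start(self):
        self._maybe_synchronize()
        self._start_time = time.time()

    def stop(self):
        self._maybe_synchronize()
        self.elapsed += time.time() - self._start_time
```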
Hi @loadams, all the requested changes are done. Could you please review and trigger the CI? Thanks
> @deepcharm, thanks for this interesting approach. Can you share some observed performance gains?

@tjruwase We have observed around a 9% performance gain on HPU in BERT workloads.
> Hi @deepcharm
>
> Thx for the PR. Just curious why allreduce could be faster than allgather? allreduce basically is doing reduce-scatter + all-gather. Could we just make allgather...
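The quoted point that all-reduce is effectively a reduce-scatter followed by an all-gather can be checked with plain tensor math. The toy simulation below is my own illustration (it does not use the PR's collectives): each rank is modeled as a tensor in a list, and both paths produce the same summed result on every rank.

```
import torch

def simulated_all_reduce(rank_tensors):
    # Every rank ends up with the elementwise sum over all ranks.
    total = torch.stack(rank_tensors).sum(dim=0)
    return [total.clone() for _ in rank_tensors]

def simulated_reduce_scatter_then_all_gather(rank_tensors):
    world_size = len(rank_tensors)
    chunks = [t.chunk(world_size) for t in rank_tensors]
    # Reduce-scatter: rank i reduces everyone's i-th chunk.
    reduced = [sum(c[i] for c in chunks) for i in range(world_size)]
    # All-gather: every rank reassembles the full reduced tensor.
    full = torch.cat(reduced)
    return [full.clone() for _ in rank_tensors]

torch.manual_seed(0)
ranks = [torch.randn(8) for _ in range(4)]
out_a = simulated_all_reduce(ranks)
out_b = simulated_reduce_scatter_then_all_gather(ranks)
assert all(torch.allclose(a, b) for a, b in zip(out_a, out_b))
```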
> > @deepcharm, I was not aware that narrow, cat, copy operations on device tensors incurred high CPU overhead. I would like to learn more. Can you share the reason?...
Hi @tjruwase, for some reason the PR has been removed from the merge-queue. Can you please re-add it? Thanks
A brute-force solution is to force `.requires_grad` to be `True` for the model input tensors:

```
class PostBackwardFunctionModule(torch.autograd.Function):
    @staticmethod
    def forward(ctx, output):
        ctx.module = module
        if not output.requires_grad:...
```
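A fuller, self-contained sketch of that brute-force idea (the helper below is my illustration, not the PR's actual code): a forward pre-hook flips `requires_grad` on the model's floating-point inputs, so that autograd-based post-backward hooks further down the module tree reliably fire.

```
import torch
import torch.nn as nn

def force_inputs_require_grad(model):
    """Hypothetical helper: make floating-point inputs require grad."""

    def _pre_hook(module, args):
        new_args = []
        for a in args:
            if torch.is_tensor(a) and a.is_floating_point() and not a.requires_grad:
                # Such a tensor is a leaf, so toggling requires_grad is legal.
                a = a.detach().requires_grad_(True)
            new_args.append(a)
        # Returning a tuple replaces the module's positional inputs.
        return tuple(new_args)

    return model.register_forward_pre_hook(_pre_hook)

model = nn.Linear(4, 4)
handle = force_inputs_require_grad(model)
x = torch.randn(2, 4)          # deliberately requires_grad=False
model(x).sum().backward()      # the input now participates in the graph
handle.remove()
```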
@eternalNight Thank you for the good catch! Updated the code per your request. Please let me know if that works.
Thanks for the detailed testing, super helpful! These errors match known PyTorch issues with Compiled Autograd + distributed/mixed precision:

1) Eager/bfloat16: This is a known PyTorch bug in torch.compile (PyTorch #152162/#161153),...
> @deepcharm the python tests all appear to fail for the same reason: incompatibility with `torch-cpu`. Is this what you are seeing?

@sfc-gh-truwase From the CI logs, the failures are...