Bo Tian

Results: 4 issues by Bo Tian

Hi folks, not sure if I'm doing anything wrong. I'm seeing a problem where the final models differ across ranks when training is interrupted. To reproduce: use the following...
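As a sanity check (a hedged sketch, not the elided repro above; `model` and an initialized default process group are assumed), one way to confirm whether the final replicas actually diverge is to gather a cheap parameter fingerprint from every rank and compare:

```python
# Hedged sketch for verifying cross-rank divergence; assumes
# torch.distributed is already initialized and `model` is the
# final model held by each rank.
import torch
import torch.distributed as dist

def params_fingerprint(model: torch.nn.Module) -> float:
    # Cheap, order-sensitive fingerprint: the sum of all parameter values.
    return sum(p.detach().float().sum().item() for p in model.parameters())

def check_replicas_match(model: torch.nn.Module) -> None:
    fingerprints = [None] * dist.get_world_size()
    # Gather every rank's fingerprint so rank 0 can compare them.
    dist.all_gather_object(fingerprints, params_fingerprint(model))
    if dist.get_rank() == 0:
        print("per-rank fingerprints:", fingerprints)
        if any(abs(f - fingerprints[0]) > 1e-6 for f in fingerprints):
            print("WARNING: replicas diverged")
```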

I saw this error when replacing ProcessGroupNCCL with ProcessGroupBabyNCCL in train_ddp.py. ProcessGroupNCCL and ProcessGroupGloo both work fine. How can I debug this?

```
ERROR:torchft.manager:[/0 - step 0] got exception in future...
```
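For context, a minimal sketch of the substitution that triggers this, following the pattern in torchft's train_ddp.py example (exact imports and Manager arguments may differ between torchft versions):

```python
# Hedged sketch of the process-group swap; API details may vary by version.
import torch
from torchft import ProcessGroupBabyNCCL, ProcessGroupGloo

# As I understand it, ProcessGroupBabyNCCL runs NCCL in a background
# subprocess, so collectives complete via futures and errors can surface
# asynchronously rather than at the call site.
pg = (
    ProcessGroupBabyNCCL()
    if torch.cuda.is_available()
    else ProcessGroupGloo()  # the configuration that works fine
)
# pg is then passed to the torchft Manager, e.g. Manager(pg=pg, ...).
```

Setting `NCCL_DEBUG=INFO` (and optionally `TORCH_CPP_LOG_LEVEL=INFO`) before launching may surface the underlying NCCL error behind the truncated future exception.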

When I use `torch.distributed.checkpoint.state_dict.get_optimizer_state_dict`, e.g.,

```python
optimizer_state_dict = get_optimizer_state_dict(
    model=self._model,
    optimizers=self._optimizer,
    options=StateDictOptions(
        full_state_dict=True,
        cpu_offload=True,
    ),
)
```

instead of `optimizer_state_dict = self._optimizer.state_dict()`, TorchFT gets stuck in the `should_commit()` method in manager.py. Why...
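One plausible explanation (an assumption, not confirmed from TorchFT internals): `self._optimizer.state_dict()` is purely local, while `get_optimizer_state_dict(..., full_state_dict=True)` issues collectives to gather the full optimizer state, so it blocks unless every rank in the default process group calls it at the same time. If TorchFT invokes the state_dict callback on only a subset of ranks, the gather would hang and `should_commit()` would never complete on the others. A sketch of the contrast:

```python
# Hedged sketch contrasting local vs. collective state-dict extraction.
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    get_optimizer_state_dict,
)

def local_state_dict(optimizer):
    # Purely local: safe to call on any single rank in isolation.
    return optimizer.state_dict()

def gathered_state_dict(model, optimizer):
    # Collective: every rank in the default process group must enter this
    # call together, otherwise the underlying gathers block indefinitely.
    return get_optimizer_state_dict(
        model=model,
        optimizers=optimizer,
        options=StateDictOptions(full_state_dict=True, cpu_offload=True),
    )
```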