Bo Tian

Results: 4 issues by Bo Tian

Hi folks, not sure if I'm doing anything wrong. I'm seeing a problem where the final models differ across ranks when training is interrupted. To reproduce: use the following...
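As a sanity check (a hedged sketch, not the elided repro above; `model` and an initialized default process group are assumed), one way to confirm whether the final replicas actually diverge is to gather a cheap parameter fingerprint from every rank and compare:

```python
# Hedged sketch for verifying cross-rank divergence; assumes
# torch.distributed is already initialized and `model` is the
# final model held by each rank.
import torch
import torch.distributed as dist

def params_fingerprint(model: torch.nn.Module) -> float:
    # Cheap, order-sensitive fingerprint: the sum of all parameter values.
    return sum(p.detach().float().sum().item() for p in model.parameters())

def check_replicas_match(model: torch.nn.Module) -> None:
    fingerprints = [None] * dist.get_world_size()
    # Gather every rank's fingerprint so rank 0 can compare them.
    dist.all_gather_object(fingerprints, params_fingerprint(model))
    if dist.get_rank() == 0:
        print("per-rank fingerprints:", fingerprints)
        if any(abs(f - fingerprints[0]) > 1e-6 for f in fingerprints):
            print("WARNING: replicas diverged")
```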

I saw this error when replacing ProcessGroupNCCL with ProcessGroupBabyNCCL in train_ddp.py. ProcessGroupNCCL and ProcessGroupGloo both work fine. How can I debug this?

```
ERROR:torchft.manager:[/0 - step 0] got exception in future...
```
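For context, a minimal sketch of the substitution that triggers this, following the pattern in torchft's train_ddp.py example (exact imports and Manager arguments may differ between torchft versions):

```python
# Hedged sketch of the process-group swap; API details may vary by version.
import torch
from torchft import ProcessGroupBabyNCCL, ProcessGroupGloo

# As I understand it, ProcessGroupBabyNCCL runs NCCL in a background
# subprocess, so collectives complete via futures and errors can surface
# asynchronously rather than at the call site.
pg = (
    ProcessGroupBabyNCCL()
    if torch.cuda.is_available()
    else ProcessGroupGloo()  # the configuration that works fine
)
# pg is then passed to the torchft Manager, e.g. Manager(pg=pg, ...).
```

Setting `NCCL_DEBUG=INFO` (and optionally `TORCH_CPP_LOG_LEVEL=INFO`) before launching may surface the underlying NCCL error behind the truncated future exception.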

When I use `torch.distributed.checkpoint.state_dict.get_optimizer_state_dict`, e.g.,

```python
optimizer_state_dict = get_optimizer_state_dict(
    model=self._model,
    optimizers=self._optimizer,
    options=StateDictOptions(
        full_state_dict=True,
        cpu_offload=True,
    ),
)
```

instead of `optimizer_state_dict = self._optimizer.state_dict()`, TorchFT gets stuck in the `should_commit()` method in manager.py. Why...
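One plausible explanation (an assumption, not confirmed from TorchFT internals): `self._optimizer.state_dict()` is purely local, while `get_optimizer_state_dict(..., full_state_dict=True)` issues collectives to gather the full optimizer state, so it blocks unless every rank in the default process group calls it at the same time. If TorchFT invokes the state_dict callback on only a subset of ranks, the gather would hang and `should_commit()` would never complete on the others. A sketch of the contrast:

```python
# Hedged sketch contrasting local vs. collective state-dict extraction.
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    get_optimizer_state_dict,
)

def local_state_dict(optimizer):
    # Purely local: safe to call on any single rank in isolation.
    return optimizer.state_dict()

def gathered_state_dict(model, optimizer):
    # Collective: every rank in the default process group must enter this
    # call together, otherwise the underlying gathers block indefinitely.
    return get_optimizer_state_dict(
        model=model,
        optimizers=optimizer,
        options=StateDictOptions(full_state_dict=True, cpu_offload=True),
    )
```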