
share inflight registry between PartitionedParameterCoordinators

Open HeyangQin opened this issue 2 years ago • 0 comments

This is a collaborative effort with the Lightning team to solve https://github.com/microsoft/DeepSpeed/issues/3068 and https://github.com/microsoft/DeepSpeed/issues/3156. Further discussion is at https://github.com/Lightning-AI/lightning/issues/17523.

There can be multiple PartitionedParameterCoordinator instances, but each one currently manages its parameters independently. Suppose we have PartitionedParameterCoordinator A and B. When A puts some parameters in flight, B is not aware of it, so when B tries to use those parameters it errors out. This PR addresses the issue by sharing the __InflightParamRegistry among all PartitionedParameterCoordinator instances. Unlike https://github.com/microsoft/DeepSpeed/pull/3380, this PR binds the registry to the model, so it does not break multi-model training.
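To illustrate the idea, here is a minimal sketch of a registry bound to the model and shared by every coordinator created for it. It is not the actual DeepSpeed code: `InflightRegistry`, `Model`, `Coordinator`, the `_inflight_registry` attribute name, and the string handles are all hypothetical stand-ins chosen for this example.

```python
class InflightRegistry(dict):
    """Hypothetical stand-in for the in-flight registry (parameter name -> handle)."""


class Model:
    """Hypothetical stand-in for a model object such as a torch.nn.Module."""


class Coordinator:
    """Simplified stand-in for PartitionedParameterCoordinator."""

    _ATTR = "_inflight_registry"  # hypothetical attribute name, for illustration only

    def __init__(self, model):
        # Reuse the registry already bound to this model, or bind a new one.
        if not hasattr(model, self._ATTR):
            setattr(model, self._ATTR, InflightRegistry())
        self.inflight = getattr(model, self._ATTR)

    def fetch(self, name):
        # Record the parameter as in flight (a string stands in for a real handle).
        self.inflight.setdefault(name, f"handle:{name}")

    def wait(self, name):
        # Any coordinator on the same model can complete the pending fetch.
        return self.inflight.pop(name)


model = Model()
a, b = Coordinator(model), Coordinator(model)
a.fetch("layer0.weight")
handle = b.wait("layer0.weight")  # B sees what A put in flight

# A coordinator for a different model gets its own registry.
other = Coordinator(Model())
shares_across_models = other.inflight is a.inflight
```

Because the registry lives on the model rather than on the coordinator class, coordinators for the same model see each other's in-flight parameters, while coordinators for different models stay isolated.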

HeyangQin, May 05 '23 18:05