Questions about the Sharing of optimizer states
I noticed that ColossalAI provides a few optimizers, such as `FusedLAMB`, `FusedAdam`, `FusedSGD`, `Lamb`, `Lars`, `CPUAdam`, and `HybridAdam`. These optimizers shard optimizer states based on the size of the parameters and gradients. My question is: if we are not using one of the optimizers provided by ColossalAI, do I need to rewrite my optimizer so that it shards its optimizer states? Or is that unnecessary because, as long as the parameters and gradients are already sharded, the optimizer states will be sharded automatically?
I also saw a comment in the code: "Inner optimizer must support optimizing hybrid (CPU and CUDA) tensors, and it must set `num_fp32_shards_per_param` correctly." I feel this requirement only applies if we want to switch `tensor_placement_policy` between "cpu"/"auto" and "cuda"; otherwise, we don't need the optimizer to support hybrid tensors, right?
- Yes, optimizer states will be sharded automatically.
- If `tensor_placement_policy` is "cpu" or "cuda", we don't need the optimizer to support hybrid tensors.
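
For reference, here is a minimal sketch of the second point with a plain `torch.optim.Adam` as the inner optimizer and a pure "cuda" placement policy. It assumes the `ZeroInitContext` / `ShardedModelV2` / `ShardedOptimizerV2` API that the quoted comment comes from; import paths and argument names may differ between ColossalAI releases, so treat it as an illustration rather than exact usage:

```python
import torch
import torch.nn as nn

# API names below are assumptions based on the ZeRO module this comment
# appears in; they may differ across ColossalAI versions.
from colossalai.zero.init_ctx import ZeroInitContext
from colossalai.zero.shard_utils import TensorShardStrategy
from colossalai.zero.sharded_model import ShardedModelV2
from colossalai.zero.sharded_optim import ShardedOptimizerV2

# Assumes the distributed environment has already been set up
# (e.g. via colossalai.launch).
shard_strategy = TensorShardStrategy()

# Parameters are sharded as the model is built inside the context.
with ZeroInitContext(target_device=torch.cuda.current_device(),
                     shard_strategy=shard_strategy,
                     shard_param=True):
    model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))

# With tensor_placement_policy="cuda", all tensors stay on GPU, so the
# inner optimizer never has to optimize hybrid (CPU + CUDA) tensors.
model = ShardedModelV2(model, shard_strategy, tensor_placement_policy='cuda')

# A plain PyTorch optimizer can be used as the inner optimizer; its states
# are created from the already-sharded parameters, so they are sharded too.
inner_optim = torch.optim.Adam(model.parameters(), lr=1e-3)
optimizer = ShardedOptimizerV2(model, inner_optim)
```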
Great. Thank you very much!