
[QUESTION] Is FP32 supported in MultiNode Training

JiwenJ opened this issue on May 24, 2024

We plan to fine-tune a model (11B) in Megatron-LM, sharded with tp=4, pp=16. We want to fine-tune in fp32 rather than fp16 or bf16, but training fails with the following error:

  File "/work/home/sx_mitee/Megatron-LM/megatron/core/optimizer/__init__.py", line 252, in _get_megatron_optimizer_based_on_param_groups
    return FP32Optimizer(optimizer, config, init_state_fn,)
TypeError: Can't instantiate abstract class FP32Optimizer with abstract methods sharded_state_dict

JiwenJ commented May 24 '24 07:05

Stay tuned, we have a bugfix in the works.

deepakn94 commented May 24 '24 15:05

I tried the flag --use-distributed-optimizer, and the error no longer appears. So is --use-distributed-optimizer a must for multi-node training in fp32?

JiwenJ commented May 25 '24 04:05

Should work now: https://github.com/NVIDIA/Megatron-LM/commit/020b51796e1302bc91a30154b665a4d0afc59dd6.

deepakn94 commented May 29 '24 05:05

Going to close this, please re-open if you are still seeing issues.

deepakn94 commented Jun 01 '24 16:06