
[QUESTION] Is FP32 supported in MultiNode Training

JiwenJ opened this issue on May 24, 2024

We plan to fine-tune a model (11B) in Megatron-LM, sharded with tp=4, pp=16. We want to fine-tune in fp32 rather than fp16 or bf16, but training fails with the following error:

  File "/work/home/sx_mitee/Megatron-LM/megatron/core/optimizer/__init__.py", line 252, in _get_megatron_optimizer_based_on_param_groups
    return FP32Optimizer(optimizer, config, init_state_fn,)
TypeError: Can't instantiate abstract class FP32Optimizer with abstract methods sharded_state_dict

JiwenJ commented May 24 '24 07:05

Stay tuned, we have a bugfix in the works.

deepakn94 commented May 24 '24 15:05

I tried the flag --use-distributed-optimizer, and the error no longer appears. So is --use-distributed-optimizer a must for multi-node training in fp32?

JiwenJ commented May 25 '24 04:05

Should work now: https://github.com/NVIDIA/Megatron-LM/commit/020b51796e1302bc91a30154b665a4d0afc59dd6.

deepakn94 commented May 29 '24 05:05

Going to close this, please re-open if you are still seeing issues.

deepakn94 commented Jun 01 '24 16:06