Megatron-LM
[QUESTION] Is FP32 supported in multi-node training?
We plan to finetune an 11B model in Megatron-LM, sharded with tp=4 and pp=16. We want to finetune in fp32 rather than fp16 or bf16, but training fails with the following error:
```
  File "/work/home/sx_mitee/Megatron-LM/megatron/core/optimizer/__init__.py", line 252, in _get_megatron_optimizer_based_on_param_groups
    return FP32Optimizer(optimizer, config, init_state_fn,)
TypeError: Can't instantiate abstract class FP32Optimizer with abstract methods sharded_state_dict
```
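The TypeError itself is standard Python abstract-base-class behaviour: the optimizer base class declares sharded_state_dict as an abstract method, so a subclass that does not override it cannot be instantiated. A minimal sketch of the mechanism in plain Python; the class names below are illustrative stand-ins, not Megatron's actual classes:

```python
from abc import ABC, abstractmethod

class BaseOptimizer(ABC):
    """Stand-in for Megatron's optimizer base class (illustrative)."""

    @abstractmethod
    def sharded_state_dict(self, model_sharded_state_dict):
        """Subclasses must provide this for distributed checkpointing."""
        ...

class BrokenFP32Optimizer(BaseOptimizer):
    # No sharded_state_dict override -> the class remains abstract.
    pass

class FixedFP32Optimizer(BaseOptimizer):
    def sharded_state_dict(self, model_sharded_state_dict):
        # Minimal placeholder implementation.
        return {}

FixedFP32Optimizer()   # instantiates fine
BrokenFP32Optimizer()  # TypeError: Can't instantiate abstract class ...
```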
Stay tuned; we have a bugfix in the works.
I tried the flag --use-distributed-optimizer, and the error no longer appears (sketch below). So is --use-distributed-optimizer a must for multi-node training in fp32?
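For reference, a hypothetical launch fragment showing the flags in question; the script name, node/GPU counts, and all other required Megatron-LM arguments are placeholders and omitted. Leaving out --fp16/--bf16 is what keeps parameters in fp32, and --use-distributed-optimizer was the workaround that avoided the FP32Optimizer error before the fix:

```bash
# Illustrative fragment only; other required Megatron-LM arguments omitted.
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 16 \
    --use-distributed-optimizer
```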
Should work now: https://github.com/NVIDIA/Megatron-LM/commit/020b51796e1302bc91a30154b665a4d0afc59dd6.
Going to close this; please re-open if you are still seeing issues.