[QUESTION] Why is weights_only=False used when loading a checkpoint?
In the commit below, I noticed that weights_only is set to False when the checkpoint is loaded with torch.load. Why is that? https://github.com/NVIDIA/Megatron-LM/commit/eee2bc9a74ba9cba70d8fbe0e7384d1ea243f904
I think this is because it needs to load additional metadata stored in the checkpoint, not just tensors. You can refer to load_checkpoint in megatron/training/checkpointing.py to see how the returned state_dict is used.
Torch 2.6+ does not support loading these checkpoints with weights_only=True.
In Torch 2.6, weights_only defaults to True to avoid remote code execution (RCE). However, are there any security issues if we explicitly set weights_only=False here? For example, a malicious checkpoint file could contain code that triggers arbitrary code execution when unpickled.
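To illustrate the trade-off being discussed, here is a hypothetical sketch (not Megatron's actual code): a checkpoint that stores plain Python objects alongside tensors is refused by the safe unpickler under weights_only=True, while weights_only=False loads it via the full pickle machinery. The `TrainingArgs` class is an invented stand-in for the checkpoint metadata.

```python
# Hypothetical sketch: why a checkpoint carrying non-tensor Python objects
# fails under weights_only=True.
import io

import torch


class TrainingArgs:
    """Invented stand-in for the non-tensor metadata (args, RNG state, ...)
    that Megatron-style checkpoints carry alongside the weights."""
    def __init__(self, lr):
        self.lr = lr


buf = io.BytesIO()
torch.save({"model": torch.zeros(2), "args": TrainingArgs(lr=1e-4)}, buf)

buf.seek(0)
rejected = False
try:
    # The safe unpickler only accepts tensors, primitive containers, and
    # allowlisted types, so the TrainingArgs instance is refused.
    torch.load(buf, weights_only=True)
except Exception:
    rejected = True

buf.seek(0)
# weights_only=False uses the full pickle machinery: it loads everything,
# but a malicious file could run arbitrary code during unpickling.
state = torch.load(buf, weights_only=False)
print(rejected, state["args"].lr)
```

This is exactly the concern raised above: weights_only=False is convenient for trusted checkpoints but unsafe for files from untrusted sources.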
@cnspary you can allowlist specific classes if needed while keeping weights_only=True. More info here: https://github.com/NVIDIA/Megatron-LM/blob/main/docs/source/api-guide/dist_checkpointing.rst#safe-checkpoint-loading