Cannot load reward model from SFT model because of missing keys
I converted a LLaMA model to NeMo, with the model directory laid out as below:
When I tried to load it to train a reward model, I got a missing-keys error. I load it with the default config and set
load_base_model_only=True; the full loading code is below:
ptl_model = load_from_nemo(
    reward_model_cls,            # reward model class to instantiate
    cfg.model,                   # default model config
    trainer,
    strict=True,
    load_base_model_only=True,   # intended to restore only the base (non-reward) weights
    restore_path=cfg.pretrained_checkpoint.restore_from_path,
)
I then got the error below. Any advice on how to load a pretrained non-reward model and train it as a reward model in NeMo?
Error executing job with overrides: []
Traceback (most recent call last):
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 206, in load_sharded_object
    loaded_obj = torch.load(load_path)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 998, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 445, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 426, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights/model.rm_head._extra_state/shard_0_1.pt'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/checkpoint/binary/train_package/train_reward_model.py", line 68, in main
    ptl_model = load_from_nemo(
  File "/checkpoint/binary/train_package/nemo_aligner/utils/utils.py", line 96, in load_from_nemo
    model = cls.restore_from(
  File "/checkpoint/binary/train_package/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
    return super().restore_from(
  File "/checkpoint/binary/train_package/nemo/core/classes/modelPT.py", line 450, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/checkpoint/binary/train_package/nemo_aligner/utils/utils.py", line 52, in restore_from
    output = super().restore_from(*args, **kwargs)
  File "/checkpoint/binary/train_package/nemo/collections/nlp/parts/nlp_overrides.py", line 1123, in restore_from
    checkpoint = dist_checkpointing.load(
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 120, in load
    sharded_objects, sharded_state_dict = load_sharded_objects(
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 221, in load_sharded_objects
    return dict_list_map_inplace(load_sharded_object, sharded_objects), sharded_state_dict
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 184, in dict_list_map_inplace
    return f(x)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 218, in load_sharded_object
    raise CheckpointingException(err_msg) from e
megatron.core.dist_checkpointing.core.CheckpointingException: Object shard /mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights/model.rm_head._extra_state/shard_0_1.pt not found
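For reference, here is a minimal sketch for checking what the converted checkpoint actually contains. It only assumes the checkpoint path from the error above and the Python standard library (it is not NeMo API); judging from the path in the error, each sharded tensor/object appears to live in its own subdirectory under model_weights/:

import pprint
from pathlib import Path

# Path taken from the error message above (assumption: this is the converted checkpoint).
ckpt_dir = Path("/mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights")

# List the per-parameter shard directories stored in the distributed checkpoint.
entries = sorted(p.name for p in ckpt_dir.iterdir() if p.is_dir())
pprint.pprint(entries)

If no model.rm_head.* entries show up, the converted base/SFT checkpoint carries no reward-head state at all, which is consistent with the FileNotFoundError for model.rm_head._extra_state/shard_0_1.pt above.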