NeMo cannot load reward model from SFT model because of missing keys

I converted a LLaMA model to NeMo format. When I tried to load it to train a reward model, I got a missing-keys error. I load it with the default config and set load_base_model_only=True. The full loading code is:

ptl_model = load_from_nemo(
    reward_model_cls,
    cfg.model,
    trainer,
    strict=True,
    load_base_model_only=True,
    restore_path=cfg.pretrained_checkpoint.restore_from_path,
)

I then got the error below. Any advice on how to load a pretrained non-reward model so it can be trained as a reward model in NeMo?

Error executing job with overrides: []
Traceback (most recent call last):
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 206, in load_sharded_object
    loaded_obj = torch.load(load_path)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 998, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 445, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 426, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights/model.rm_head._extra_state/shard_0_1.pt'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/checkpoint/binary/train_package/train_reward_model.py", line 68, in main
    ptl_model = load_from_nemo(
  File "/checkpoint/binary/train_package/nemo_aligner/utils/utils.py", line 96, in load_from_nemo
    model = cls.restore_from(
  File "/checkpoint/binary/train_package/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
    return super().restore_from(
  File "/checkpoint/binary/train_package/nemo/core/classes/modelPT.py", line 450, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/checkpoint/binary/train_package/nemo_aligner/utils/utils.py", line 52, in restore_from
    output = super().restore_from(*args, **kwargs)
  File "/checkpoint/binary/train_package/nemo/collections/nlp/parts/nlp_overrides.py", line 1123, in restore_from
    checkpoint = dist_checkpointing.load(
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 120, in load
    sharded_objects, sharded_state_dict = load_sharded_objects(
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 221, in load_sharded_objects
    return dict_list_map_inplace(load_sharded_object, sharded_objects), sharded_state_dict
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 184, in dict_list_map_inplace
    return f(x)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 218, in load_sharded_object
    raise CheckpointingException(err_msg) from e
megatron.core.dist_checkpointing.core.CheckpointingException: Object shard /mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights/model.rm_head._extra_state/shard_0_1.pt not found
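
The traceback shows the loader looking for model.rm_head._extra_state under model_weights, which a converted base checkpoint would not be expected to contain. One way to confirm what is actually on disk is a small standard-library check like the sketch below. The helper names missing_entries and entries_with_prefix are hypothetical and not part of NeMo or Megatron. The sketch assumes each checkpoint key is stored as an entry named after the key directly under model_weights, as the path in the error suggests.

```python
from pathlib import Path


def missing_entries(weights_dir, expected_keys):
    """Return the expected keys that have no same-named entry in weights_dir."""
    root = Path(weights_dir)
    # A non-existent directory is treated as containing nothing.
    present = {p.name for p in root.iterdir()} if root.is_dir() else set()
    return [key for key in expected_keys if key not in present]


def entries_with_prefix(weights_dir, prefix):
    """Return the sorted names of entries in weights_dir starting with prefix."""
    root = Path(weights_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.name.startswith(prefix))


if __name__ == "__main__":
    # Path taken from the error message above; adjust for your setup.
    weights = "/mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights"
    print("rm_head entries:", entries_with_prefix(weights, "model.rm_head"))
    print("missing:", missing_entries(weights, ["model.rm_head._extra_state"]))
```

If no model.rm_head entries are listed, the checkpoint holds only base-model weights, and the failure comes from the loader still requesting the reward head's state rather than from a corrupted conversion.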

Apr 02 '24 07:04 DZ9

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

May 03 '24 01:05 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

May 10 '24 01:05 github-actions[bot]