
Error converting T5 .pt model to .nemo model

Open matteo-zola opened this issue 3 years ago • 1 comments

We are trying to convert a .pt model to .nemo for prompt learning, following this example: examples/nlp/language_modeling/megatron_t5_prompt_learning.py. We trained a T5 model from scratch with the Megatron-LM codebase (version 3.0); the NeMo code comes from the docker image we used, which ships nemo-toolkit 1.10.0. The file megatron_t5_prompt_learning.py says to use megatron_ckpt_to_nemo.py or megatron_lm_ckpt_to_nemo.py to convert a model from .pt to .nemo, but neither of the two scripts works.

With this command:

python -m torch.distributed.launch --nproc_per_node=1 megatron_lm_ckpt_to_nemo.py \
--checkpoint_folder /workspace/datadrive/t5_checkpoint/iter_0896000  \
--checkpoint_name model_optim_rng.pt \
--nemo_file_path /workspace/datadrive/t5_checkpoint/ \
--model_type t5 \
--tensor_model_parallel_size 1 \
--pipeline_model_parallel_size 1 \
--gpus_per_node 1

I get this error:

root@693cc90da271:/workspace/nemo/examples/nlp/language_modeling# python -m torch.distributed.launch --nproc_per_node=1 megatron_lm_ckpt_to_nemo.py \
> --checkpoint_folder /workspace/datadrive/t5_checkpoint/iter_0896000  \
> --checkpoint_name model_optim_rng.pt \
> --nemo_file_path /workspace/datadrive/t5_checkpoint/ \
> --model_type t5 \
> --tensor_model_parallel_size 1 \
> --pipeline_model_parallel_size 1 \
> --gpus_per_node 1
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
/opt/conda/lib/python3.8/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
[NeMo W 2022-08-17 20:49:04 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-08-17 20:49:04 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo I 2022-08-17 20:49:04 distributed:31] Initializing torch.distributed with local_rank: 0, rank: 0, world_size: 1
[NeMo W 2022-08-17 20:49:06 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/utilities.py:91: PossibleUserWarning: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
      rank_zero_warn(
    
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2022-08-17 20:49:06 megatron_lm_ckpt_to_nemo:387] loading checkpoint /workspace/datadrive/t5_checkpoint/iter_0896000/model_optim_rng.pt
Traceback (most recent call last):
  File "megatron_lm_ckpt_to_nemo.py", line 478, in <module>
    convert(local_rank, rank, world_size, args)
  File "megatron_lm_ckpt_to_nemo.py", line 420, in convert
    raise NotImplemented("{} is not supported".format(args.model_type))
TypeError: 'NotImplementedType' object is not callable
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 911) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
megatron_lm_ckpt_to_nemo.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-17_20:49:09
  host      : 693cc90da271
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 911)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
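A note on the traceback above: the TypeError is a quirk of the converter script itself. convert() raises the built-in constant NotImplemented, which is not callable, instead of the NotImplementedError exception, so the underlying message (that model_type t5 is not handled by megatron_lm_ckpt_to_nemo.py) gets masked. A minimal standalone Python sketch of the difference, not the script's actual code:

# NotImplemented is a sentinel meant for rich-comparison methods and cannot be called,
# so `raise NotImplemented("...")` itself dies with
# TypeError: 'NotImplementedType' object is not callable
model_type = "t5"

# what the converter effectively does (reproduces the TypeError seen above):
# raise NotImplemented("{} is not supported".format(model_type))

# what was presumably intended, which would surface the real message:
raise NotImplementedError("{} is not supported".format(model_type))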

With this other command:

export PYTHONPATH=$PYTHONPATH:/workspace/datadrive/Megatron-LM

python -m torch.distributed.launch --nproc_per_node=1 \
megatron_ckpt_to_nemo.py \
--gpus_per_node 1 \
--model_type t5 \
--checkpoint_folder /workspace/datadrive/t5_checkpoint/iter_0896000 \
--checkpoint_name model_optim_rng.pt \
--nemo_file_path /workspace/datadrive/t5_nemo \
--tensor_model_parallel_size 1 \
--pipeline_model_parallel_size 1 

I get this error:

root@693cc90da271:/workspace/nemo/examples/nlp/language_modeling# export PYTHONPATH=$PYTHONPATH:/workspace/datadrive/Megatron-LM
root@693cc90da271:/workspace/nemo/examples/nlp/language_modeling# python -m torch.distributed.launch --nproc_per_node=1 megatron_ckpt_to_nemo.py --gpus_per_node 1 --model_type t5 --checkpoint_folder /workspace/datadrive/t5_checkpoint/iter_0896000 --checkpoint_name model_optim_rng.pt --nemo_file_path /workspace/datadrive/t5_nemo --tensor_model_parallel_size 1 --pipeline_model_parallel_size 1 
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
/opt/conda/lib/python3.8/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
[NeMo W 2022-08-17 20:47:57 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-08-17 20:47:57 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo I 2022-08-17 20:47:57 distributed:31] Initializing torch.distributed with local_rank: 0, rank: 0, world_size: 1
[NeMo W 2022-08-17 20:47:58 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/utilities.py:91: PossibleUserWarning: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
      rank_zero_warn(
    
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2022-08-17 20:47:58 megatron_ckpt_to_nemo:111] rank: 0, local_rank: 0, is loading checkpoint: /workspace/datadrive/t5_checkpoint/iter_0896000/model_optim_rng.pt for tp_rank: 0 and pp_rank: 0
Traceback (most recent call last):
  File "megatron_ckpt_to_nemo.py", line 144, in <module>
    convert(local_rank, rank, world_size, args)
  File "megatron_ckpt_to_nemo.py", line 122, in convert
    model = MegatronT5Model.load_from_checkpoint(checkpoint_path, hparams_file=args.hparams_file, trainer=trainer)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/nlp_model.py", line 354, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, strict=strict, cfg=cfg, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 203, in _load_model_state
    model = cls(**_cls_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modeling/megatron_t5_model.py", line 35, in __init__
    super().__init__(cfg, trainer=trainer)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py", line 72, in __init__
    super().__init__(cfg, trainer=trainer)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 61, in __init__
    super().__init__(cfg, trainer=trainer, no_lm_init=no_lm_init)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/nlp_model.py", line 98, in __init__
    super().__init__(cfg, trainer)
  File "/opt/conda/lib/python3.8/site-packages/nemo/core/classes/modelPT.py", line 97, in __init__
    cfg = model_utils.convert_model_config_to_dict_config(cfg)
  File "/opt/conda/lib/python3.8/site-packages/nemo/utils/model_utils.py", line 393, in convert_model_config_to_dict_config
    raise ValueError(f"cfg constructor argument must be of type DictConfig/dict but got {type(cfg)} instead.")
ValueError: cfg constructor argument must be of type DictConfig/dict but got <class 'dict'> instead.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 690) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
megatron_ckpt_to_nemo.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-17_20:48:03
  host      : 693cc90da271
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 690)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
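A note on this second traceback: convert_model_config_to_dict_config rejects the plain Python dict recovered from the Megatron-LM checkpoint even though its message lists dict as acceptable; NeMo model constructors expect an OmegaConf DictConfig built from a NeMo-style config. A minimal sketch of the expected type (the keys are made up for illustration, and wrapping the dict like this on its own would not make the conversion succeed):

from omegaconf import OmegaConf, DictConfig

# a plain dict, e.g. hyperparameters pulled out of a Megatron-LM checkpoint
raw_cfg = {"hidden_size": 768, "num_layers": 12}  # illustrative keys only

# NeMo's ModelPT constructor wants an OmegaConf DictConfig, not a raw dict
cfg: DictConfig = OmegaConf.create(raw_cfg)
print(type(cfg))  # <class 'omegaconf.dictconfig.DictConfig'>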


Environment overview

  • I’m running the commands on a VM in GCP with A100 GPUs
  • I’m using NeMo from a docker image:
  • sudo docker run --shm-size=256g --gpus all -it --rm -v ~/datadrive:/workspace/datadrive nvcr.io/nvidia/nemo:22.05
  • the code snippets are taken directly from the corresponding .py files

matteo-zola · Aug 17 '22 21:08

You need to use megatron_lm_ckpt_to_nemo.py to run the conversion. Currently only GPT and BERT models are supported; for T5, we don't have the implementation yet.

yidong72 · Aug 19 '22 14:08

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] · Oct 06 '22 02:10