Error converting T5 .pt model to .nemo model
We are trying to convert a .pt model to .nemo for prompt learning, following this example: examples/nlp/language_modeling/megatron_t5_prompt_learning.py. We trained a T5 model from scratch with the Megatron-LM codebase (version 3.0); the NeMo code comes from the docker image we used, which ships nemo-toolkit 1.10.0. The file megatron_t5_prompt_learning.py says to use megatron_ckpt_to_nemo.py or megatron_lm_ckpt_to_nemo.py to convert a model from .pt to .nemo, but neither of them works for us.
With this command:
python -m torch.distributed.launch --nproc_per_node=1 megatron_lm_ckpt_to_nemo.py \
--checkpoint_folder /workspace/datadrive/t5_checkpoint/iter_0896000 \
--checkpoint_name model_optim_rng.pt \
--nemo_file_path /workspace/datadrive/t5_checkpoint/ \
--model_type t5 \
--tensor_model_parallel_size 1 \
--pipeline_model_parallel_size 1 \
--gpus_per_node 1
I get this error:
root@693cc90da271:/workspace/nemo/examples/nlp/language_modeling# python -m torch.distributed.launch --nproc_per_node=1 megatron_lm_ckpt_to_nemo.py \
> --checkpoint_folder /workspace/datadrive/t5_checkpoint/iter_0896000 \
> --checkpoint_name model_optim_rng.pt \
> --nemo_file_path /workspace/datadrive/t5_checkpoint/ \
> --model_type t5 \
> --tensor_model_parallel_size 1 \
> --pipeline_model_parallel_size 1 \
> --gpus_per_node 1
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
/opt/conda/lib/python3.8/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
[NeMo W 2022-08-17 20:49:04 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-08-17 20:49:04 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo I 2022-08-17 20:49:04 distributed:31] Initializing torch.distributed with local_rank: 0, rank: 0, world_size: 1
[NeMo W 2022-08-17 20:49:06 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/utilities.py:91: PossibleUserWarning: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
rank_zero_warn(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2022-08-17 20:49:06 megatron_lm_ckpt_to_nemo:387] loading checkpoint /workspace/datadrive/t5_checkpoint/iter_0896000/model_optim_rng.pt
Traceback (most recent call last):
  File "megatron_lm_ckpt_to_nemo.py", line 478, in <module>
    convert(local_rank, rank, world_size, args)
  File "megatron_lm_ckpt_to_nemo.py", line 420, in convert
    raise NotImplemented("{} is not supported".format(args.model_type))
TypeError: 'NotImplementedType' object is not callable
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 911) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
megatron_lm_ckpt_to_nemo.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-08-17_20:49:09
host : 693cc90da271
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 911)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
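A side note on this first traceback (our reading of it, not something the script documents): the confusing TypeError hides the script's intended message. In Python, NotImplemented is a singleton constant used for binary-operator dispatch, not an exception class, so raise NotImplemented("...") fails before the message is ever printed; raise NotImplementedError("...") is presumably what line 420 meant, and it would have reported that t5 is not supported. A minimal reproduction:

# 'NotImplemented' is a constant, not an exception class, so "calling" it
# raises a TypeError before the intended message is ever shown.
try:
    raise NotImplemented("t5 is not supported")
except TypeError as err:
    print(err)  # 'NotImplementedType' object is not callable

# What the script presumably intended, which surfaces the real reason:
try:
    raise NotImplementedError("t5 is not supported")
except NotImplementedError as err:
    print(err)  # t5 is not supported

So even with the TypeError fixed, megatron_lm_ckpt_to_nemo.py would still reject --model_type t5; the underlying limitation is the missing T5 branch, not the launch command.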
With this other command:
export PYTHONPATH=$PYTHONPATH:/workspace/datadrive/Megatron-LM
python -m torch.distributed.launch --nproc_per_node=1 \
megatron_ckpt_to_nemo.py \
--gpus_per_node 1 \
--model_type t5 \
--checkpoint_folder /workspace/datadrive/t5_checkpoint/iter_0896000 \
--checkpoint_name model_optim_rng.pt \
--nemo_file_path /workspace/datadrive/t5_nemo \
--tensor_model_parallel_size 1 \
--pipeline_model_parallel_size 1
I get this error:
root@693cc90da271:/workspace/nemo/examples/nlp/language_modeling# export PYTHONPATH=$PYTHONPATH:/workspace/datadrive/Megatron-LM
root@693cc90da271:/workspace/nemo/examples/nlp/language_modeling# python -m torch.distributed.launch --nproc_per_node=1 megatron_ckpt_to_nemo.py --gpus_per_node 1 --model_type t5 --checkpoint_folder /workspace/datadrive/t5_checkpoint/iter_0896000 --checkpoint_name model_optim_rng.pt --nemo_file_path /workspace/datadrive/t5_nemo --tensor_model_parallel_size 1 --pipeline_model_parallel_size 1
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
/opt/conda/lib/python3.8/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
[NeMo W 2022-08-17 20:47:57 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-08-17 20:47:57 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo I 2022-08-17 20:47:57 distributed:31] Initializing torch.distributed with local_rank: 0, rank: 0, world_size: 1
[NeMo W 2022-08-17 20:47:58 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/utilities.py:91: PossibleUserWarning: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
rank_zero_warn(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2022-08-17 20:47:58 megatron_ckpt_to_nemo:111] rank: 0, local_rank: 0, is loading checkpoint: /workspace/datadrive/t5_checkpoint/iter_0896000/model_optim_rng.pt for tp_rank: 0 and pp_rank: 0
Traceback (most recent call last):
  File "megatron_ckpt_to_nemo.py", line 144, in <module>
    convert(local_rank, rank, world_size, args)
  File "megatron_ckpt_to_nemo.py", line 122, in convert
    model = MegatronT5Model.load_from_checkpoint(checkpoint_path, hparams_file=args.hparams_file, trainer=trainer)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/nlp_model.py", line 354, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, strict=strict, cfg=cfg, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 203, in _load_model_state
    model = cls(**_cls_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modeling/megatron_t5_model.py", line 35, in __init__
    super().__init__(cfg, trainer=trainer)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py", line 72, in __init__
    super().__init__(cfg, trainer=trainer)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 61, in __init__
    super().__init__(cfg, trainer=trainer, no_lm_init=no_lm_init)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/nlp_model.py", line 98, in __init__
    super().__init__(cfg, trainer)
  File "/opt/conda/lib/python3.8/site-packages/nemo/core/classes/modelPT.py", line 97, in __init__
    cfg = model_utils.convert_model_config_to_dict_config(cfg)
  File "/opt/conda/lib/python3.8/site-packages/nemo/utils/model_utils.py", line 393, in convert_model_config_to_dict_config
    raise ValueError(f"cfg constructor argument must be of type DictConfig/dict but got {type(cfg)} instead.")
ValueError: cfg constructor argument must be of type DictConfig/dict but got <class 'dict'> instead.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 690) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
megatron_ckpt_to_nemo.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-08-17_20:48:03
host : 693cc90da271
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 690)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
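For this second traceback, the failure happens before any T5-specific logic: the MegatronT5Model constructor receives the checkpoint's configuration as a plain Python dict, and NeMo's convert_model_config_to_dict_config only accepts an OmegaConf DictConfig (despite the wording of the error message). A minimal sketch of the type mismatch, with made-up config values only for illustration:

from omegaconf import DictConfig, OmegaConf

# Hypothetical config values; this only illustrates the type check that fails.
plain_cfg = {"hidden_size": 1024, "num_layers": 24}
print(isinstance(plain_cfg, DictConfig))  # False -> triggers the ValueError

cfg = OmegaConf.create(plain_cfg)         # wrap the dict in a DictConfig
print(isinstance(cfg, DictConfig))        # True -> would pass the check

The sketch only illustrates the failing type check; it is not a tested fix for converting this checkpoint.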
Environment overview
- I’m running the commands on a VM in GCP with A100 GPUs
- I’m using NeMo from a docker image:
- sudo docker run --shm-size=256g --gpus all -it --rm -v ~/datadrive:/workspace/datadrive nvcr.io/nvidia/nemo:22.05
- the code snippets above are taken directly from the corresponding .py files
You need to use megatron_lm_ckpt_to_nemo.py to run the conversion, and currently it only supports GPT and BERT models. For T5, we don't have the implementation yet.