TransformerLayer MLP parameters are not being set during model initialization
Describe the bug
The transformer layer MLP always falls back to the default values for bias, activation, and normalization when model.mcore_gpt=False, model.transformer_engine=True, and model.megatron_amp_O2=True.
A fix is implemented here: https://github.com/NVIDIA/NeMo/pull/8845
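For illustration, here is a minimal sketch of the failing pattern, written directly against TransformerEngine's LayerNormMLP (which is what the debug lines below inspect) rather than NeMo's wrapper classes. The hidden sizes and the "swiglu" activation string are illustrative assumptions, not NeMo's actual call path, and constructing TE modules needs a GPU:

# Illustrative sketch only -- not NeMo's actual construction path.
import transformer_engine.pytorch as te

hidden_size, ffn_hidden_size = 768, 3072

# What the buggy path effectively does: the user's settings are dropped and the
# TE defaults apply (bias=True, normalization="LayerNorm", activation="gelu").
mlp_buggy = te.LayerNormMLP(hidden_size, ffn_hidden_size)

# What should happen: forward the configured values explicitly.
mlp_fixed = te.LayerNormMLP(
    hidden_size,
    ffn_hidden_size,
    bias=False,               # model.bias=false
    normalization="RMSNorm",  # model.normalization=rmsnorm
    activation="swiglu",      # model.activation=fast-swiglu (TE spelling assumed)
)

print(mlp_buggy.activation, mlp_buggy.use_bias, mlp_buggy.normalization)  # gelu True LayerNorm
print(mlp_fixed.activation, mlp_fixed.use_bias, mlp_fixed.normalization)  # swiglu False RMSNorm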
Steps/Code to reproduce bug
Add the following lines to NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py after model initialization:
logging.warning(f"DEBUG: layernorm_mlp.activation={model.model.module.language_model.encoder.layers._modules['0'].layernorm_mlp.activation}")
logging.warning(f"DEBUG: layernorm_mlp.use_bias={model.model.module.language_model.encoder.layers._modules['0'].layernorm_mlp.use_bias}")
logging.warning(f"DEBUG: layernorm_mlp.normalization={model.model.module.language_model.encoder.layers._modules['0'].layernorm_mlp.normalization}")
logging.warning(f"DEBUG: layernorm_mlp.layernorm_mlp.fc1_weight.shape={model.model.module.language_model.encoder.layers._modules['0'].layernorm_mlp.fc1_weight.shape}")
logging.warning(f"DEBUG: layernorm_mlp.layernorm_mlp.fc2_weight.shape={model.model.module.language_model.encoder.layers._modules['0'].layernorm_mlp.fc2_weight.shape}")
Run the following script with and without the fix applied:
#!/bin/bash
python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
model.mcore_gpt=False \
model.transformer_engine=True \
trainer.precision=bf16 \
model.megatron_amp_O2=True \
model.activation=fast-swiglu \
model.bias=false \
model.normalization=rmsnorm
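To compare the two runs without scrolling through the full training logs, the output can be filtered down to the debug lines added above; repro.sh is a placeholder name for the script, and 2>&1 merges stderr into the pipe so the warnings are captured regardless of which stream the logger uses:

bash repro.sh 2>&1 | grep "DEBUG: layernorm_mlp"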
Expected behavior
The values for layernorm_mlp are set correctly:
[NeMo W 2024-04-08 09:30:49 megatron_gpt_pretraining:42] DEBUG: layernorm_mlp.activation=fast-swiglu
[NeMo W 2024-04-08 09:30:49 megatron_gpt_pretraining:43] DEBUG: layernorm_mlp.use_bias=False
[NeMo W 2024-04-08 09:30:49 megatron_gpt_pretraining:44] DEBUG: layernorm_mlp.normalization=RMSNorm
[NeMo W 2024-04-08 09:30:49 megatron_gpt_pretraining:45] DEBUG: layernorm_mlp.layernorm_mlp.fc1_weight.shape=torch.Size([3072, 768])
[NeMo W 2024-04-08 09:30:49 megatron_gpt_pretraining:46] DEBUG: layernorm_mlp.layernorm_mlp.fc2_weight.shape=torch.Size([768, 3072])
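If a hard check is preferred over eyeballing the warnings, a hypothetical helper (not part of NeMo) can assert the same attributes against the values above:

# Hypothetical check, mirroring the attribute paths used in the debug lines.
def check_layernorm_mlp(model):
    mlp = model.model.module.language_model.encoder.layers._modules['0'].layernorm_mlp
    assert mlp.use_bias is False, f"got use_bias={mlp.use_bias}"
    assert mlp.normalization == "RMSNorm", f"got normalization={mlp.normalization}"
    assert mlp.activation == "fast-swiglu", f"got activation={mlp.activation}"
    assert tuple(mlp.fc1_weight.shape) == (3072, 768)
    assert tuple(mlp.fc2_weight.shape) == (768, 3072)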
Environment overview
- Docker
- nvidia/nemo:24.03 + git pull
- docker run --rm -it --entrypoint /bin/bash --network=host --runtime=nvidia --shm-size=2g nvcr.io/nvidia/nemo:24.01.01.framework
Environment details
The NVIDIA Docker image above is used, so the OS, PyTorch, and Python versions are those shipped in the container.