Dimension mismatch when running SFT with tensor parallel, sequence parallel, and LoRA
Describe the bug
I was trying to run SFT on the Mixtral-8x7B-Instruct model with tensor_model_parallel_size=4, sequence_parallel=True, and LoRA (target_modules=[all]). Training fails because the output dimensions of the original module and its corresponding LoRA adapter module do not match, so the two outputs cannot be added together.
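For context, here is a minimal single-process sketch of the failure mode (plain PyTorch, not NeMo/Megatron code; the shape values are copied from the traceback below, and treating one operand as the per-rank sequence shard and the other as the gathered full sequence is my reading of the error, since the 4x ratio matches tensor_model_parallel_size=4):

import torch

# Illustrative shapes taken from the error: dim 0 is the sequence dimension in
# Megatron's [s, b, h] layout, and 1600 = 4 * 400 matches tp_size=4.
# The hidden size here is a toy value.
local_seq, full_seq, batch, hidden = 400, 1600, 1, 64
mixed_qkv = torch.randn(local_seq, batch, hidden)      # sequence-parallel shard
lora_mixed_qkv = torch.randn(full_seq, batch, hidden)  # gathered full sequence

try:
    mixed_qkv + lora_mixed_qkv
except RuntimeError as err:
    print(err)  # The size of tensor a (400) must match the size of tensor b (1600) ...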
Steps/Code to reproduce bug
I used the recommended Docker image nvcr.io/nvidia/nemo:24.07, and my script is as follows:
export CUDA_VISIBLE_DEVICES="4,5,6,7"
nemo_ckpt_path=/path/to/Mixtral-8x7B-instruct-v0.1-nemo
train_data='/path/to/train.json'
valid_data='/path/to/valid.json'
output='/data/results/mixtral-8x7b-instruct-sft'
python /opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py \
trainer.precision=bf16 \
trainer.num_nodes=1 \
trainer.devices=4 \
trainer.sft.max_steps=-1 \
trainer.sft.limit_val_batches=40 \
trainer.sft.val_check_interval=1000 \
trainer.sft.max_epochs=2 \
model.tensor_model_parallel_size=4 \
model.sequence_parallel=True \
model.use_flash_attention=True \
model.encoder_seq_length=1024 \
model.megatron_amp_O2=False \
model.restore_from_path=$nemo_ckpt_path \
model.optim.lr=5e-6 \
model.optim.name=fused_adam \
model.answer_only_loss=True \
model.data.num_workers=0 \
model.data.train_ds.micro_batch_size=1 \
model.data.train_ds.global_batch_size=128 \
model.data.train_ds.file_path=$train_data \
model.data.validation_ds.micro_batch_size=1 \
model.data.validation_ds.global_batch_size=128 \
model.data.validation_ds.file_path=$valid_data \
model.peft.peft_scheme=lora \
model.peft.lora_tuning.target_modules=[attention_dense,mlp_fc1,mlp_fc2] \
model.peft.lora_tuning.adapter_dim=128 \
exp_manager.create_wandb_logger=False \
exp_manager.explicit_log_dir=$output \
exp_manager.wandb_logger_kwargs.project=sft_run \
exp_manager.wandb_logger_kwargs.name=ruozhiba_sft_run \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
exp_manager.resume_if_exists=True \
exp_manager.resume_ignore_no_checkpoint=True \
exp_manager.create_checkpoint_callback=True \
exp_manager.checkpoint_callback_params.monitor=validation_loss
It then fails with the following error:
...
File "/opt/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 145, in train_single_step
loss_mean, metrics = self.model.get_loss_and_metrics(batch=batch, forward_only=False)
File "/opt/NeMo-Aligner/nemo_aligner/models/nlp/gpt/gpt_sft_model.py", line 93, in get_loss_and_metrics
losses_reduced = fwd_bwd_function(
File "/opt/megatron-lm/megatron/core/pipeline_parallel/schedules.py", line 439, in forward_backward_no_pipelining
output_tensor, num_tokens = forward_step(
File "/opt/megatron-lm/megatron/core/pipeline_parallel/schedules.py", line 264, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
File "/opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 1273, in fwd_output_and_loss_func
output_tensor = model(**forward_args)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/megatron-lm/megatron/core/models/gpt/gpt_model.py", line 191, in forward
hidden_states = self.decoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/megatron-lm/megatron/core/transformer/transformer_block.py", line 411, in forward
hidden_states, context = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/megatron-lm/megatron/core/transformer/transformer_layer.py", line 178, in forward
attention_output_with_bias = self.self_attention(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/NeMo/nemo/collections/nlp/modules/common/megatron/adapters/mcore_mixins.py", line 202, in forward
query, key, value = self.get_query_key_value_tensors(hidden_states, key_value_states)
File "/opt/NeMo/nemo/collections/nlp/modules/common/megatron/adapters/mcore_mixins.py", line 117, in get_query_key_value_tensors
mixed_qkv = mixed_qkv + lora_mixed_qkv
RuntimeError: The size of tensor a (400) must match the size of tensor b (1600) at non-singleton dimension 0
I tried to work around this by modifying the following two files.
/opt/NeMo/nemo/collections/nlp/modules/common/megatron/adapters/mcore_mixins.py:
--- mcore_mixins.py 2024-08-28 02:51:12.000000000 +0000
+++ /opt/NeMo/nemo/collections/nlp/modules/common/megatron/adapters/mcore_mixins.py 2024-08-28 02:51:13.952291250 +0000
@@ -379,7 +379,6 @@
if lora_fc2_adapter and self.adapter_cfg[AdapterName.LORA_4HtoH_ADAPTER]['enabled']:
lora_output = lora_fc2_adapter(intermediate_parallel)
elif lora_moe_fc2_adapter and self.adapter_cfg[AdapterName.LORA_MOE_4HtoH_ADAPTER]['enabled']:
- lora_moe_fc2_adapter.expert_adapters[expert_idx].input_is_parallel=False
lora_output = lora_moe_fc2_adapter(intermediate_parallel, expert_idx)
output = output + lora_output
/opt/NeMo/nemo/collections/nlp/modules/common/megatron/adapters/parallel_adapters.py:
--- parallel_adapters.py 2024-08-28 02:52:03.000000000 +0000
+++ /opt/NeMo/nemo/collections/nlp/modules/common/megatron/adapters/parallel_adapters.py 2024-08-28 02:52:04.712871536 +0000
@@ -291,7 +291,7 @@
if self.norm_position == 'pre':
x = self.layer_norm(x)
- if self._sequence_parallel and not self.input_is_parallel and self.norm_position=='pre':
+ if self._sequence_parallel and not self.input_is_parallel:
# for attention_qkv and linear_fc1
# layernorm before lora is impacted by sequence parallel,
# hence seq dim need to be gathered right before lora linear layers
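In effect, the relaxed condition gathers the sequence dimension before the LoRA projections for every adapter whose input is not tensor-parallel, not only when norm_position == 'pre'. Here is a standalone sketch of that idea (plain PyTorch, single process; the gather is simulated with torch.cat, whereas the real adapter presumably uses Megatron's sequence-parallel gather, and all sizes/names are made up):

import torch

# Toy sizes: full sequence 1600, tp_size 4 -> per-rank shard of 400 tokens.
s, b, h, r, tp = 1600, 1, 64, 8, 4
lora_a = torch.nn.Linear(h, r, bias=False)  # LoRA down-projection
lora_b = torch.nn.Linear(r, h, bias=False)  # LoRA up-projection

def lora_branch(local_shard, all_shards, sequence_parallel, input_is_parallel):
    x = local_shard
    if sequence_parallel and not input_is_parallel:
        # Simulated sequence-parallel gather: re-assemble the full sequence
        # before the LoRA projections, regardless of norm_position, which is
        # what the relaxed condition above does.
        x = torch.cat(all_shards, dim=0)
    return lora_b(lora_a(x))

shards = list(torch.randn(s, b, h).chunk(tp, dim=0))  # four [400, 1, 64] shards
out = lora_branch(shards[0], shards, sequence_parallel=True, input_is_parallel=False)
print(out.shape)  # torch.Size([1600, 1, 64]): matches the base full-sequence output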
After these modifications I can run SFT with LoRA together with tensor and sequence parallelism, but I am not sure it runs correctly. I hope you can provide a proper fix.
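One way I could imagine sanity-checking the patch (a generic sketch using plain PyTorch forward hooks; the "adapter" substring used to pick out the LoRA modules is an assumption about how NeMo names them) is to log the sequence (dim 0) sizes flowing through each adapter for a single step and confirm they match the base projection's full-sequence output rather than the per-rank shard:

import torch

def attach_shape_loggers(model, name_filter="adapter"):
    # Register forward hooks that print the dim-0 (sequence) sizes going in and
    # out of every module whose name contains name_filter. Diagnostic only.
    handles = []
    for name, module in model.named_modules():
        if name_filter in name:
            def hook(mod, inputs, output, name=name):
                in0 = inputs[0].shape[0] if inputs and torch.is_tensor(inputs[0]) else None
                out = output[0] if isinstance(output, tuple) else output
                out0 = out.shape[0] if torch.is_tensor(out) else None
                print(f"{name}: input dim0={in0}, output dim0={out0}")
            handles.append(module.register_forward_hook(hook))
    return handles  # call handle.remove() on each hook after the check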
Expected behavior
LoRA can be used together with tensor and sequence parallelism.
Environment overview (please complete the following information)
Docker: nvcr.io/nvidia/nemo:24.07, launched with:
HERE=$(pwd -P)
user=`whoami`
uid=`id -u`
gid=`id -g`
docker run \
-v /dev/shm:/dev/shm \
-v /data:/data \
-e USER=$user -e UID=$uid -e GID=$gid \
-v $HERE:/home/dir/ \
-w /home/dir/ \
--security-opt \
seccomp=unconfined \
-it \
--rm \
--network=host \
--gpus all \
nvcr.io/nvidia/nemo:24.07
Environment details
I used the default environment of the NeMo Docker image.