Dimension mismatch when running SFT with tensor parallel, sequence parallel, and LoRA
Describe the bug
I was trying to run SFT on the Mixtral-8x7B-Instruct model with tensor_model_parallel_size=4, sequence_parallel=True, and LoRA (target_modules=[all]). Training fails because the output dimensions of the original module and its corresponding LoRA adapter module do not match, so the two outputs cannot be added together.
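For context, here is a minimal single-process sketch of the failure mode (plain PyTorch, not NeMo/Megatron code; the shape values are copied from the traceback below, and treating one operand as the per-rank sequence shard and the other as the gathered full sequence is my reading of the error, since the 4x ratio matches tensor_model_parallel_size=4):

import torch

# Illustrative shapes taken from the error: dim 0 is the sequence dimension in
# Megatron's [s, b, h] layout, and 1600 = 4 * 400 matches tp_size=4.
# The hidden size here is a toy value.
local_seq, full_seq, batch, hidden = 400, 1600, 1, 64
mixed_qkv = torch.randn(local_seq, batch, hidden)      # sequence-parallel shard
lora_mixed_qkv = torch.randn(full_seq, batch, hidden)  # gathered full sequence

try:
    mixed_qkv + lora_mixed_qkv
except RuntimeError as err:
    print(err)  # The size of tensor a (400) must match the size of tensor b (1600) ...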
Steps/Code to reproduce bug
I used the recommended Docker image nvcr.io/nvidia/nemo:24.07, and my script is as follows:
export CUDA_VISIBLE_DEVICES="4,5,6,7"
nemo_ckpt_path=/path/to/Mixtral-8x7B-instruct-v0.1-nemo
train_data='/path/to/train.json'
valid_data='/path/to/valid.json'
output='/data/results/mixtral-8x7b-instruct-sft'
python /opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py \
trainer.precision=bf16 \
trainer.num_nodes=1 \
trainer.devices=4 \
trainer.sft.max_steps=-1 \
trainer.sft.limit_val_batches=40 \
trainer.sft.val_check_interval=1000 \
trainer.sft.max_epochs=2 \
model.tensor_model_parallel_size=4 \
model.sequence_parallel=True \
model.use_flash_attention=True \
model.encoder_seq_length=1024 \
model.megatron_amp_O2=False \
model.restore_from_path=$nemo_ckpt_path \
model.optim.lr=5e-6 \
model.optim.name=fused_adam \
model.answer_only_loss=True \
model.data.num_workers=0 \
model.data.train_ds.micro_batch_size=1 \
model.data.train_ds.global_batch_size=128 \
model.data.train_ds.file_path=$train_data \
model.data.validation_ds.micro_batch_size=1 \
model.data.validation_ds.global_batch_size=128 \
model.data.validation_ds.file_path=$valid_data \
model.peft.peft_scheme=lora \
model.peft.lora_tuning.target_modules=[attention_dense,mlp_fc1,mlp_fc2] \
model.peft.lora_tuning.adapter_dim=128 \
exp_manager.create_wandb_logger=False \
exp_manager.explicit_log_dir=$output \
exp_manager.wandb_logger_kwargs.project=sft_run \
exp_manager.wandb_logger_kwargs.name=ruozhiba_sft_run \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
exp_manager.resume_if_exists=True \
exp_manager.resume_ignore_no_checkpoint=True \
exp_manager.create_checkpoint_callback=True \
exp_manager.checkpoint_callback_params.monitor=validation_loss
It then fails with the following error:
...
File "/opt/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 145, in train_single_step
loss_mean, metrics = self.model.get_loss_and_metrics(batch=batch, forward_only=False)
File "/opt/NeMo-Aligner/nemo_aligner/models/nlp/gpt/gpt_sft_model.py", line 93, in get_loss_and_metrics
losses_reduced = fwd_bwd_function(
File "/opt/megatron-lm/megatron/core/pipeline_parallel/schedules.py", line 439, in forward_backward_no_pipelining
output_tensor, num_tokens = forward_step(
File "/opt/megatron-lm/megatron/core/pipeline_parallel/schedules.py", line 264, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
File "/opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 1273, in fwd_output_and_loss_func
output_tensor = model(**forward_args)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/megatron-lm/megatron/core/models/gpt/gpt_model.py", line 191, in forward
hidden_states = self.decoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/megatron-lm/megatron/core/transformer/transformer_block.py", line 411, in forward
hidden_states, context = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/megatron-lm/megatron/core/transformer/transformer_layer.py", line 178, in forward
attention_output_with_bias = self.self_attention(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/NeMo/nemo/collections/nlp/modules/common/megatron/adapters/mcore_mixins.py", line 202, in forward
query, key, value = self.get_query_key_value_tensors(hidden_states, key_value_states)
File "/opt/NeMo/nemo/collections/nlp/modules/common/megatron/adapters/mcore_mixins.py", line 117, in get_query_key_value_tensors
mixed_qkv = mixed_qkv + lora_mixed_qkv
RuntimeError: The size of tensor a (400) must match the size of tensor b (1600) at non-singleton dimension 0
I tried to work around this by modifying the following two files.
/opt/NeMo/nemo/collections/nlp/modules/common/megatron/adapters/mcore_mixins.py:
--- mcore_mixins.py 2024-08-28 02:51:12.000000000 +0000
+++ /opt/NeMo/nemo/collections/nlp/modules/common/megatron/adapters/mcore_mixins.py 2024-08-28 02:51:13.952291250 +0000
@@ -379,7 +379,6 @@
if lora_fc2_adapter and self.adapter_cfg[AdapterName.LORA_4HtoH_ADAPTER]['enabled']:
lora_output = lora_fc2_adapter(intermediate_parallel)
elif lora_moe_fc2_adapter and self.adapter_cfg[AdapterName.LORA_MOE_4HtoH_ADAPTER]['enabled']:
- lora_moe_fc2_adapter.expert_adapters[expert_idx].input_is_parallel=False
lora_output = lora_moe_fc2_adapter(intermediate_parallel, expert_idx)
output = output + lora_output
/opt/NeMo/nemo/collections/nlp/modules/common/megatron/adapters/parallel_adapters.py:
--- parallel_adapters.py 2024-08-28 02:52:03.000000000 +0000
+++ /opt/NeMo/nemo/collections/nlp/modules/common/megatron/adapters/parallel_adapters.py 2024-08-28 02:52:04.712871536 +0000
@@ -291,7 +291,7 @@
if self.norm_position == 'pre':
x = self.layer_norm(x)
- if self._sequence_parallel and not self.input_is_parallel and self.norm_position=='pre':
+ if self._sequence_parallel and not self.input_is_parallel:
# for attention_qkv and linear_fc1
# layernorm before lora is impacted by sequence parallel,
# hence seq dim need to be gathered right before lora linear layers
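In effect, the relaxed condition gathers the sequence dimension before the LoRA projections for every adapter whose input is not tensor-parallel, not only when norm_position == 'pre'. Here is a standalone sketch of that idea (plain PyTorch, single process; the gather is simulated with torch.cat, whereas the real adapter presumably uses Megatron's sequence-parallel gather, and all sizes/names are made up):

import torch

# Toy sizes: full sequence 1600, tp_size 4 -> per-rank shard of 400 tokens.
s, b, h, r, tp = 1600, 1, 64, 8, 4
lora_a = torch.nn.Linear(h, r, bias=False)  # LoRA down-projection
lora_b = torch.nn.Linear(r, h, bias=False)  # LoRA up-projection

def lora_branch(local_shard, all_shards, sequence_parallel, input_is_parallel):
    x = local_shard
    if sequence_parallel and not input_is_parallel:
        # Simulated sequence-parallel gather: re-assemble the full sequence
        # before the LoRA projections, regardless of norm_position, which is
        # what the relaxed condition above does.
        x = torch.cat(all_shards, dim=0)
    return lora_b(lora_a(x))

shards = list(torch.randn(s, b, h).chunk(tp, dim=0))  # four [400, 1, 64] shards
out = lora_branch(shards[0], shards, sequence_parallel=True, input_is_parallel=False)
print(out.shape)  # torch.Size([1600, 1, 64]): matches the base full-sequence output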
After these modifications I can run SFT with LoRA together with tensor and sequence parallelism, but I am not sure it runs correctly. I hope you can provide a proper fix.
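One way I could imagine sanity-checking the patch (a generic sketch using plain PyTorch forward hooks; the "adapter" substring used to pick out the LoRA modules is an assumption about how NeMo names them) is to log the sequence (dim 0) sizes flowing through each adapter for a single step and confirm they match the base projection's full-sequence output rather than the per-rank shard:

import torch

def attach_shape_loggers(model, name_filter="adapter"):
    # Register forward hooks that print the dim-0 (sequence) sizes going in and
    # out of every module whose name contains name_filter. Diagnostic only.
    handles = []
    for name, module in model.named_modules():
        if name_filter in name:
            def hook(mod, inputs, output, name=name):
                in0 = inputs[0].shape[0] if inputs and torch.is_tensor(inputs[0]) else None
                out = output[0] if isinstance(output, tuple) else output
                out0 = out.shape[0] if torch.is_tensor(out) else None
                print(f"{name}: input dim0={in0}, output dim0={out0}")
            handles.append(module.register_forward_hook(hook))
    return handles  # call handle.remove() on each hook after the check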
Expected behavior
LoRA can be used together with tensor and sequence parallelism.
Environment overview (please complete the following information)
Docker: nvcr.io/nvidia/nemo:24.07, launched with:
HERE=$(pwd -P)
user=`whoami`
uid=`id -u`
gid=`id -g`
docker run \
-v /dev/shm:/dev/shm \
-v /data:/data \
-e USER=$user -e UID=$uid -e GID=$gid \
-v $HERE:/home/dir/ \
-w /home/dir/ \
--security-opt \
seccomp=unconfined \
-it \
--rm \
--network=host \
--gpus all \
nvcr.io/nvidia/nemo:24.07
Environment details
I used the default environment of the NeMo Docker image.