
Loading the fine-tuned model fails

zxsimple opened this issue 2 years ago · 12 comments

I've fine-tuned the llama-7b model with the following command:

#!/bin/bash

deepspeed_args="--num_gpus=8 --master_port=11000"

exp_id=llama-7b-v2
project_dir=XXXX
base_model_path=${project_dir}/models/pinkmanlove/llama-7b-hf
lora_model_path=${project_dir}/models/llama7b-lora-380k

output_dir=${project_dir}/output_models/${exp_id}
log_dir=${project_dir}/log/${exp_id}

dataset_path=${project_dir}/dataset/train_2M_CN/lmflow/

mkdir -p ${output_dir} ${log_dir}

deepspeed ${deepspeed_args} \
  finetune.py \
    --model_name_or_path ${base_model_path} \
    --lora_model_path ${lora_model_path} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --num_train_epochs 2 \
    --learning_rate 1e-4 \
    --block_size 512 \
    --per_device_train_batch_size 1 \
    --use_lora 1 \
    --lora_r 10 \
    --deepspeed configs/ds_config_zero3.json \
    --run_name finetune_with_lora \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 8 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err

The output directory of the fine-tuned model is ${project_dir}/output_models/llama-7b-v2, and it contains the following checkpoints:

.
├── adapter_config.json
├── adapter_model.bin
├── all_results.json
├── checkpoint-25000
│   ├── global_step25000
│   │   ├── zero_pp_rank_0_mp_rank_00_model_states.pt
│   │   ├── zero_pp_rank_0_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_1_mp_rank_00_model_states.pt
│   │   ├── zero_pp_rank_1_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_2_mp_rank_00_model_states.pt
│   │   ├── zero_pp_rank_2_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_3_mp_rank_00_model_states.pt
│   │   ├── zero_pp_rank_3_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_4_mp_rank_00_model_states.pt
│   │   ├── zero_pp_rank_4_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_5_mp_rank_00_model_states.pt
│   │   ├── zero_pp_rank_5_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_6_mp_rank_00_model_states.pt
│   │   ├── zero_pp_rank_6_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_7_mp_rank_00_model_states.pt
│   │   └── zero_pp_rank_7_mp_rank_00_optim_states.pt
│   ├── latest
│   ├── pytorch_model.bin
│   ├── rng_state_0.pth
│   ├── rng_state_1.pth
│   ├── rng_state_2.pth
│   ├── rng_state_3.pth
│   ├── rng_state_4.pth
│   ├── rng_state_5.pth
│   ├── rng_state_6.pth
│   ├── rng_state_7.pth
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   ├── tokenizer.model
│   ├── trainer_state.json
│   ├── training_args.bin
│   └── zero_to_fp32.py
├── README.md
├── trainer_state.json
└── train_results.json

When I tried to load the model with the command

./scripts/run_chatbot.sh \
     ${project_dir}/output_models/llama-7b-v2 \
    ${project_dir}/models/llama7b-lora-380k

it complains: OSError: ${project_dir}/output_models/llama-7b-v2 does not appear to have a file named config.json. Checkout 'https://huggingface.co/${project_dir}//output_models/llama-7b-v2/None' for available files. However, when I followed the suggestion in #290 and loaded the model with the command

./scripts/run_chatbot.sh \
     ${project_dir}/models/pinkmanlove/llama-7b-hf/ \
     ${project_dir}/output_models/llama-7b-v2/

I got the error log:

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.18s/it]
Traceback (most recent call last):
  File "/apdcephfs/share_698083/xishengzhao/LMFlow/examples/chatbot.py", line 159, in <module>
    main()
  File "/apdcephfs/share_698083/xishengzhao/LMFlow/examples/chatbot.py", line 73, in main
    model = AutoModel.get_model(
  File "/LMFlow/src/lmflow/models/auto_model.py", line 14, in get_model
    return HFDecoderModel(model_args, *args, **kwargs)
  File "/LMFlow/src/lmflow/models/hf_decoder_model.py", line 219, in __init__
    self.backend_model = PeftModel.from_pretrained(
  File "/venv/lib/python3.9/site-packages/peft/peft_model.py", line 161, in from_pretrained
    model = set_peft_model_state_dict(model, adapters_weights)
  File "/venv/lib/python3.9/site-packages/peft/utils/save_and_load.py", line 74, in set_peft_model_state_dict
    model.load_state_dict(peft_model_state_dict, strict=False)
  File "/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
        size mismatch for base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([10, 4096]).
        size mismatch for base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 10]).

Any suggestions for correctly loading fine-tuned models? Much appreciated!
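
For reference, one way to check whether the adapter weights were actually written out is to inspect the tensor shapes in adapter_model.bin (a minimal sketch, assuming PyTorch is installed and the script is run from the output directory):

import torch

# Load the saved LoRA adapter state dict onto the CPU and print every tensor's shape.
# Shapes like torch.Size([0]) would mean the weights were never gathered before saving
# (e.g. they are still partitioned by ZeRO-3).
state_dict = torch.load("adapter_model.bin", map_location="cpu")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))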

zxsimple commented on Apr 25, 2023

Please try the following command:

./scripts/run_chatbot.sh \
     pinkmanlove/llama-7b-hf/ \
     ${project_dir}/output_models/llama-7b-v2/
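
For context, the first argument is the base model and the second is the LoRA adapter directory. Judging from the traceback above, the loading boils down to roughly the following (a sketch, not the exact LMFlow code):

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base LLaMA weights, then attach the saved LoRA adapter on top of them.
base = AutoModelForCausalLM.from_pretrained("pinkmanlove/llama-7b-hf")
model = PeftModel.from_pretrained(base, "output_models/llama-7b-v2")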

shizhediao commented on Apr 25, 2023

Please try the following command:

./scripts/run_chatbot.sh \
     pinkmanlove/llama-7b-hf/ \
     ${project_dir}/output_models/llama-7b-v2/

Sure, I tried that before, and I got the error I mentioned previously:

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.18s/it]
Traceback (most recent call last):
  File "/apdcephfs/share_698083/xishengzhao/LMFlow/examples/chatbot.py", line 159, in <module>
    main()
  File "/apdcephfs/share_698083/xishengzhao/LMFlow/examples/chatbot.py", line 73, in main
    model = AutoModel.get_model(
  File "/LMFlow/src/lmflow/models/auto_model.py", line 14, in get_model
    return HFDecoderModel(model_args, *args, **kwargs)
  File "/LMFlow/src/lmflow/models/hf_decoder_model.py", line 219, in __init__
    self.backend_model = PeftModel.from_pretrained(
  File "/venv/lib/python3.9/site-packages/peft/peft_model.py", line 161, in from_pretrained
    model = set_peft_model_state_dict(model, adapters_weights)
  File "/venv/lib/python3.9/site-packages/peft/utils/save_and_load.py", line 74, in set_peft_model_state_dict
    model.load_state_dict(peft_model_state_dict, strict=False)
  File "/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
        size mismatch for base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([10, 4096]).
        size mismatch for base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 10]).

I am now training the model with the save_aggregated_lora option and will try again later.

BTW, the save_aggregated_lora feature is unavailable in the latest Docker image.
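
For reference, merging the adapter into the base model by hand would look roughly like the sketch below (assuming a peft version that provides merge_and_unload; the merged output path is only an example):

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model plus the adapter, fold the LoRA weights into the base weights,
# and save a standalone model that no longer needs peft at inference time.
base = AutoModelForCausalLM.from_pretrained("pinkmanlove/llama-7b-hf")
model = PeftModel.from_pretrained(base, "output_models/llama-7b-v2")
merged = model.merge_and_unload()
merged.save_pretrained("output_models/llama-7b-v2-merged")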

zxsimple commented on Apr 26, 2023

That is weird. Could you make sure the peft version matches the requirement? You could try this:

pip uninstall peft
pip install git+https://github.com/huggingface/peft.git@deff03f2c251534fffd2511fc2d440e84cc54b1b
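
After reinstalling, a quick way to confirm what is installed (this prints the package version, not the exact commit):

import peft

print(peft.__version__)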

shizhediao commented on Apr 26, 2023

BTW, can you try ZeRO-2 rather than ZeRO-3? ZeRO-3 often causes problems due to DeepSpeed.

hendrydong commented on Apr 26, 2023

Any tips for training with ZeRO-2? The K8s pod gets killed due to RAM/GPU memory overhead.

zxsimple commented on Apr 27, 2023

What is the size of your RAM and GPU memory?

shizhediao commented on Apr 27, 2023

The process is killed soon after loading checkpoints if I specify the ZeRO-2 config, even when I reduce --block_size and --dataloader_num_workers.

[2023-04-27 11:10:19,080] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-04-27 11:10:23,115] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
......
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199

RAM consumption grows very fast.

root@server:/trainer# free -g
              total        used        free      shared  buff/cache   available
Mem:            359          59         299           0           0         299
Swap:           719           0         719
root@server:/trainer# free -g
              total        used        free      shared  buff/cache   available
Mem:            359         100         258           0           0         258
Swap:           719           0         719
root@server:/trainer# free -g
              total        used        free      shared  buff/cache   available
Mem:            359         133         225           0           0         226
Swap:           719           0         719
root@server:/trainer# free -g
              total        used        free      shared  buff/cache   available
Mem:            359         195         163           0           0         164
Swap:           719           0         719
root@server:/trainer# free -g
              total        used        free      shared  buff/cache   available
Mem:            359         215         141           0           2         143
Swap:           719           0         719
root@server:/trainer# free -g
              total        used        free      shared  buff/cache   available
Mem:            359         252         100           0           6         106
Swap:           719           0         719
root@server:/trainer# free -g
              total        used        free      shared  buff/cache   available
Mem:            359         278          70           0           9          80
Swap:           719           0         719
root@server:/trainer# free -g
              total        used        free      shared  buff/cache   available
Mem:            359         225         121           0          12         134
Swap:           719           0         719
root@server:/trainer# free -g
              total        used        free      shared  buff/cache   available
Mem:            359         213         132           0          13         145
Swap:           719           0         719

zxsimple commented on Apr 27, 2023

What is the size of your RAM and GPU memory?

RAM: 360 GB; GPU: 8 × V100 (32 GB each)

zxsimple commented on Apr 27, 2023

What about reducing the number of GPUs, for example, using only one GPU? I think the resources are sufficient for training.

shizhediao commented on Apr 27, 2023

What about reducing the number of GPUs, for example, using only one GPU? I think the resources are sufficient for training.

Training on a single GPU works, but it is a waste of resources. The size-mismatch error when loading a model trained with the ZeRO-3 configuration has a related issue. Would you please refer to it and fix it, so that we can use multiple GPUs to train a single model?
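
For reference, the [0]-shaped adapter tensors look consistent with ZeRO-3 leaving the parameters partitioned at save time. Below is a hedged sketch of a config tweak that may be related, not a confirmed fix; the gather flag name and the output filename are assumptions:

import json

# Sketch of a ZeRO-3 config that asks DeepSpeed to gather the full 16-bit weights
# when the model is saved, instead of leaving them sharded across ranks.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
with open("configs/ds_config_zero3_gather.json", "w") as f:
    json.dump(ds_config, f, indent=2)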

zxsimple commented on May 4, 2023

I am having the same problem here. Is anyone fixing the ZeRO-3 issue?

alibabadoufu commented on Jun 25, 2023

Hi, are you using multiple GPUs as well? If so, could you refer to the issue mentioned by zxsimple: https://github.com/huggingface/transformers/issues/20082

shizhediao commented on Jun 26, 2023