Loading a fine-tuned model fails
I've fine-tuned the llama-7b model with this command:
#!/bin/bash
deepspeed_args="--num_gpus=8 --master_port=11000"
exp_id=llama-7b-v2
project_dir=XXXX
base_model_path=${project_dir}/models/pinkmanlove/llama-7b-hf
lora_model_path=${project_dir}/models/llama7b-lora-380k
output_dir=${project_dir}/output_models/${exp_id}
log_dir=${project_dir}/log/${exp_id}
dataset_path=${project_dir}/dataset/train_2M_CN/lmflow/
mkdir -p ${output_dir} ${log_dir}
deepspeed ${deepspeed_args} \
finetune.py \
--model_name_or_path ${base_model_path} \
--lora_model_path ${lora_model_path} \
--dataset_path ${dataset_path} \
--output_dir ${output_dir} --overwrite_output_dir \
--num_train_epochs 2 \
--learning_rate 1e-4 \
--block_size 512 \
--per_device_train_batch_size 1 \
--use_lora 1 \
--lora_r 10 \
--deepspeed configs/ds_config_zero3.json \
--run_name finetune_with_lora \
--validation_split_percentage 0 \
--logging_steps 20 \
--do_train \
--ddp_timeout 72000 \
--save_steps 5000 \
--dataloader_num_workers 8 \
| tee ${log_dir}/train.log \
2> ${log_dir}/train.err
The output of the fine-tuned model is ${project_dir}/output_models/llama-7b-v2, and it contains the following checkpoints:
.
├── adapter_config.json
├── adapter_model.bin
├── all_results.json
├── checkpoint-25000
│ ├── global_step25000
│ │ ├── zero_pp_rank_0_mp_rank_00_model_states.pt
│ │ ├── zero_pp_rank_0_mp_rank_00_optim_states.pt
│ │ ├── zero_pp_rank_1_mp_rank_00_model_states.pt
│ │ ├── zero_pp_rank_1_mp_rank_00_optim_states.pt
│ │ ├── zero_pp_rank_2_mp_rank_00_model_states.pt
│ │ ├── zero_pp_rank_2_mp_rank_00_optim_states.pt
│ │ ├── zero_pp_rank_3_mp_rank_00_model_states.pt
│ │ ├── zero_pp_rank_3_mp_rank_00_optim_states.pt
│ │ ├── zero_pp_rank_4_mp_rank_00_model_states.pt
│ │ ├── zero_pp_rank_4_mp_rank_00_optim_states.pt
│ │ ├── zero_pp_rank_5_mp_rank_00_model_states.pt
│ │ ├── zero_pp_rank_5_mp_rank_00_optim_states.pt
│ │ ├── zero_pp_rank_6_mp_rank_00_model_states.pt
│ │ ├── zero_pp_rank_6_mp_rank_00_optim_states.pt
│ │ ├── zero_pp_rank_7_mp_rank_00_model_states.pt
│ │ └── zero_pp_rank_7_mp_rank_00_optim_states.pt
│ ├── latest
│ ├── pytorch_model.bin
│ ├── rng_state_0.pth
│ ├── rng_state_1.pth
│ ├── rng_state_2.pth
│ ├── rng_state_3.pth
│ ├── rng_state_4.pth
│ ├── rng_state_5.pth
│ ├── rng_state_6.pth
│ ├── rng_state_7.pth
│ ├── special_tokens_map.json
│ ├── tokenizer_config.json
│ ├── tokenizer.model
│ ├── trainer_state.json
│ ├── training_args.bin
│ └── zero_to_fp32.py
├── README.md
├── trainer_state.json
└── train_results.json
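For reference, the top-level adapter_model.bin can be inspected with a short snippet like the one below (a quick sketch; the path is assumed from the tree above) to see which LoRA tensors it actually contains and what their shapes are:

# Print the names and shapes of the tensors stored in the exported adapter.
# The path is an assumption based on the directory tree above.
import torch

adapter_path = "output_models/llama-7b-v2/adapter_model.bin"
state_dict = torch.load(adapter_path, map_location="cpu")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))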
When I tried to load the model with the command
./scripts/run_chatbot.sh \
${project_dir}/output_models/llama-7b-v2 \
${project_dir}/models/llama7b-lora-380k
it complains: OSError: ${project_dir}/output_models/llama-7b-v2 does not appear to have a file named config.json. Checkout 'https://huggingface.co/${project_dir}//output_models/llama-7b-v2/None' for available files.
However, when I followed suggestion 290 and loaded the model with the command
./scripts/run_chatbot.sh \
${project_dir}/models/pinkmanlove/llama-7b-hf/ \
${project_dir}/output_models/llama-7b-v2/
I got the error log:
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00, 5.18s/it]
Traceback (most recent call last):
File "/apdcephfs/share_698083/xishengzhao/LMFlow/examples/chatbot.py", line 159, in <module>
main()
File "/apdcephfs/share_698083/xishengzhao/LMFlow/examples/chatbot.py", line 73, in main
model = AutoModel.get_model(
File "/LMFlow/src/lmflow/models/auto_model.py", line 14, in get_model
return HFDecoderModel(model_args, *args, **kwargs)
File "/LMFlow/src/lmflow/models/hf_decoder_model.py", line 219, in __init__
self.backend_model = PeftModel.from_pretrained(
File "/venv/lib/python3.9/site-packages/peft/peft_model.py", line 161, in from_pretrained
model = set_peft_model_state_dict(model, adapters_weights)
File "/venv/lib/python3.9/site-packages/peft/utils/save_and_load.py", line 74, in set_peft_model_state_dict
model.load_state_dict(peft_model_state_dict, strict=False)
File "/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
size mismatch for base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([10, 4096]).
size mismatch for base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 10]).
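One workaround I have been considering (untested, and the paths and key filter below are assumptions) is to rebuild the LoRA weights directly from the ZeRO-3 shards using DeepSpeed's zero_to_fp32 utility, the same script DeepSpeed drops into each checkpoint directory:

# Sketch: consolidate the ZeRO-3 shards into a full fp32 state dict, keep only
# the LoRA tensors, and overwrite the (empty) exported adapter_model.bin.
# Whether the exact key names match what PEFT expects depends on the
# DeepSpeed/PEFT versions in use.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

ckpt_dir = "output_models/llama-7b-v2/checkpoint-25000"
full_state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir)

lora_state_dict = {k: v for k, v in full_state_dict.items() if "lora_" in k}
print(f"recovered {len(lora_state_dict)} LoRA tensors")
torch.save(lora_state_dict, "output_models/llama-7b-v2/adapter_model.bin")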
Any suggestions for correctly loading the fine-tuned model? Much appreciated!
Please try the following command:
./scripts/run_chatbot.sh \
pinkmanlove/llama-7b-hf/ \
${project_dir}/output_models/llama-7b-v2/
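Under the hood this just loads the base model and then attaches the exported LoRA adapter, roughly like the sketch below (illustrative only; per your traceback, LMFlow's hf_decoder_model.py does the equivalent via PeftModel.from_pretrained):

# Illustrative sketch of what the two arguments to run_chatbot.sh map to.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("pinkmanlove/llama-7b-hf")
model = PeftModel.from_pretrained(base_model, "output_models/llama-7b-v2")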
Sure, I tried it like that before and got the same error I mentioned previously:
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00, 5.18s/it]
Traceback (most recent call last):
File "/apdcephfs/share_698083/xishengzhao/LMFlow/examples/chatbot.py", line 159, in <module>
main()
File "/apdcephfs/share_698083/xishengzhao/LMFlow/examples/chatbot.py", line 73, in main
model = AutoModel.get_model(
File "/LMFlow/src/lmflow/models/auto_model.py", line 14, in get_model
return HFDecoderModel(model_args, *args, **kwargs)
File "/LMFlow/src/lmflow/models/hf_decoder_model.py", line 219, in __init__
self.backend_model = PeftModel.from_pretrained(
File "/venv/lib/python3.9/site-packages/peft/peft_model.py", line 161, in from_pretrained
model = set_peft_model_state_dict(model, adapters_weights)
File "/venv/lib/python3.9/site-packages/peft/utils/save_and_load.py", line 74, in set_peft_model_state_dict
model.load_state_dict(peft_model_state_dict, strict=False)
File "/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
size mismatch for base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([10, 4096]).
size mismatch for base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4096, 10]).
I am now training the model with the save_aggregated_lora option and will try again later.
BTW, the save_aggregated_lora feature is unavailable in the latest Docker image.
That is weird. Could you make sure the peft version matches the requirement? You could try this:
pip uninstall peft
pip install git+https://github.com/huggingface/peft.git@deff03f2c251534fffd2511fc2d440e84cc54b1b
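After reinstalling, a quick check (the exact version string will vary) to confirm which peft build is actually being imported:

# Verify the peft install that Python picks up.
import peft
print(peft.__version__)
print(peft.__file__)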
BTW, can you try ZeRO-2 rather than ZeRO-3? ZeRO-3 often causes problems due to DeepSpeed.
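If a ZeRO-2 config is not already available in your configs directory, a minimal one looks roughly like this (a sketch only; the "auto" values defer to the HuggingFace Trainer arguments, and optimizer offloading is deliberately left out to avoid extra host-RAM use):

# Minimal ZeRO-2 DeepSpeed config sketch, written out as configs/ds_config_zero2.json.
import json

ds_config_zero2 = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("configs/ds_config_zero2.json", "w") as f:
    json.dump(ds_config_zero2, f, indent=2)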
Any tips for training with ZeRO-2? The K8s pod gets killed due to RAM/GPU memory overhead.
What is the size of your RAM and GPU memory?
The process is killed soon after loading checkpoints when I specify the ZeRO-2 config, even after reducing --block_size and --dataloader_num_workers:
[2023-04-27 11:10:19,080] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-04-27 11:10:23,115] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
......
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
RAM consumption grows very fast.
root@server:/trainer# free -g
total used free shared buff/cache available
Mem: 359 59 299 0 0 299
Swap: 719 0 719
root@server:/trainer# free -g
total used free shared buff/cache available
Mem: 359 100 258 0 0 258
Swap: 719 0 719
root@server:/trainer# free -g
total used free shared buff/cache available
Mem: 359 133 225 0 0 226
Swap: 719 0 719
root@server:/trainer# free -g
total used free shared buff/cache available
Mem: 359 195 163 0 0 164
Swap: 719 0 719
root@server:/trainer# free -g
total used free shared buff/cache available
Mem: 359 215 141 0 2 143
Swap: 719 0 719
root@server:/trainer# free -g
total used free shared buff/cache available
Mem: 359 252 100 0 6 106
Swap: 719 0 719
root@server:/trainer# free -g
total used free shared buff/cache available
Mem: 359 278 70 0 9 80
Swap: 719 0 719
root@server:/trainer# free -g
total used free shared buff/cache available
Mem: 359 225 121 0 12 134
Swap: 719 0 719
root@server:/trainer# free -g
total used free shared buff/cache available
Mem: 359 213 132 0 13 145
Swap: 719 0 719
What is the size of your RAM and GPU memory?
RAM: 360 GB; GPU: 8 × V100 (32 GB each).
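For a rough sense of scale (this assumes every rank materializes the full fp32 model in host RAM while loading, which may not match LMFlow's exact loading path):

# Back-of-envelope host-RAM estimate for 8 ranks on one node.
n_params = 6_742_609_920                 # "all params" reported in the log above
per_rank_gib = n_params * 4 / 2**30      # fp32 copy per process
print(f"{per_rank_gib:.0f} GiB per rank, ~{8 * per_rank_gib:.0f} GiB across 8 ranks")

That comes to roughly 25 GiB per rank and about 200 GiB across 8 ranks, which is in the same ballpark as the usage climbing past 250 GB in the free -g output above.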
What about reducing the number of GPUs, for example, using only one GPU? I think the resources are sufficient for training.
Training on only one GPU works, but it is a waste of resources. The size-mismatch error when loading a model trained with the ZeRO-3 configuration has a related issue. Would you please refer to it and fix it, so that we can use more GPUs to train a single model?
I am having the same problem here. Is anyone working on a fix for the ZeRO-3 issue?
Hi, are you using multiple GPUs as well? If so, could you refer to this issue mentioned by zxsimple: https://github.com/huggingface/transformers/issues/20082
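One workaround that is sometimes suggested for the ZeRO-3 empty-tensor problem (treat this as an assumption rather than a fix verified against LMFlow) is to have DeepSpeed gather the full 16-bit weights at save time via stage3_gather_16bit_weights_on_model_save in the ZeRO-3 config, for example:

# Sketch: enable full 16-bit weight gathering on save in the existing ZeRO-3 config.
# This makes saved checkpoints usable without manual reconstruction, at the cost
# of extra save time and host memory.
import json

with open("configs/ds_config_zero3.json") as f:
    cfg = json.load(f)

cfg["zero_optimization"]["stage3_gather_16bit_weights_on_model_save"] = True

with open("configs/ds_config_zero3.json", "w") as f:
    json.dump(cfg, f, indent=2)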