Failed to Resume Training from LoRA Adapter Checkpoint

Open tommycwh opened this issue 9 months ago • 0 comments

Following the Phi Cookbook example of finetuning phi-3.5-vision with QLoRA, I was running the finetuning script when it got interrupted, so I want to resume the finetuning process. I want to restore not only the model weights but also the optimizer and trainer states (e.g., training step, learning rate), so that the finetuning process resumes from where it was interrupted. However, when the trainer tries to load the checkpoint files, it gives this error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "Phi-3CookBook/code/04.Finetuning/vision_finetuning/finetune_hf_trainer_docvqa.py", line 568, in <module>
[rank0]:     main()
[rank0]:   File "Phi-3CookBook/code/04.Finetuning/vision_finetuning/finetune_hf_trainer_docvqa.py", line 510, in main
[rank0]:     trainer.train(resume_from_checkpoint=args.ckpt_path)
[rank0]:   File "lib/python3.11/site-packages/transformers/trainer.py", line 2138, in train
[rank0]:     state = TrainerState.load_from_json(os.path.join(resume_from_checkpoint, TRAINER_STATE_NAME))
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "lib/python3.11/site-packages/transformers/trainer_callback.py", line 149, in load_from_json
[rank0]:     with open(json_path, "r", encoding="utf-8") as f:
[rank0]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: 'Phi-3CookBook/code/04.Finetuning/vision_finetuning/output/trainer_state.json'

It seems that the file "trainer_state.json" is required, but it is not among the saved checkpoint files.
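For reference, here is a minimal sketch of one way to check where "trainer_state.json" actually lives. It assumes the usual Hugging Face Trainer layout, in which periodic checkpoints are written to "checkpoint-<step>" subfolders of the output directory. Whether this script follows that layout is an assumption, and find_resumable_checkpoints is a hypothetical helper, not part of transformers.

```python
import os
import re

STATE_FILE = "trainer_state.json"  # file name taken from the traceback above
_CKPT_RE = re.compile(r"^checkpoint-(\d+)$")  # assumed Trainer folder naming


def find_resumable_checkpoints(output_dir):
    """Return checkpoint subfolders that contain trainer_state.json, lowest step first."""
    found = []
    for name in os.listdir(output_dir):
        match = _CKPT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Keep only folders named checkpoint-<step> that hold the state file.
        if match and os.path.isfile(os.path.join(path, STATE_FILE)):
            found.append((int(match.group(1)), path))
    # Sort numerically by step so that checkpoint-10 comes after checkpoint-2.
    return [path for _, path in sorted(found)]
```

If the returned list is non-empty, its last entry is the most recent folder that could be passed as the checkpoint path. An empty list would suggest that no trainer state was saved at all.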

As for the script, I ran "code/03.Finetuning/vision_finetuning/finetune_hf_trainer_docvqa.py". I made some modifications because the original script does not seem to support resuming from a checkpoint.

The modifications I made:

...
with accelerator.local_main_process_first():
    processor = AutoProcessor.from_pretrained(
        args.model_name_or_path, trust_remote_code=True, num_crops=args.num_crops
    )
    ...
##### Modification: to avoid error from save_pretrained: 
#   AttributeError: 'Phi3VProcessor' object has no attribute 'chat_template'
processor.chat_template = processor.tokenizer.chat_template

...

##### Modification: to reduce number of train/eval samples and training steps for quickly testing
train_dataset = train_dataset.take(10)
eval_dataset = eval_dataset.take(10)

training_args = TrainingArguments(
        ...
        max_steps=args.max_steps,
        save_steps=args.save_steps,
)

##### Modification: to resume from adapter checkpoint
trainer.train(resume_from_checkpoint=args.ckpt_path)
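
A minimal sketch of a guard that could sit in front of this call, so that a bad path fails early with a message listing what the directory actually contains. resolve_resume_path is a hypothetical helper, and it assumes that args.ckpt_path is empty or None when no resume is requested.

```python
import os

STATE_FILE = "trainer_state.json"  # file name taken from the traceback


def resolve_resume_path(ckpt_path):
    """Return ckpt_path if it holds trainer_state.json; None if no path was given."""
    if not ckpt_path:
        return None  # nothing to resume from: train from scratch
    if not os.path.isfile(os.path.join(ckpt_path, STATE_FILE)):
        # List the directory contents to make the mismatch easy to spot.
        present = sorted(os.listdir(ckpt_path)) if os.path.isdir(ckpt_path) else []
        raise FileNotFoundError(
            f"{STATE_FILE} not found in {ckpt_path!r}; directory contains: {present}"
        )
    return ckpt_path


# Hypothetical use in the modified script:
# trainer.train(resume_from_checkpoint=resolve_resume_path(args.ckpt_path))
```

Passing None to resume_from_checkpoint makes the Trainer start a fresh run, so the same call would cover both the initial and the resumed command.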

The command I used to run the script:

# For training
# (Only 5 training steps are set for a quicker run)
TOKENIZERS_PARALLELISM=false \
torchrun \
--nproc_per_node=1 \
finetune_hf_trainer_docvqa.py \
--bf16 \
--use_qlora \
--batch_size 1 \
--lora_rank 1 \
--lora_alpha_ratio 1 \
--num_train_epochs 1 \
--max_steps 5 \
--save_steps 1

# For resuming
# (max_steps is set to 10 > 5 for further training)
TOKENIZERS_PARALLELISM=false \
torchrun \
--nproc_per_node=1 \
finetune_hf_trainer_docvqa.py \
--bf16 \
--use_qlora \
--batch_size 1 \
--lora_rank 1 \
--lora_alpha_ratio 1 \
--num_train_epochs 1 \
--max_steps 10 \
--save_steps 1 \
--ckpt_path "Phi-3CookBook/code/04.Finetuning/vision_finetuning/output"
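
The intent behind raising max_steps from 5 to 10 can be sketched as follows. This assumes that "trainer_state.json" is a JSON file recording the last completed step under a "global_step" key, which is how the Hugging Face Trainer writes it. remaining_steps is a hypothetical helper used only for illustration.

```python
import json
import os


def remaining_steps(checkpoint_dir, max_steps):
    """Steps left to run, based on the global_step recorded in trainer_state.json."""
    state_path = os.path.join(checkpoint_dir, "trainer_state.json")
    with open(state_path, encoding="utf-8") as f:
        state = json.load(f)
    # A resumed run continues from global_step and stops once max_steps is reached.
    return max(max_steps - int(state["global_step"]), 0)
```

With a checkpoint saved at step 5, a resumed run with max_steps 10 would have 5 steps left, while one with max_steps 5 would have none.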

After training, the checkpoint folder contains these files:

However, the file "trainer_state.json" is not among them. I looked for a way to generate "trainer_state.json" but did not find one.

To summarize, when using the provided script finetune_hf_trainer_docvqa.py to finetune phi-3.5-vision with QLoRA, the file "trainer_state.json" is not saved, which makes it fail to resume a previous training run. So I want to ask how to make the trainer save the "trainer_state.json" file, or what the expected way is to resume training from a LoRA adapter checkpoint, including the training/optimizer states.

Thank you.

May 23 '25 13:05 tommycwh