Support resuming of DeepSpeed + LoRA + offloading
This PR is an upstream version of @kazemf78's PR to support resuming LoRA training when using DeepSpeed.
Without setting load_module_strict=False as the default, the checkpoint is not loaded, because the LoRA checkpoint does not contain all of the model's weights, and DeepSpeed resume fails with "Error(s) in loading state_dict for PeftModelForCausalLM".
Related discussion: https://github.com/huggingface/peft/issues/746
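For context, here is a minimal sketch of where the flag applies, assuming a standalone DeepSpeed engine wrapping a PEFT/LoRA causal LM; the model name, DeepSpeed config path, and checkpoint directory are placeholders rather than anything from this PR:

```python
# Minimal sketch of the resume path this PR targets, assuming a standalone
# DeepSpeed engine wrapping a PEFT/LoRA model. The saved module state only
# contains the LoRA adapter weights, so strict state_dict loading fails;
# load_module_strict=False relaxes that check.
# "my-base-model", "ds_config.json" and "checkpoints/run-1" are placeholders.
import deepspeed
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("my-base-model")
model = get_peft_model(base, LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16))

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config="ds_config.json",  # e.g. ZeRO-3 with CPU offload
)

# With the default load_module_strict=True this raises:
#   Error(s) in loading state_dict for PeftModelForCausalLM
# because only the trainable LoRA parameters were saved.
engine.load_checkpoint("checkpoints/run-1", load_module_strict=False)
```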
cc @pacman100 @younesbelkada
Could you please provide any updates on this PR?
Sure @thepowerfuldeez! @pacman100 is currently working on fixing issues with respect to DeepSpeed and providing working scripts that you can run out of the box: https://github.com/huggingface/peft/pull/1489. We'll review this PR ASAP with Sourab!
Hello, this has already been fixed in https://github.com/huggingface/transformers/pull/28746. I ran experiments today and can confirm that resuming training works when using PEFT + DeepSpeed.
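For anyone landing here, a minimal sketch of resuming through the transformers Trainer after that fix, assuming a toy dataset and placeholder model/config names (launched under the deepspeed launcher):

```python
# Minimal sketch of resuming PEFT + DeepSpeed training via the transformers
# Trainer, as fixed by https://github.com/huggingface/transformers/pull/28746.
# "my-base-model" and "ds_config.json" are placeholders; run under the
# deepspeed launcher, e.g. `deepspeed train.py`.
from torch.utils.data import Dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments


class ToyDataset(Dataset):
    """Tiny stand-in dataset so the sketch is self-contained."""

    def __init__(self, tokenizer, n=16):
        enc = tokenizer("hello world", return_tensors="pt")
        ids = enc["input_ids"][0]
        self.items = [
            {"input_ids": ids, "attention_mask": enc["attention_mask"][0], "labels": ids.clone()}
            for _ in range(n)
        ]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        return self.items[i]


tokenizer = AutoTokenizer.from_pretrained("my-base-model")
base = AutoModelForCausalLM.from_pretrained("my-base-model")
model = get_peft_model(base, LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16))

args = TrainingArguments(
    output_dir="outputs/lora-run",   # checkpoints (including LoRA weights) land here
    deepspeed="ds_config.json",      # placeholder DeepSpeed config
    per_device_train_batch_size=1,
    save_steps=10,
    max_steps=20,
)

trainer = Trainer(model=model, args=args, train_dataset=ToyDataset(tokenizer))

# On a later run, resume_from_checkpoint=True picks up the latest checkpoint in
# output_dir; with the linked fix the LoRA-only state is restored without a
# strict state_dict mismatch.
trainer.train(resume_from_checkpoint=True)
```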
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I'd like to bump this. I'm running into this issue on my project and it's causing significant delays. Does this branch solve the issue but remain insufficiently tested to merge into main, or is a solution still to be found?
Hi @ambroser53! I haven't tested this branch against the latest upstream, but it should work.
Please see my comments above; the transformers PR https://github.com/huggingface/transformers/pull/28746 should already fix this.