Support resuming of DeepSpeed + LoRA + offloading

thepowerfuldeez opened this pull request 1 year ago

This PR is an upstream version of @kazemf78's PR to support resuming LoRA training when using DeepSpeed. Without setting `load_module_strict=False` as the default, the checkpoint cannot be loaded on resume, because a LoRA checkpoint does not contain all of the model's weights, and DeepSpeed raises `Error(s) in loading state_dict for PeftModelForCausalLM`.

Related discussion: https://github.com/huggingface/peft/issues/746
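
For context, here is a minimal sketch of the behavior in question, using DeepSpeed's engine API (the helper function name is illustrative, not from this PR):

```python
import deepspeed

def resume_lora_checkpoint(engine, checkpoint_dir):
    # A LoRA checkpoint contains only the adapter weights, so strict
    # state_dict loading fails with the error quoted above.
    # load_module_strict=False lets DeepSpeed skip the keys that are
    # missing from the checkpoint (the frozen base-model weights).
    load_path, client_state = engine.load_checkpoint(
        checkpoint_dir,
        load_module_strict=False,
    )
    return load_path, client_state
```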

thepowerfuldeez avatar Feb 14 '24 08:02 thepowerfuldeez

cc @pacman100 @younesbelkada

amyeroberts avatar Feb 14 '24 11:02 amyeroberts

Could you please provide any updates on this PR?

kazemf78 avatar Feb 20 '24 22:02 kazemf78

Sure @thepowerfuldeez! @pacman100 is currently working on fixing issues with respect to DeepSpeed and providing working scripts that you can run out of the box: https://github.com/huggingface/peft/pull/1489. We'll review this PR ASAP with Sourab!

younesbelkada avatar Feb 21 '24 02:02 younesbelkada

Hello, this has already been fixed in https://github.com/huggingface/transformers/pull/28746. I ran experiments today and can confirm that resuming training when using PEFT + DeepSpeed works.
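
For anyone who wants to reproduce this, a hedged sketch of the resume path; the model name, DeepSpeed config file, and dataset are placeholders, not taken from this thread:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Placeholder model and LoRA config, for illustration only.
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM"))

args = TrainingArguments(
    output_dir="outputs",
    deepspeed="ds_config.json",  # assumed ZeRO config, possibly with offloading
    per_device_train_batch_size=1,
    max_steps=10,
)
# train_dataset: any tokenized dataset; omitted here for brevity.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# Resumes from the latest checkpoint in output_dir, restoring the LoRA
# adapter weights and the DeepSpeed optimizer/engine state.
trainer.train(resume_from_checkpoint=True)
```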

pacman100 avatar Feb 21 '24 13:02 pacman100

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Mar 18 '24 08:03 github-actions[bot]

I'd like to bump this. I'm running into this issue on my project and it's causing significant delays. Does this branch solve the issue but isn't tested enough for merging to main, or is a solution still to be found?

ambroser53 avatar Mar 18 '24 10:03 ambroser53

Hi @ambroser53! I haven't tested this branch against the latest upstream, but it should work.

thepowerfuldeez avatar Mar 18 '24 10:03 thepowerfuldeez

> I'd like to bump this. I'm running into this issue on my project and it's causing significant delays. Does this branch solve the issue but isn't tested enough for merging to main, or is a solution still to be found?

Please see my comments above; the transformers PR https://github.com/huggingface/transformers/pull/28746 should already fix this.
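
For reference, a hedged sketch of the kind of guard such a fix introduces when resuming under DeepSpeed (the actual diff in that PR may differ):

```python
from peft import PeftModel

def load_deepspeed_checkpoint(engine, checkpoint_dir, model):
    # Strict loading only makes sense when the checkpoint contains the
    # full state_dict; a PEFT checkpoint holds adapter weights only.
    strict = not isinstance(model, PeftModel)
    return engine.load_checkpoint(checkpoint_dir, load_module_strict=strict)
```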

pacman100 avatar Mar 18 '24 12:03 pacman100