Support resuming of DeepSpeed + LoRA + offloading

thepowerfuldeez opened this pull request 1 year ago

This PR is an upstream version of @kazemf78's PR to support resuming LoRA training when using DeepSpeed. Without setting `load_module_strict=False` as the default, the checkpoint cannot be loaded on resume, because a LoRA checkpoint does not contain all of the model's weights, and DeepSpeed raises `Error(s) in loading state_dict for PeftModelForCausalLM`.

Related discussion: https://github.com/huggingface/peft/issues/746
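
For context, here is a minimal sketch of the behavior in question, using DeepSpeed's engine API (the helper function name is illustrative, not from this PR):

```python
import deepspeed

def resume_lora_checkpoint(engine, checkpoint_dir):
    # A LoRA checkpoint contains only the adapter weights, so strict
    # state_dict loading fails with the error quoted above.
    # load_module_strict=False lets DeepSpeed skip the keys that are
    # missing from the checkpoint (the frozen base-model weights).
    load_path, client_state = engine.load_checkpoint(
        checkpoint_dir,
        load_module_strict=False,
    )
    return load_path, client_state
```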

thepowerfuldeez avatar Feb 14 '24 08:02 thepowerfuldeez

cc @pacman100 @younesbelkada

amyeroberts avatar Feb 14 '24 11:02 amyeroberts

Could you please provide any updates on this PR?

kazemf78 avatar Feb 20 '24 22:02 kazemf78

Sure @thepowerfuldeez! @pacman100 is currently working on fixing issues with respect to DeepSpeed and providing working scripts that you can run out of the box: https://github.com/huggingface/peft/pull/1489. We'll review this PR ASAP with Sourab!

younesbelkada avatar Feb 21 '24 02:02 younesbelkada

Hello, this has already been fixed in https://github.com/huggingface/transformers/pull/28746. I ran experiments today and can confirm that resuming training when using PEFT + DeepSpeed works.
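
For anyone who wants to reproduce this, a hedged sketch of the resume path; the model name, DeepSpeed config file, and dataset are placeholders, not taken from this thread:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Placeholder model and LoRA config, for illustration only.
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM"))

args = TrainingArguments(
    output_dir="outputs",
    deepspeed="ds_config.json",  # assumed ZeRO config, possibly with offloading
    per_device_train_batch_size=1,
    max_steps=10,
)
# train_dataset: any tokenized dataset; omitted here for brevity.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# Resumes from the latest checkpoint in output_dir, restoring the LoRA
# adapter weights and the DeepSpeed optimizer/engine state.
trainer.train(resume_from_checkpoint=True)
```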

pacman100 avatar Feb 21 '24 13:02 pacman100

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Mar 18 '24 08:03 github-actions[bot]

I'd like to bump this. I'm running into this issue on my project and it's causing significant delays. Does this branch solve the issue but isn't tested enough for merging to main, or is a solution still to be found?

ambroser53 avatar Mar 18 '24 10:03 ambroser53

Hi @ambroser53! I haven't tested this branch against the latest upstream, but it should work.

thepowerfuldeez avatar Mar 18 '24 10:03 thepowerfuldeez

> I'd like to bump this. I'm running into this issue on my project and it's causing significant delays. Does this branch solve the issue but isn't tested enough for merging to main, or is a solution still to be found?

Please see my comments above; the transformers PR https://github.com/huggingface/transformers/pull/28746 should already fix this.
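
For reference, a hedged sketch of the kind of guard such a fix introduces when resuming under DeepSpeed (the actual diff in that PR may differ):

```python
from peft import PeftModel

def load_deepspeed_checkpoint(engine, checkpoint_dir, model):
    # Strict loading only makes sense when the checkpoint contains the
    # full state_dict; a PEFT checkpoint holds adapter weights only.
    strict = not isinstance(model, PeftModel)
    return engine.load_checkpoint(checkpoint_dir, load_module_strict=strict)
```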

pacman100 avatar Mar 18 '24 12:03 pacman100