lw3259111
> @lw3259111 Just trying to get a clarification -- are you training your own llama 33B using deepspeed?

Yes, I use deepspeed.
> A takeaway if you want to impl the save weight yourself:
>
> https://github.com/lm-sys/FastChat/blob/4d33cde2322544532ab940ed1ece1f82d77fe18c/fastchat/train/train_lora.py#L55-L60
>
> But I think hf should have the same code

I have rewritten...
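For illustration, the pattern the linked `train_lora.py` lines follow is roughly: if a parameter is ZeRO-partitioned, gather the full weight first, then take an independent copy before writing it to disk. A minimal, hedged sketch of that pattern using plain Python objects — `FakeZeroParam`, its `gather` method, and `maybe_gather_and_clone` are all hypothetical stand-ins, not deepspeed APIs:

```python
import copy

class FakeZeroParam:
    """Hypothetical stand-in for a ZeRO-partitioned parameter: each rank
    holds only a shard, and a gather step is needed to see the full weight."""
    def __init__(self, shards):
        self.ds_id = 0  # deepspeed marks partitioned params with a ds_id attribute
        self.shards = shards

    def gather(self):
        # Pretend all-gather: concatenate the shards into the full weight.
        return [x for shard in self.shards for x in shard]

def maybe_gather_and_clone(param):
    """If the param looks partitioned (has ds_id), gather it first; either
    way, return an independent copy that is safe to write to disk."""
    full = param.gather() if hasattr(param, "ds_id") else param
    return copy.deepcopy(full)

state = {
    "lora_A": FakeZeroParam([[1.0, 2.0], [3.0, 4.0]]),
    "bias": [0.5, 0.5],
}
to_save = {k: maybe_gather_and_clone(v) for k, v in state.items()}
print(to_save["lora_A"])  # [1.0, 2.0, 3.0, 4.0]
```

In real deepspeed code the gather-and-copy step would be done with `deepspeed.zero.GatheredParameters` and `param.data.cpu().clone()`; the sketch only shows the control flow.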
> oh, thank you! now that you're showing the actual file sizes, it's much easier to see what you're talking about. Indeed this looks wrong.

I have seen...
@ArvinZhuang I use the llama 33B model and the deepspeed config is:

```
{
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": ...
```
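For context, the `"auto"` entries in a config of this shape are placeholders that the training framework fills from its own arguments. A minimal sketch of that resolution step, assuming a config fragment of the same shape as above — the concrete values in `trainer_args` and the `resolve_auto` helper are hypothetical, not the actual HF Trainer implementation:

```python
import json

# Hypothetical minimal config fragment of the same shape as the one above;
# "auto" fields are meant to be filled in by the training framework.
cfg = json.loads("""
{
  "bf16": {"enabled": "auto"},
  "optimizer": {
    "type": "AdamW",
    "params": {"lr": "auto", "betas": "auto", "eps": "auto"}
  }
}
""")

# Hypothetical trainer-side values used to resolve the "auto" placeholders.
trainer_args = {"enabled": True, "lr": 2e-5, "betas": [0.9, 0.999], "eps": 1e-8}

def resolve_auto(node, args):
    """Recursively replace "auto" leaves using the matching key in args."""
    if isinstance(node, dict):
        return {k: (args.get(k) if v == "auto" else resolve_auto(v, args))
                for k, v in node.items()}
    return node

resolved = resolve_auto(cfg, trainer_args)
print(resolved["optimizer"]["params"]["lr"])  # 2e-05
```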
> Please note the discussion continues here: [microsoft/DeepSpeed#3303 (comment)](https://github.com/microsoft/DeepSpeed/issues/3303#issuecomment-1516798523)
>
> We understand well the cause of the problem - explained at [microsoft/DeepSpeed#3303 (comment)](https://github.com/microsoft/DeepSpeed/issues/3303#issuecomment-1516801635)
>
> This impacts only z1/z2...
> please reread the comment you quoted - it says `clone` and then optionally move to cpu. Your code is missing the key operation.

I am using the following code, ...
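The point about `clone` can be illustrated without deepspeed: if the checkpoint dict stores references to the live buffers, later in-place updates by the training loop silently change the "saved" weights. A minimal sketch, with plain Python lists standing in for tensors (in torch the clone step would be `param.data.cpu().clone()`):

```python
import copy

live_weights = {"w": [1.0, 2.0, 3.0]}

# Missing the key operation: this "checkpoint" holds references to the
# live buffers, so later in-place updates corrupt it.
ckpt_by_ref = dict(live_weights)

# With the clone (deep copy here), the checkpoint is an independent snapshot.
ckpt_cloned = {k: copy.deepcopy(v) for k, v in live_weights.items()}

# Simulate the optimizer updating the weights in place after "saving".
live_weights["w"][0] = 99.0

print(ckpt_by_ref["w"][0])   # 99.0 -- the by-reference snapshot drifted
print(ckpt_cloned["w"][0])   # 1.0  -- the cloned snapshot is intact
```

Moving the clone to CPU (the "optionally move to cpu" part) additionally frees GPU memory and keeps the checkpoint valid even after the engine repartitions the parameter.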
@amyeroberts

```
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 4.28.0.dev0
- Platform: Linux-4.15.0-208-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub ...
```
@amyeroberts I want to load the `LlamaForCausalLM` model, and the same error has been reported at the following links:

https://github.com/tatsu-lab/stanford_alpaca/issues/61#issuecomment-1504117715
https://github.com/lm-sys/FastChat/issues/351
@amyeroberts Thank you for your reply. I will reply to your questions one by one:

- git-lfs has been installed on my machine
- my transformers version is 4.28.0.dev0, see https://github.com/tatsu-lab/stanford_alpaca/issues/61#issuecomment-1504459664 ...