youngrok cha
### Bug description
I tried to train a huggingface transformers model with deepspeed_stage3, but when I load the model from a checkpoint as in the code below, an error occurs. I think the checkpoint and model...
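The inline code from the report was truncated; below is a minimal sketch of the kind of loading pattern it likely refers to, assuming a standard `from_pretrained` call combined with a ZeRO stage-3 Trainer config. The checkpoint path and config filename are illustrative placeholders, not the reporter's actual values.

```python
# Hypothetical reproduction sketch (the original snippet was cut off).
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Resume from a saved checkpoint. Under ZeRO stage 3 the parameters are
# sharded across ranks, so full-sized checkpoint tensors can mismatch the
# partitioned parameters if loading happens outside DeepSpeed's init path.
model = AutoModelForCausalLM.from_pretrained("output/checkpoint-1000")

args = TrainingArguments(
    output_dir="output",
    deepspeed="ds_config_zero3.json",  # illustrative stage-3 config file
)
trainer = Trainer(model=model, args=args)  # dataset wiring omitted
trainer.train(resume_from_checkpoint="output/checkpoint-1000")
```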
I've run a number of experiments and it looks like most of the performance comes from enabling pos_shift.
```
python examples/eval_long_ppl.py --model_name_or_path lmsys/vicuna-13b-v1.3 --num_samples 8
6.840701103210449
python examples/eval_long_ppl.py --model_name_or_path...
```
Merging with v0.1.2 results in a really small checkpoint, about 1/4 the size for me, even though I set type = float32. v0.1.1 works fine, though.
Any plan to support multi-GPU use? I'm totally new to this project, so I'm not sure about this, but it doesn't seem to support multi-GPU for now.
## Summary
Tries to fix https://github.com/linkedin/Liger-Kernel/issues/439. As the above issue describes, chunking the hidden state across the batch dimension has limited benefit. Therefore I chunk the hidden state across the (batch*seq_len) dimension instead, as in the sketch below. As it requires...
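A minimal sketch of the chunking strategy described above, assuming hidden states of shape (batch, seq_len, hidden) and a projection weight of shape (vocab, hidden); the function name and chunk size are illustrative, not the PR's actual implementation.

```python
import torch

def chunked_projection(hidden_states: torch.Tensor,
                       weight: torch.Tensor,
                       chunk_size: int = 1024) -> torch.Tensor:
    """Project hidden states in chunks along the flattened (batch*seq_len) dim.

    Chunking across the batch dimension alone caps the number of chunks at
    the batch size; flattening batch and sequence first allows much
    finer-grained chunks and steadier peak memory.
    """
    B, S, H = hidden_states.shape
    flat = hidden_states.reshape(B * S, H)        # (batch*seq_len, hidden)
    outputs = []
    for chunk in flat.split(chunk_size, dim=0):   # fine-grained chunks
        outputs.append(chunk @ weight.t())        # e.g. lm_head projection
    return torch.cat(outputs, dim=0).reshape(B, S, -1)
```

With a small batch and a long sequence, batch-wise chunking yields only B chunks, while flattening yields ceil(B*S / chunk_size) chunks, which is where the memory benefit comes from.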
After this commit (https://github.com/deepspeedai/DeepSpeed/pull/4906), secondary partitioned tensors are updated only after optimizer.step(). When loading a state_dict or resizing embeddings after init, the secondary partitioned tensors should also be updated. e.g., https://github.com/huggingface/transformers/blob/1c4b62b219323a31011bac3bd3cece7675d9e4c3/src/transformers/integrations/deepspeed.py#L344
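A hedged sketch of the scenario being described, assuming ZeRO-3 with secondary (hierarchical) partitioning enabled. `deepspeed.zero.GatheredParameters` is a real DeepSpeed API; the surrounding function and weight source are illustrative.

```python
import torch
import deepspeed

def overwrite_embeddings(model, new_weights: torch.Tensor):
    emb = model.get_input_embeddings().weight
    # Gather the full parameter, modify it on rank 0, and let DeepSpeed
    # re-partition the primary shards on context exit.
    with deepspeed.zero.GatheredParameters(emb, modifier_rank=0):
        if torch.distributed.get_rank() == 0:
            emb.data.copy_(new_weights)
    # Per this issue: after PR #4906 the *secondary* partitioned tensor is
    # only refreshed in optimizer.step(), so a forward pass before any step
    # may still read stale secondary shards unless they are updated here too.
```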
### System Info
OS: Ubuntu 22.04.5 LTS
Python version: 3.11.11
GPU: A100-80GB
Driver version: 565.57.01
CUDA version: 12.7
bitsandbytes version: 0.45.5

### Reproduction
```python
a = torch.tensor([i / 10 for...
```