youngrok cha
### Bug description
I tried to train a huggingface transformers model with deepspeed_stage3, but when I load the model from a checkpoint as in the code below, an error occurs. I think the checkpoint and model...
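The inline code from the report was truncated; below is a minimal sketch of the kind of loading pattern it likely refers to, assuming a standard `from_pretrained` call combined with a ZeRO stage-3 Trainer config. The checkpoint path and config filename are illustrative placeholders, not the reporter's actual values.

```python
# Hypothetical reproduction sketch (the original snippet was cut off).
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Resume from a saved checkpoint. Under ZeRO stage 3 the parameters are
# sharded across ranks, so full-sized checkpoint tensors can mismatch the
# partitioned parameters if loading happens outside DeepSpeed's init path.
model = AutoModelForCausalLM.from_pretrained("output/checkpoint-1000")

args = TrainingArguments(
    output_dir="output",
    deepspeed="ds_config_zero3.json",  # illustrative stage-3 config file
)
trainer = Trainer(model=model, args=args)  # dataset wiring omitted
trainer.train(resume_from_checkpoint="output/checkpoint-1000")
```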
I've run a number of experiments and it looks like most of the performance comes from enabling pos_shift.
```
python examples/eval_long_ppl.py --model_name_or_path lmsys/vicuna-13b-v1.3 --num_samples 8
6.840701103210449
python examples/eval_long_ppl.py --model_name_or_path...
```
Merging with v0.1.2 results in a really small checkpoint, about 1/4 the size for me, even though I set type = float32. v0.1.1 works fine, though.
Any plan to support multi-GPU use? I'm totally new to this project, so I'm not sure about this, but it doesn't seem to support multi-GPU for now.
## Summary
Tries to fix https://github.com/linkedin/Liger-Kernel/issues/439. As the above issue describes, chunking the hidden state across the batch dimension has limited benefit. Therefore I chunk the hidden state across the (batch*seq_len) dimension instead, as in the sketch below. As it requires...
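A minimal sketch of the chunking strategy described above, assuming hidden states of shape (batch, seq_len, hidden) and a projection weight of shape (vocab, hidden); the function name and chunk size are illustrative, not the PR's actual implementation.

```python
import torch

def chunked_projection(hidden_states: torch.Tensor,
                       weight: torch.Tensor,
                       chunk_size: int = 1024) -> torch.Tensor:
    """Project hidden states in chunks along the flattened (batch*seq_len) dim.

    Chunking across the batch dimension alone caps the number of chunks at
    the batch size; flattening batch and sequence first allows much
    finer-grained chunks and steadier peak memory.
    """
    B, S, H = hidden_states.shape
    flat = hidden_states.reshape(B * S, H)        # (batch*seq_len, hidden)
    outputs = []
    for chunk in flat.split(chunk_size, dim=0):   # fine-grained chunks
        outputs.append(chunk @ weight.t())        # e.g. lm_head projection
    return torch.cat(outputs, dim=0).reshape(B, S, -1)
```

With a small batch and a long sequence, batch-wise chunking yields only B chunks, while flattening yields ceil(B*S / chunk_size) chunks, which is where the memory benefit comes from.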
After this commit (https://github.com/deepspeedai/DeepSpeed/pull/4906), secondary partitioned tensors are updated only after optimizer.step(). When loading a state_dict or resizing embeddings after init, the secondary partitioned tensors should also be updated. e.g., https://github.com/huggingface/transformers/blob/1c4b62b219323a31011bac3bd3cece7675d9e4c3/src/transformers/integrations/deepspeed.py#L344
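A hedged sketch of the scenario being described, assuming ZeRO-3 with secondary (hierarchical) partitioning enabled. `deepspeed.zero.GatheredParameters` is a real DeepSpeed API; the surrounding function and weight source are illustrative.

```python
import torch
import deepspeed

def overwrite_embeddings(model, new_weights: torch.Tensor):
    emb = model.get_input_embeddings().weight
    # Gather the full parameter, modify it on rank 0, and let DeepSpeed
    # re-partition the primary shards on context exit.
    with deepspeed.zero.GatheredParameters(emb, modifier_rank=0):
        if torch.distributed.get_rank() == 0:
            emb.data.copy_(new_weights)
    # Per this issue: after PR #4906 the *secondary* partitioned tensor is
    # only refreshed in optimizer.step(), so a forward pass before any step
    # may still read stale secondary shards unless they are updated here too.
```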
### System Info
OS: Ubuntu 22.04.5 LTS
Python version: 3.11.11
GPU: A100-80GB
Driver version: 565.57.01
CUDA version: 12.7
bitsandbytes version: 0.45.5

### Reproduction
```python
a = torch.tensor([i / 10 for...
```