DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Comment in `setup.py` had an incorrect environment variable name to set when building a wheel with a `.dev` specifier.
Hi, I'm trying to do inference with a GPT-3-like model. When I offloaded the parameters, GPU memory usage dropped as I expected. I would like to investigate memory usage and latency by...
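For context, parameter offloading of this kind is enabled through the DeepSpeed config's `zero_optimization` section. The snippet below is a minimal sketch of a ZeRO-3 config with parameters offloaded to CPU memory; the field names follow the standard DeepSpeed config schema, but the specific values are illustrative, not taken from this issue.

```python
# Minimal sketch of a DeepSpeed config with ZeRO-3 parameter offload to CPU.
# Field names follow the DeepSpeed config schema; values are illustrative.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                 # partition parameters, gradients, and optimizer state
        "offload_param": {
            "device": "cpu",        # keep full parameters in host memory
            "pin_memory": True,     # pinned buffers speed up host<->GPU transfers
        },
    },
}
```

With a config like this, parameters live in host memory and are streamed to the GPU only when needed, which is why GPU memory usage drops at the cost of transfer latency.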
Hello everyone, I've always wanted to run large models using minimal GPUs, as I only have a few at my disposal. That is why I was impressed that ZeRO-3 can...
**Describe the bug** There is a problem with asynchronous communication in ZeRO stage 2 when using `overlap_comm`. **To Reproduce** Steps to reproduce the behavior: Use DeepSpeed ZeRO-2 with the Hugging Face...
**Describe the bug** A clear and concise description of what the bug is. **To Reproduce** Steps to reproduce the behavior: 1. use config below to train ``` "train_batch_size": 64, "gradient_accumulation_steps":...
```
/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py:307

304 │     max([
305 │         max(tensor.numel(),
306 │ ...
```
**Describe the bug** I train the model with ZeRO-2 for multi-node training and save the model with `model.save_checkpoint`. When I try to get the state dict from `get_fp32_state_dict_from_zero_checkpoint`, it reports...
**Describe the bug** The CPU memory usage stays the same if I use 1, 2, or 4 GPUs; however, if I use 8 GPUs, the CPU memory usage increases a lot and makes the host...
I am training a 10B model using DeepSpeed with Megatron on A100 GPUs (80 GB). Here is my ds_report. If I use 4 GPUs, the error is CUDA out of memory...
This is an optimization of the reduce behavior in ZeRO stage 2. ZeRO stage 2 assigns ranks to trainable parameters at the initialization stage, distributing the parameters of each parameter group evenly...
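As a rough illustration of what "distributing parameters evenly across ranks" can mean, the sketch below assigns each parameter to the currently least-loaded rank by element count. This is a hypothetical, self-contained sketch for intuition, not DeepSpeed's actual partitioning code.

```python
# Hypothetical sketch: balance parameters across ranks by element count,
# so each rank reduces a roughly equal share of the gradients.
# This illustrates the idea only; it is not DeepSpeed's implementation.

def partition_params(param_sizes, world_size):
    """Greedily assign each parameter to the rank with the fewest
    elements so far, returning (rank per parameter, load per rank)."""
    totals = [0] * world_size      # elements assigned to each rank
    assignment = []                # chosen rank for each parameter
    for size in param_sizes:
        rank = totals.index(min(totals))  # least-loaded rank
        assignment.append(rank)
        totals[rank] += size
    return assignment, totals

# Example: six parameters of varying sizes split across 2 ranks.
ranks, loads = partition_params([100, 50, 50, 100, 25, 25], world_size=2)
```

A greedy least-loaded assignment like this keeps the per-rank reduce workload balanced even when parameter sizes vary widely, which is the motivation behind optimizing the stage-2 reduce behavior.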