DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
**Describe the bug** Many modern transformer components (e.g., RoPE, certain Layer Norm setups) need to be stored and run in FP32. Most of the time, we can accomplish this by...
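The precision issue this report alludes to can be illustrated without a GPU. Below is a minimal, hypothetical sketch (plain Python, not DeepSpeed code) that simulates bfloat16 by truncating the float32 mantissa and shows how a RoPE position angle drifts outside FP32; the `to_bf16` helper and the chosen position/frequency are illustrative assumptions.

```python
import math
import struct

def to_bf16(x: float) -> float:
    """Simulate bfloat16 by truncating the float32 mantissa.

    This keeps the sign, exponent, and top 7 mantissa bits,
    which is what bfloat16 stores (simulation only)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# RoPE rotates query/key pairs by angle = position * inv_freq.
# At large positions the angle is large, and rounding both factors
# to bf16 shifts the angle enough to visibly change cos/sin.
position = 100_000
inv_freq = 1.0 / 10_000.0  # one of the standard RoPE inverse frequencies

angle_fp32 = position * inv_freq
angle_bf16 = to_bf16(position) * to_bf16(inv_freq)

err = abs(math.cos(angle_fp32) - math.cos(angle_bf16))
print(f"angle error: {abs(angle_fp32 - angle_bf16):.4f}, cos error: {err:.4f}")
```

This is why such components are typically kept in FP32 even when the rest of the model trains in mixed precision.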
I am trying to use the universal checkpoint conversion code, `python ds_to_universal.py `, but I get an error saying a layer number can't be found. I'm not sure why, but I...
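For context, the conversion step being attempted usually looks like the sketch below (flag names as in DeepSpeed's checkpoint utilities; the paths are placeholders, and the exact options depend on how the checkpoint was saved):

```shell
# Convert a ZeRO checkpoint into DeepSpeed's universal format.
# Point --input_folder at the global_step directory that holds
# the ZeRO shard files (paths here are placeholders).
python ds_to_universal.py \
    --input_folder  /path/to/checkpoints/global_step1000 \
    --output_folder /path/to/checkpoints/global_step1000_universal
```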
**Describe the bug** I want to train a Dolly 2.0 2.8B model using DeepSpeed, but the terminal output is always the same. Did I miss something? Without DeepSpeed it...
**Describe the bug** I'm using the DeepSpeed MoE layer to build a multi-modal LLM. Phi-3 is the base model, and I replaced its MLP layers with the DeepSpeed MoE layer....
pp_size = 8. Stage 0 contains a 45-layer vision encoder; stages 1~7 contain 56 decoder layers. ZeRO stage 0 works fine, but ZeRO stage 1 with bf16/fp16 fails much...
I am experiencing excessive CPU and GPU memory usage when running multi-GPU inference with DeepSpeed. Specifically, the memory usage does not scale as expected when increasing the number of GPUs....
**Describe the bug** When using pipelining (with or without `LayerSpec` inside `PipelineModule`), the first GPU seems to have a considerably higher memory consumption, compared to the other ones. This is...
**Describe the bug** Hello. I'm an active user of DeepSpeed for multi-node training. I've always used ZeRO-3, but this time I tried enabling the hpz (hierarchical partitioning) feature of ZeRO++ for the...