
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Results 1333 DeepSpeed issues

![image](https://github.com/user-attachments/assets/cd270192-c7d8-4664-8f22-10a1dfa81459) I just don't want the warning and info messages to be printed.

enhancement
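One common way to silence such messages, assuming DeepSpeed emits them through a standard Python logger (the logger name `"DeepSpeed"` is an assumption here), is to raise that logger's threshold so INFO and WARNING records are dropped:

```python
import logging

# Sketch: raise the named logger's level so INFO/WARNING records are
# discarded and only ERROR (and above) still print. The logger name
# "DeepSpeed" is an assumption about the library's logging setup.
logging.getLogger("DeepSpeed").setLevel(logging.ERROR)

logger = logging.getLogger("DeepSpeed")
print(logger.isEnabledFor(logging.WARNING))  # False: warnings suppressed
print(logger.isEnabledFor(logging.ERROR))    # True: errors still pass
```

If the messages come from a child logger, setting the level on the parent shown above still applies, since Python loggers inherit effective levels hierarchically.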

**Describe the bug** I am training an LLM using DeepSpeed on 12 nodes with 8 V100s per node. My training is generally working well (thanks, DeepSpeed), but when I run...

bug
training

Deprecate the redundant sequence_data_parallel_group argument. Users/client code will control which process group ZeRO-3 (Z3) parameters are partitioned across, choosing from [None, data_parallel_group, sequence_data_parallel].

- Removed conditional imports of TritonMLP and TritonSelfAttention from module level
- Implemented lazy imports for Triton modules inside the __init__ method
- This change aims to resolve circular dependency issues...
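The lazy-import pattern described above can be sketched generically: instead of importing an optional dependency at module level (where it can trigger a circular import at load time), defer the import until an instance is actually constructed. The class and module names here are illustrative stand-ins, not the real DeepSpeed code:

```python
# Sketch of the lazy-import pattern. "json" stands in for a heavy or
# optional module (like triton); the wrapper class name is hypothetical.

class SelfAttentionWrapper:
    def __init__(self):
        # Importing inside __init__ means the module is only loaded when
        # an instance is created, which breaks import-time cycles between
        # modules that reference each other.
        import json
        self._backend = json

    def encode(self, obj):
        return self._backend.dumps(obj)

w = SelfAttentionWrapper()
print(w.encode({"ok": 1}))  # {"ok": 1}
```

The trade-off is that import errors surface at first use rather than at module load, so a clear error message in `__init__` is worth adding in real code.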

I want to use the Mixtral 8x7B model for inference, but currently it only supports AutoTP. How can more support be added so it can use other forms of parallelism (e.g. EP,...

enhancement

When I fine-tune the model using DeepSpeed on 2 A800 machines, the log only contains worker1's output, not worker2's. Is there any way to print the loss from worker2? The GPUs on both machines are...
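In torch.distributed/DeepSpeed launches, each worker process is typically given a `RANK` environment variable, and many logging setups print only on rank 0, which matches the symptom above. A minimal sketch of rank-tagged logging from every worker (the helper name is hypothetical):

```python
import os

def log_loss_all_ranks(loss):
    # RANK is set by common distributed launchers; fall back to 0 when the
    # script is run without one. Tagging the message with the rank makes it
    # clear which worker each loss line came from.
    rank = int(os.environ.get("RANK", 0))
    return f"[rank {rank}] loss = {loss:.4f}"

print(log_loss_all_ranks(0.1234))
```

Calling this unconditionally (rather than inside an `if rank == 0:` guard) makes every worker's loss visible, at the cost of noisier logs.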

**Describe the bug** For ZeRO-3, I'm noticing an increase in training times on g5.48xlarge nodes with torch >= 2.3.1 and CUDA 12.1. I can reproduce this with small and large...

bug
training

I am trying DeepSpeed. I read the docs and modified one project to use it, and I got a strange result: 1) Original code without any speed-up. 1 Docker container....

The Nightly CI for https://github.com/microsoft/DeepSpeed/actions/runs/10985541802 failed.

ci-failure