DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
I am testing NVMe offloading when training a model. When I try to save a checkpoint, I am getting (full stack trace below): ``` NotImplementedError: ZeRO-3 does not yet support...
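For context, NVMe offloading with ZeRO-3 is driven by the DeepSpeed JSON config; a minimal sketch of the relevant section is below (the `nvme_path` value is a placeholder, not taken from the report above):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    }
  }
}
```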
Hello, I'm running Windows 10 and would like to install DeepSpeed to speed up inference of GPT-J. My system is the following: ``` Windows 10 cuda 11.6 torch 1.13.0...
Windows 10, Cuda Toolkit = 10.2, PyTorch = 8.1, VisualStudio = 2019. I'm trying to use this with the transformers library and carefully followed the instructions laid out by the...
## Problem Description requirements-sd.txt requires triton==2.0.0.dev20221005, but that version doesn't exist in the PyPI triton release history; the earliest available triton-v2 version is triton==2.0.0.dev20221030, the most similar version name...
**Describe the bug** I was trying to install DeepSpeed on Windows inside a Python virtual environment. I have been told that DeepSpeed has not been tested on Windows so far...
Hello DeepSpeed :) I am trying to use [Pipeline module](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/pipe/engine.py) to train a pipeline parallel model on multiple nodes. I am using Slurm as the cluster scheduler, so I initialized...
I am trying 4-way sharding for "Salesforce/codegen-16B-mono" on 4 A10 chips (24 GiB each). The torch dtype is torch.half. My math told me (please double check): If the sharding is correct, then...
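The memory arithmetic behind the question can be sketched as follows; this is a rough estimate counting only the fp16 parameter weights, ignoring activations, KV cache, and CUDA context overhead:

```python
# Rough weight-memory estimate for a 16B-parameter model in fp16,
# sharded across 4 GPUs (numbers assumed from the setup above).
PARAMS = 16e9            # approximate parameter count of codegen-16B-mono
BYTES_PER_PARAM = 2      # torch.half / fp16
NUM_GPUS = 4             # 4x A10
GPU_MEM_GIB = 24         # A10 memory per chip

total_gib = PARAMS * BYTES_PER_PARAM / 2**30   # ~29.8 GiB of weights total
per_gpu_gib = total_gib / NUM_GPUS             # ~7.5 GiB per GPU if evenly sharded

print(f"total weights: {total_gib:.1f} GiB, per GPU: {per_gpu_gib:.1f} GiB")
# The per-GPU weight footprint fits well within 24 GiB, leaving headroom
# for activations, the KV cache, and allocator fragmentation.
```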
Creates a blog post for the automatic tensor parallelism feature.
Expands the unsupported-model list and adds more checks for a clean error exit.