DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
**Describe the bug** When I use `deepspeed.init_inference` for the Megatron GPT-3 MoE model in the [megatron repo](https://github.com/microsoft/Megatron-DeepSpeed), an error occurs. There is no problem when I use `deepspeed.init` instead, as is done in the training...
This is a follow-up PR to the existing one: https://github.com/microsoft/DeepSpeed/pull/2127 The goal is to keep this PR open and investigate the issue in more detail while the PR above removes...
**Describe the bug** The sample AlexNet example for the profiler does not work. https://www.deepspeed.ai/tutorials/flops-profiler/ **To Reproduce** Copy-paste the example as-is and run it. **Expected behavior** It should produce the three results: flops, macs,...
**Is your feature request related to a problem? Please describe.** The current examples for DeepSpeed inference use the command-line tool 'deepspeed', which internally uses DeepSpeed's launcher modules to initialize the...
Hi, DeepSpeed team. I want to know more details about the example described in [BERT Pre-training](https://www.deepspeed.ai/tutorials/bert-pretraining). It took 8 hr 41 min with 4 DGX-2 systems. I wonder how many...
**Is your feature request related to a problem? Please describe.** DeepSpeed is a library for high-speed training of large models, but most DL developers use Azure VMs and...
I got the following output when I installed using pip:
```
(yzy) C:\Users\hg>pip install deepspeed
Collecting deepspeed
  Using cached deepspeed-0.7.0.tar.gz (629 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  ×...
```
**Describe the bug** When I use ZeRO optimization (stage=3), it spends a lot of time loading the model. I'm trying to fine-tune OPT-66B on 2 nodes; each node contains 8× NVIDIA A100-SXM (80 GB) and 1 TB RAM. I have...
Hi, On a note related to #325, I have added a conda recipe: https://github.com/conda-forge/staged-recipes/pull/14699 But there seems to be a weird bug related to ninja: https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=307963&view=logs&j=6f142865-96c3-535c-b7ea-873d86b887bd&t=22b0682d-ab9e-55d7-9c79-49f3c3ba4823&l=1431 Any help and/or insight...
In optimizer partitioning, the parameters are fused into one big vector, which is then partitioned over workers. So the number of chunks can be much smaller than the number of...
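A minimal sketch of the idea described above, assuming a simplified model (this is illustrative only, not DeepSpeed's actual implementation): parameter tensors are treated as a single flat vector and split into one contiguous chunk per worker, so the chunk count depends on the number of workers, not the number of parameters.

```python
def partition_params(param_sizes, num_workers):
    """Fuse parameters (given by their element counts) into one flat
    vector and split it into num_workers near-equal contiguous chunks.
    Returns a list of (start, end) index ranges, one per worker."""
    total = sum(param_sizes)
    # Distribute any remainder one element at a time to the lowest ranks,
    # so chunk sizes differ by at most one element.
    base, rem = divmod(total, num_workers)
    ranges = []
    start = 0
    for rank in range(num_workers):
        size = base + (1 if rank < rem else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

# Example: three parameters of sizes 10, 7, and 3 fuse into a 20-element
# vector; partitioned over 4 workers, each worker owns 5 elements.
print(partition_params([10, 7, 3], 4))  # [(0, 5), (5, 10), (10, 15), (15, 20)]
```

Note that the worker boundaries ignore the original parameter boundaries entirely, which is why the number of chunks equals the number of workers rather than the (much larger) number of parameter tensors.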