DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Results: 1,333 DeepSpeed issues, sorted by recently updated.

**Describe the bug** As shown in [this notebook](https://gist.github.com/josephrocca/9ec65e8e5804286a475b5b6da85f7a28), I run these commands:

```sh
pip install deepspeed --upgrade
git clone https://github.com/microsoft/DeepSpeedExamples
cd DeepSpeedExamples/model_compression/gpt2
pip install -r requirements.txt
sudo apt-get install ninja-build...
```

bug

The current implementation terminates when `low == high - 1` and in doing so skips checking `high` as a candidate value for `max_micro_batch_size`.
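To illustrate the termination issue, here is a minimal sketch of a binary search for the largest batch size that fits, not DeepSpeed's actual code (`find_max_micro_batch_size` and `fits` are hypothetical names): the loop runs until `low > high`, so the upper bound itself is still tested as a candidate answer rather than being skipped.

```python
def find_max_micro_batch_size(fits, low, high):
    """Find the largest batch size in [low, high] for which fits(batch)
    is True, assuming fits is monotone (True up to some size, then False).
    Illustrative sketch only, not DeepSpeed's implementation."""
    best = low
    while low <= high:
        mid = (low + high) // 2
        if fits(mid):
            best = mid          # mid works; try something larger
            low = mid + 1
        else:
            high = mid - 1      # mid is too big; shrink the range
    return best

# Example: pretend batch sizes up to 8 fit in memory.
print(find_max_micro_batch_size(lambda b: b <= 8, 1, 8))  # -> 8
```

Because the loop condition is `low <= high` rather than `low < high - 1`, the case where the true maximum equals the initial `high` is still found.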

**Describe the bug** A clear and concise description of what the bug is. AssertionError: Distributed backend is not initialized. Please set dist_init_required to True or initialize before calling deepspeed.initialize() **Expected...

bug

As per my understanding, `max_micro_batch_size` does not include the effect of gradient accumulation steps while `max_train_batch_size_per_gpu` does.
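The distinction can be sketched with DeepSpeed's usual batch-size bookkeeping, where `train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * world_size`; the concrete numbers below are illustrative, not from the issue.

```python
# Hedged sketch of the relationship described above, assuming DeepSpeed's
# usual convention:
#   train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * world_size
micro_batch_per_gpu = 4            # per-GPU batch for one forward/backward pass
gradient_accumulation_steps = 8    # "gas": micro-batches accumulated per optimizer step
world_size = 2                     # number of data-parallel GPUs

# max_micro_batch_size bounds only the single forward/backward micro batch...
max_micro_batch_size = micro_batch_per_gpu

# ...while the per-GPU train batch also multiplies in the accumulation steps.
train_batch_size_per_gpu = micro_batch_per_gpu * gradient_accumulation_steps

print(max_micro_batch_size)       # 4
print(train_batch_size_per_gpu)   # 32
```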

The current implementation stops the timer even on steps that are not gradient accumulation boundaries, which artificially inflates the measured throughput when gradient accumulation steps (gas) > 1.
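The intended behavior can be sketched as a timer that only records at accumulation boundaries, so one measured interval spans all the micro-steps of a single optimizer step. This is a minimal illustrative class, not DeepSpeed's actual timer.

```python
import time

class ThroughputTimer:
    """Sketch of boundary-aware throughput timing (not DeepSpeed's code):
    stopping only at gradient accumulation boundaries makes each recorded
    interval cover all `gas` micro-steps of one optimizer step, instead of
    just the final micro-step."""
    def __init__(self):
        self.start_time = None
        self.total_time = 0.0
        self.samples = 0

    def start(self):
        # Only (re)start the clock at the beginning of an optimizer step;
        # mid-accumulation calls leave the running interval untouched.
        if self.start_time is None:
            self.start_time = time.perf_counter()

    def stop(self, is_boundary, batch_size):
        if not is_boundary:
            return  # keep the clock running across accumulation micro-steps
        self.total_time += time.perf_counter() - self.start_time
        self.start_time = None
        self.samples += batch_size

    def throughput(self):
        return self.samples / self.total_time
```

Stopping on every micro-step instead would time only the last micro-step of each optimizer step while crediting the full step's samples, overstating throughput roughly by a factor of gas.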

During training, I would periodically save a checkpoint using `model_engine.save_checkpoint`. However, `model_engine.load_checkpoint` results in this output:

```
[2021-07-08 19:55:42,454] [INFO] [state_dict_factory.py:165:check_ckpt_list] checkpoint file list: ['/home/santosh/deepspeed_checkpoints/secondTest/global_step18825/zero_pp_rank_0_mp_rank_00_model_states.pt']
[2021-07-08 19:55:42,468] [INFO] [state_dict_factory.py:55:load]...
```

**Describe the bug** Hello, I get OOM when loading `facebook/opt-66b` onto GPUs (up to 96 A100-80) using ZeRO-3. I suspect that the model is not being partitioned correctly. I use other...
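For context, a ZeRO stage-3 setup of the kind this issue concerns is selected in the DeepSpeed config JSON; the fragment below is illustrative (the keys are standard DeepSpeed options, but the values are placeholders, not taken from the issue).

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "zero_optimization": {
    "stage": 3,
    "offload_param": { "device": "cpu" }
  }
}
```

With stage 3, parameters are partitioned across the data-parallel ranks, so a failure to partition correctly can cause exactly this kind of OOM despite a large GPU count.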

bug
training

Discussion on #2379 has indicated that there are correctness issues when loading certain models from sharded checkpoints. Should be merged after #2662 @RezaYazdaniAminabadi