DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Results: 1,333 DeepSpeed issues, sorted by recently updated.

**Describe the bug** As shown in [this notebook](https://gist.github.com/josephrocca/9ec65e8e5804286a475b5b6da85f7a28), I run these commands:

```sh
pip install deepspeed --upgrade
git clone https://github.com/microsoft/DeepSpeedExamples
cd DeepSpeedExamples/model_compression/gpt2
pip install -r requirements.txt
sudo apt-get install ninja-build...
```

bug

The current implementation terminates when `low == high - 1` and in doing so skips checking `high` as a candidate value for `max_micro_batch_size`.
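To illustrate the termination issue, here is a minimal sketch of a binary search for the largest batch size that fits, not DeepSpeed's actual code (`find_max_micro_batch_size` and `fits` are hypothetical names): the loop runs until `low > high`, so the upper bound itself is still tested as a candidate answer rather than being skipped.

```python
def find_max_micro_batch_size(fits, low, high):
    """Find the largest batch size in [low, high] for which fits(batch)
    is True, assuming fits is monotone (True up to some size, then False).
    Illustrative sketch only, not DeepSpeed's implementation."""
    best = low
    while low <= high:
        mid = (low + high) // 2
        if fits(mid):
            best = mid          # mid works; try something larger
            low = mid + 1
        else:
            high = mid - 1      # mid is too big; shrink the range
    return best

# Example: pretend batch sizes up to 8 fit in memory.
print(find_max_micro_batch_size(lambda b: b <= 8, 1, 8))  # -> 8
```

Because the loop condition is `low <= high` rather than `low < high - 1`, the case where the true maximum equals the initial `high` is still found.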

**Describe the bug** A clear and concise description of what the bug is. AssertionError: Distributed backend is not initialized. Please set dist_init_required to True or initialize before calling deepspeed.initialize() **Expected...

bug

As per my understanding, `max_micro_batch_size` does not include the effect of gradient accumulation steps while `max_train_batch_size_per_gpu` does.
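The distinction can be sketched with DeepSpeed's usual batch-size bookkeeping, where `train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * world_size`; the concrete numbers below are illustrative, not from the issue.

```python
# Hedged sketch of the relationship described above, assuming DeepSpeed's
# usual convention:
#   train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * world_size
micro_batch_per_gpu = 4            # per-GPU batch for one forward/backward pass
gradient_accumulation_steps = 8    # "gas": micro-batches accumulated per optimizer step
world_size = 2                     # number of data-parallel GPUs

# max_micro_batch_size bounds only the single forward/backward micro batch...
max_micro_batch_size = micro_batch_per_gpu

# ...while the per-GPU train batch also multiplies in the accumulation steps.
train_batch_size_per_gpu = micro_batch_per_gpu * gradient_accumulation_steps

print(max_micro_batch_size)       # 4
print(train_batch_size_per_gpu)   # 32
```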

The current implementation stops the timer even on steps that are not gradient accumulation boundaries, which artificially inflates the measured throughput when gradient accumulation steps (gas) > 1.
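The intended behavior can be sketched as a timer that only records at accumulation boundaries, so one measured interval spans all the micro-steps of a single optimizer step. This is a minimal illustrative class, not DeepSpeed's actual timer.

```python
import time

class ThroughputTimer:
    """Sketch of boundary-aware throughput timing (not DeepSpeed's code):
    stopping only at gradient accumulation boundaries makes each recorded
    interval cover all `gas` micro-steps of one optimizer step, instead of
    just the final micro-step."""
    def __init__(self):
        self.start_time = None
        self.total_time = 0.0
        self.samples = 0

    def start(self):
        # Only (re)start the clock at the beginning of an optimizer step;
        # mid-accumulation calls leave the running interval untouched.
        if self.start_time is None:
            self.start_time = time.perf_counter()

    def stop(self, is_boundary, batch_size):
        if not is_boundary:
            return  # keep the clock running across accumulation micro-steps
        self.total_time += time.perf_counter() - self.start_time
        self.start_time = None
        self.samples += batch_size

    def throughput(self):
        return self.samples / self.total_time
```

Stopping on every micro-step instead would time only the last micro-step of each optimizer step while crediting the full step's samples, overstating throughput roughly by a factor of gas.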

During training, I would periodically save a checkpoint using `model_engine.save_checkpoint`. However, `model_engine.load_checkpoint` results in this output:

```
[2021-07-08 19:55:42,454] [INFO] [state_dict_factory.py:165:check_ckpt_list] checkpoint file list: ['/home/santosh/deepspeed_checkpoints/secondTest/global_step18825/zero_pp_rank_0_mp_rank_00_model_states.pt']
[2021-07-08 19:55:42,468] [INFO] [state_dict_factory.py:55:load]...
```

**Describe the bug** Hello, I get OOM when loading `facebook/opt-66b` onto GPUs (up to 96 A100-80) using ZeRO-3. I suspect that the model is not being partitioned correctly. I use other...
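For context, a ZeRO stage-3 setup of the kind this issue concerns is selected in the DeepSpeed config JSON; the fragment below is illustrative (the keys are standard DeepSpeed options, but the values are placeholders, not taken from the issue).

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "zero_optimization": {
    "stage": 3,
    "offload_param": { "device": "cpu" }
  }
}
```

With stage 3, parameters are partitioned across the data-parallel ranks, so a failure to partition correctly can cause exactly this kind of OOM despite a large GPU count.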

bug
training

Discussion on #2379 has indicated that there are correctness issues when loading certain models from sharded checkpoints. Should be merged after #2662 @RezaYazdaniAminabadi