DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
**Describe the bug** Encountered an illegal memory access when using the inference engine for a RoBERTa model with long sequences (e.g. 512). For...
**Describe the bug** When using the inference engine for a RoBERTa model, the output is unexpected with batch size > 1....
Refactor DeepSpeed Config sub-configs (i.e., activation checkpointing, autotuning, comms, compression, monitor, nebula, and profiling) to use the pydantic library.
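The sub-config refactor above moves validation into typed model classes. A minimal stdlib sketch of that pattern, approximating with dataclasses what pydantic models provide automatically (typed fields, defaults, validation on construction); the class and field names here mirror the activation-checkpointing sub-config but are illustrative, not DeepSpeed's actual classes:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ActivationCheckpointingConfig:
    # Typed fields with defaults stand in for pydantic's Field declarations.
    partition_activations: bool = False
    number_checkpoints: Optional[int] = None

    def __post_init__(self):
        # pydantic would run field validators automatically; with stdlib
        # dataclasses we check invariants by hand after construction.
        if self.number_checkpoints is not None and self.number_checkpoints < 1:
            raise ValueError("number_checkpoints must be >= 1")

cfg = ActivationCheckpointingConfig(partition_activations=True, number_checkpoints=4)
```

The benefit over raw dicts is that a malformed config (e.g. `number_checkpoints=0`) fails loudly at construction time instead of deep inside training.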
### **The code is:** (package versions: transformers==4.21.1, torch==1.11.0, deepspeed==0.6.5, CUDA 11.3, GPU: RTX 3090) ``` import torch from transformers import BertTokenizer, BartForConditionalGeneration, BertModel, BertLMHeadModel from transformers.activations import GELUActivation from deepspeed.profiling.flops_profiler import FlopsProfiler...
May I know why [this training code](https://colab.research.google.com/drive/1v5wY22CkyvKPz21tdwSMPv0T3fsIro0D?usp=sharing#scrollTo=6qJRPd9-sEdK) still gives a CUDA out-of-memory error even after DeepSpeed is turned on? See [this comment](https://github.com/microsoft/DeepSpeed/issues/2029#issuecomment-1229470437) for historical tracking purposes.
Continuing the refactor of distributed unit tests started in #2141 and #2180. Also includes a fix for the broken nightly test (lm-eval).
**Describe the bug** Similar to #2233 and #2133, I'm seeing garbage output when using multi-GPU fp16 inference for GPT-NeoX. Running the script below with GPT-Neo-2.7B instead of GPT-NeoX works fine. Output...
Hi, I tested the native AllReduce (deepspeed.comm.all_reduce) and the compressed AllReduce (backend.compressed_allreduce) in DeepSpeed with [this test script](https://github.com/microsoft/DeepSpeed/blob/master/tests/onebit/test_nccl_perf.py). On a ROCm system, we observed a 414% performance improvement when switching from...
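The speedup reported above comes from sending far fewer bits per element. The compression used by DeepSpeed's one-bit optimizers reduces each value to its sign plus a single scale (the mean absolute magnitude), carrying the quantization residual forward as error feedback. A minimal single-process sketch of just that quantization step, using a hypothetical `one_bit_compress` helper, not the actual `backend.compressed_allreduce` implementation:

```python
def one_bit_compress(values, error):
    """Quantize (values + carried error) to sign * scale.

    Returns the sign vector, the shared scale, and the new
    error-feedback residual to carry into the next step.
    """
    corrected = [v + e for v, e in zip(values, error)]
    # One scale for the whole chunk: mean absolute magnitude.
    scale = sum(abs(c) for c in corrected) / len(corrected)
    signs = [1.0 if c >= 0 else -1.0 for c in corrected]
    # What the receiver would reconstruct from sign * scale.
    decompressed = [s * scale for s in signs]
    # Error feedback: remember what quantization threw away.
    new_error = [c - d for c, d in zip(corrected, decompressed)]
    return signs, scale, new_error

signs, scale, err = one_bit_compress([0.5, -1.5, 2.0, -0.25], [0.0] * 4)
# signs carry 1 bit each; only `scale` is sent at full precision.
```

Over many iterations the error-feedback buffer ensures the quantization noise averages out rather than accumulating, which is why the optimizer still converges despite the aggressive 1-bit payload.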
In my training code, I only save & load the model state_dict (no optimizer state). I find this is good enough after a few warmup steps, and it saves lots...
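The checkpointing pattern described above can be sketched with stdlib pickling and hypothetical stand-in dicts (real code would use `model.state_dict()` / `optimizer.state_dict()` from PyTorch): only the model weights go into the checkpoint, and the optimizer's moment buffers are rebuilt during a short warmup after resume.

```python
import io
import pickle

# Hypothetical stand-ins for model.state_dict() and optimizer.state_dict().
model_state = {"layer.weight": [0.1, 0.2], "layer.bias": [0.0]}
optimizer_state = {"exp_avg": [0.01, 0.02], "step": 1000}  # intentionally NOT saved

# Save: the checkpoint holds the model weights only.
buf = io.BytesIO()
pickle.dump(model_state, buf)

# Resume: load weights, then re-initialize the optimizer from scratch;
# its moment estimates are rebuilt during a few warmup steps instead
# of being restored from disk.
buf.seek(0)
restored = pickle.load(buf)
```

Skipping optimizer state roughly halves the checkpoint for Adam-style optimizers (which keep two full-size moment buffers per parameter), at the cost of those warmup steps after each restart.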