DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Comment in `setup.py` had an incorrect environment variable name to set when building a wheel with a `.dev` specifier.
Hi, I'm trying to do inference with a GPT-3-like model. When I offloaded the parameters, GPU memory usage dropped as I expected. I would like to investigate memory usage and latency by...
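For context, parameter offloading of this kind is enabled through the DeepSpeed config's `zero_optimization` section. The snippet below is a minimal sketch of a ZeRO-3 config with parameters offloaded to CPU memory; the field names follow the standard DeepSpeed config schema, but the specific values are illustrative, not taken from this issue.

```python
# Minimal sketch of a DeepSpeed config with ZeRO-3 parameter offload to CPU.
# Field names follow the DeepSpeed config schema; values are illustrative.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                 # partition parameters, gradients, and optimizer state
        "offload_param": {
            "device": "cpu",        # keep full parameters in host memory
            "pin_memory": True,     # pinned buffers speed up host<->GPU transfers
        },
    },
}
```

With a config like this, parameters live in host memory and are streamed to the GPU only when needed, which is why GPU memory usage drops at the cost of transfer latency.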
Hello everyone, I've always wanted to run large models using minimal GPUs, as I only have a few at my disposal. That is why I was impressed that ZeRO-3 can...
**Describe the bug** There is a problem with asynchronous communication in ZeRO stage 2 when using `overlap_comm`. **To Reproduce** Steps to reproduce the behavior: Use DeepSpeed ZeRO-2 with the Hugging Face...
**Describe the bug** A clear and concise description of what the bug is. **To Reproduce** Steps to reproduce the behavior: 1. use config below to train ``` "train_batch_size": 64, "gradient_accumulation_steps":...
```
/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py:307

304 │     max([
305 │         max(tensor.numel(),
306 │ ...
```
**Describe the bug** I train the model with ZeRO-2 for multi-node training and save the model with `model.save_checkpoint`. When I try to get the state dict from `get_fp32_state_dict_from_zero_checkpoint`, it reports...
**Describe the bug** The CPU memory usage stays the same if I use 1, 2, or 4 GPUs; however, if I use 8 GPUs, the CPU memory usage increases a lot and makes the host...
I am training a 10B model using DeepSpeed with Megatron on A100 GPUs (80 GB). Here is my ds_report. If I use 4 GPUs, the error is CUDA out of memory...
This is an optimization of the reduce behavior in ZeRO stage 2. ZeRO stage 2 assigns ranks to trainable parameters at the initialization stage, distributing the parameters of each parameter group evenly...
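As a rough illustration of what "distributing parameters evenly across ranks" can mean, the sketch below assigns each parameter to the currently least-loaded rank by element count. This is a hypothetical, self-contained sketch for intuition, not DeepSpeed's actual partitioning code.

```python
# Hypothetical sketch: balance parameters across ranks by element count,
# so each rank reduces a roughly equal share of the gradients.
# This illustrates the idea only; it is not DeepSpeed's implementation.

def partition_params(param_sizes, world_size):
    """Greedily assign each parameter to the rank with the fewest
    elements so far, returning (rank per parameter, load per rank)."""
    totals = [0] * world_size      # elements assigned to each rank
    assignment = []                # chosen rank for each parameter
    for size in param_sizes:
        rank = totals.index(min(totals))  # least-loaded rank
        assignment.append(rank)
        totals[rank] += size
    return assignment, totals

# Example: six parameters of varying sizes split across 2 ranks.
ranks, loads = partition_params([100, 50, 50, 100, 25, 25], world_size=2)
```

A greedy least-loaded assignment like this keeps the per-rank reduce workload balanced even when parameter sizes vary widely, which is the motivation behind optimizing the stage-2 reduce behavior.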