
[BUG] Trying to finetune LLaMA 33B on 8*A100 40G with 600G RAM, but always OOM on RAM.

Open · Dominic789654 opened this issue 2 years ago • 8 comments

I am fine-tuning the 33B LLaMA model on a server with 8*A100 40G GPUs and 600GB RAM, but I keep running into RAM OOM. I am mainly using the default ZeRO-3 config template.

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

I've tried modifying this config so that parameters are not offloaded and only the optimizer is offloaded, either to CPU or to NVMe. However, none of these attempts has been successful; they all still result in RAM OOM. Do you have any suggestions for my situation?
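For reference, the NVMe variant of the optimizer offload I tried looks roughly like the sketch below; the file names, nvme_path, and buffer_count are placeholders, not the exact values I used:

import json

# Load the ZeRO-3 config above, point the optimizer offload at a fast local NVMe
# mount instead of CPU RAM, and write the modified config back out.
with open("zero3.json") as f:            # "zero3.json" is a placeholder filename
    ds_config = json.load(f)

ds_config["zero_optimization"]["offload_optimizer"] = {
    "device": "nvme",
    "nvme_path": "/local_nvme",          # must be a fast NVMe mount point
    "pin_memory": True,
    "buffer_count": 4,
}

with open("zero3_nvme.json", "w") as f:
    json.dump(ds_config, f, indent=2)

Note that offloading the optimizer to NVMe only relocates optimizer state; it does not reduce the host RAM needed while the model weights are being loaded, which is what the later comments converge on.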

Dominic789654 avatar May 04 '23 16:05 Dominic789654

same question

hujunchao avatar May 05 '23 07:05 hujunchao

Have you solved this problem?

KelleyYin avatar May 11 '23 13:05 KelleyYin

Can you please share a stack trace?

Also, please try setting all pin_memory to false.

tjruwase avatar May 15 '23 15:05 tjruwase

Can you please share a stack trace?

Also, please try setting all pin_memory to false.

I think the RAM OOM is happening because DeepSpeed is trying to load eight copies of the model at the same time (one per GPU process), which leaves the CPU memory without enough space for offloading. Is there a way to tell DeepSpeed to load the models one by one?
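As a rough illustration of what I mean (not an existing DeepSpeed flag), loading could be staggered across ranks with a barrier; load_model_serially and the checkpoint name are made up for the example, and it assumes the script is launched with the deepspeed launcher:

import deepspeed
import torch
import torch.distributed as dist
from transformers import AutoModelForCausalLM

deepspeed.init_distributed()  # assumes launch via the deepspeed/torchrun launcher

def load_model_serially(model_name: str):
    """Let only one rank read the checkpoint into host RAM at a time."""
    rank, world_size = dist.get_rank(), dist.get_world_size()
    model = None
    for turn in range(world_size):
        if rank == turn:
            model = AutoModelForCausalLM.from_pretrained(
                model_name,
                torch_dtype=torch.float16,
                low_cpu_mem_usage=True,  # avoid a transient extra copy while loading
            )
        dist.barrier()  # everyone waits for the current rank to finish
    return model

model = load_model_serially("huggyllama/llama-30b")  # placeholder checkpoint name

This only flattens the transient peak; each rank still ends up holding a full copy of the weights, so for a 33B model the bigger win is letting ZeRO-3 partition the weights while they are being loaded (see the zero.Init sketch further down, after the first traceback).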

Dominic789654 avatar May 15 '23 15:05 Dominic789654

@Dominic789654, what you suggest is theoretically possible. However, without seeing the code, it is unclear to me whether DeepSpeed is actually loading the checkpoints, as opposed to HF for example. So a stack trace, at a minimum, would be helpful to understand what is actually going on. Thanks!

tjruwase avatar May 15 '23 16:05 tjruwase

@Dominic789654 you may try my latest PR: https://github.com/microsoft/DeepSpeed/pull/3629. This patch allows loading the checkpoint serially, so that resuming training from a checkpoint does not hit a memory peak.

leiwen83 avatar May 30 '23 01:05 leiwen83

@tjruwase Almost the same setting (finetuning llama 33b on 8*A100 40G, 670G RAM). It looks like it reports CUDA OOM while moving the model to the GPUs (33B requires at least 66GB of memory in fp16). Neither stage3_max_live_parameters nor offloading (to cpu or nvme) matters. For some reason, is_zero3_model at engine.py L1048 is False even though I set stage 3 in the config.

Initializing deepspeed took 18.02s
Traceback (most recent call last):
  File "train_deepspeed.py", line 323, in train
    model_engine, optimizer, _, scheduler = deepspeed.initialize(config=args.deepspeed_config, model=model,
  File "/export/home/project/llm/DeepSpeed/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/export/home/project/llm/DeepSpeed/deepspeed/runtime/engine.py", line 267, in __init__
    self._configure_distributed_model(model)
  File "/export/home/project/llm/DeepSpeed/deepspeed/runtime/engine.py", line 1049, in _configure_distributed_model
    self.module.to(self.device)
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1878, in to
    return super().to(*args, **kwargs)
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 228.00 MiB (GPU 6; 39.59 GiB total capacity; 38.14 GiB already allocated; 226.12 MiB free; 38.36 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
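One common cause of this pattern is that the model is fully materialized before deepspeed.initialize, so the engine then tries to move complete fp16 weights onto every GPU instead of partitioning them. Below is a minimal sketch of partition-aware loading through the HF integration; the checkpoint name and config numbers are placeholders, and in transformers releases from this period HfDeepSpeedConfig lives in transformers.deepspeed (newer releases expose it from transformers.integrations):

import deepspeed
import torch
from transformers import AutoModelForCausalLM
from transformers.deepspeed import HfDeepSpeedConfig  # transformers.integrations in newer releases

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": False},
        "offload_optimizer": {"device": "cpu", "pin_memory": False},
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
}

# Keep this object alive *before* from_pretrained: it makes transformers build the
# model under deepspeed.zero.Init(), so each rank only materializes its own shard.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-30b",        # placeholder checkpoint
    torch_dtype=torch.float16,
)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

With the HF Trainer the same effect is automatic, provided the TrainingArguments object carrying the ZeRO-3 config exists before from_pretrained is called.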

memray avatar Jun 13 '23 05:06 memray

My config file:

{
  "fp16": {
    "enabled": true,
    "auto_cast": false,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "betas": [
        0.9,
        0.999
      ],
      "eps": 1e-8,
      "weight_decay": 0.1
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.00004,
      "warmup_num_steps": 300,
      "warmup_type": "linear",
      "total_num_steps": 3000
    }
  },
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_scatter": true,
    "allgather_bucket_size": 5e8,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "sub_group_size": 1e11,
    "stage3_gather_16bit_weights_on_model_save": true,
    "offload_param": {
      "device": "cpu",
      "pin_memory": false
    },
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": false
    }
  },
  "gradient_clipping": 1,
  "steps_per_print": 10,
  "wall_clock_breakdown": false,
  "compression_training": {
    "weight_quantization": {
      "shared_parameters": {},
      "different_groups": {}
    },
    "activation_quantization": {
      "shared_parameters": {},
      "different_groups": {}
    },
    "sparse_pruning": {
      "shared_parameters": {},
      "different_groups": {}
    },
    "row_pruning": {
      "shared_parameters": {},
      "different_groups": {}
    },
    "head_pruning": {
      "shared_parameters": {},
      "different_groups": {}
    },
    "channel_pruning": {
      "shared_parameters": {},
      "different_groups": {}
    }
  },
  "train_batch_size": 128,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 16
}


memray avatar Jun 13 '23 06:06 memray

Wow, this started in May and still hasn't been closed; the DeepSpeed folks are really slow!

djaym7 avatar Jul 21 '23 23:07 djaym7

Has anyone found a solution? I'm trying to finetune 33B on 8 * A100 40G with 900G RAM. It consumes 680GB during training, and I cannot save at the end because saving leads to OOM on 900G of RAM...

LuJunru avatar Jul 23 '23 04:07 LuJunru

I can't even train a 3B model with the same config posted here.

djaym7 avatar Jul 23 '23 04:07 djaym7

@djaym7 I can train 3B, 7B and 13B in the same environment. These three models consume a normal amount of RAM, e.g. 100G ~ 200G. However, 33B dramatically consumes over 600G of CPU RAM. I think this is because the 33B model is larger than a single A100 (40G), which leads to unknown errors.
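For a rough sense of scale, the optimizer state alone already accounts for a few hundred GB of that (a back-of-the-envelope sketch that ignores activations, gradients, and any load-time duplication):

# Rough memory arithmetic for a 33B-parameter model trained with Adam under
# ZeRO-3 + CPU offload (ignores activations, gradients, and loading overhead).
params = 33e9
fp16_weights_gib = params * 2 / 2**30             # ~61 GiB of 16-bit weights
cpu_optimizer_gib = params * (4 + 4 + 4) / 2**30  # fp32 master weights + Adam m, v: ~369 GiB
print(f"fp16 weights: {fp16_weights_gib:.0f} GiB")
print(f"CPU-offloaded optimizer state: {cpu_optimizer_gib:.0f} GiB")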

LuJunru avatar Jul 23 '23 04:07 LuJunru

Before the llama implementation was merged into mega-ds, we implemented another llama in our private repo, and we found that you can train at most a 13B llama without offloading on 8x 40GB A100. So I guess you just can't.

nrailg avatar Jul 23 '23 04:07 nrailg

@nrailgun Have you tried it with offload? In my case, I offload the optimizer to RAM for 33B, and it does train smoothly. The issue occurs during saving.
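For the save-time peak specifically, there are lower-RAM options than gathering full fp32 weights on one rank. A sketch is below; engine stands for the DeepSpeedEngine returned by deepspeed.initialize, and the paths are placeholders:

from deepspeed import DeepSpeedEngine

def save_without_full_fp32_gather(engine: DeepSpeedEngine, save_dir: str) -> None:
    # Option A: write the sharded ZeRO checkpoint (each rank saves only its own
    # shard, so no single process gathers the full model), then convert offline
    # with the zero_to_fp32.py script DeepSpeed drops into save_dir:
    #     python <save_dir>/zero_to_fp32.py <save_dir> pytorch_model.bin
    engine.save_checkpoint(save_dir)

    # Option B: gather only the 16-bit weights on rank 0 (requires
    # "stage3_gather_16bit_weights_on_model_save": true in the ZeRO-3 config).
    # In practice you would pick either A or B, not run both.
    engine.save_16bit_model(save_dir, "pytorch_model.bin")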

LuJunru avatar Jul 23 '23 04:07 LuJunru

I am likely doing something wrong. @LuJunru, do you have your training code on GitHub?

djaym7 avatar Jul 23 '23 07:07 djaym7

@djaym7 Not yet. I recommend you follow Alpaca: https://github.com/tatsu-lab/stanford_alpaca. Most of the settings are similar.

LuJunru avatar Jul 23 '23 07:07 LuJunru

Thanks, I was trying DeepSpeed stage 1 and 2; I will try out FSDP in the Trainer too.

djaym7 avatar Jul 23 '23 08:07 djaym7

Has anyone found a solution? I'm trying to finetune 33B on 8 * A100 40G with 900G RAM. It consumes 680GB during training, and I cannot save at the end because saving leads to OOM on 900G of RAM...

@djaym7 OK, I found that another process was blocking my saving. I can briefly report here that I used 750G ~ 800G of RAM for training and saving (seq len 2048). It can be finetuned on a single node with 8 * A100 40G. If you don't have that much RAM, try using multiple nodes; DeepSpeed can split the RAM consumption across nodes.

LuJunru avatar Jul 24 '23 12:07 LuJunru

@LuJunru how did you make it work on 8*A100 40G? Did you use exactly the same config as this one?

memray avatar Jul 25 '23 00:07 memray

@memray Exactly. I used DeepSpeed ZeRO-3 offloads + flash attention.

LuJunru avatar Jul 25 '23 01:07 LuJunru

@LuJunru I get a CUDA OOM error every time, even on 16-GPU nodes. It moves the model to the GPUs during initialization, even though I use stage 3. I will try flash attention.

│ /export/home/project/llm/DeepSpeed/deepspeed/runtime/engine.py:268 in        │
│ __init__                                                                     │
│                                                                              │
│    265 │   │   self.pipeline_parallelism = isinstance(model, PipelineModule) │
│    266 │   │                                                                 │
│    267 │   │   # Configure distributed model                                 │
│ ❱  268 │   │   self._configure_distributed_model(model)                      │
│    269 │   │                                                                 │
│    270 │   │   self._get_model_parameters()                                  │
│    271                                                                       │
│                                                                              │
│ /export/home/project/llm/DeepSpeed/deepspeed/runtime/engine.py:1069 in       │
│ _configure_distributed_model                                                 │
│                                                                              │
│   1066 │   │                                                                 │
│   1067 │   │   # zero.Init() handles device placement of model               │
│   1068 │   │   if not self.dont_change_device:                               │
│ ❱ 1069 │   │   │   self.module.to(self.device)                               │
│   1070 │   │                                                                 │
│   1071 │   │   # MoE related initialization                                  │
│   1072 │   │   for _, module in self.module.named_modules():                 │

memray avatar Jul 25 '23 02:07 memray

@memray You could probably test the official strategies; here's one from HF: https://huggingface.co/docs/transformers/main_classes/deepspeed#how-to-choose-which-zero-stage-and-offloads-to-use-for-best-performance:

First of all, set batch size to 1 (you can always use gradient accumulation for any desired effective batch size).

1. Enable --gradient_checkpointing 1 (HF Trainer) or directly model.gradient_checkpointing_enable(). If OOM then
2. Try ZeRO stage 2 first. If OOM then
3. Try ZeRO stage 2 + offload_optimizer. If OOM then
4. Switch to ZeRO stage 3. If OOM then
5. Enable offload_param to cpu. If OOM then
6. Enable offload_optimizer to cpu. If OOM then
7. If you still can't fit a batch size of 1, first check various default values and lower them if you can. For example, if you use generate and you don't use a wide search beam, make it narrower as it'd take a lot of memory.
8. Definitely use mixed half-precision over fp32 - so bf16 on Ampere and higher GPUs and fp16 on older GPU architectures.
9. If you still OOM you could add more hardware or enable ZeRO-Infinity - that is, switch offloads offload_param and offload_optimizer to nvme. You need to make sure it's a very fast NVMe. As an anecdote, I was able to infer BLOOM-176B on a tiny GPU using ZeRO-Infinity, except it was extremely slow. But it worked!

In my experience, it works at step 6 (see the rough config sketch below).
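Roughly, step 6 amounts to something like this; the batch, accumulation, and clipping numbers are placeholders, not the exact values I ran with:

import json

# ZeRO-3 with both parameter and optimizer offload to CPU, bf16 on A100s, and a
# micro batch size of 1 (use gradient accumulation for the effective batch size).
# Remember to also call model.gradient_checkpointing_enable() per step 1 above.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": False},
        "offload_optimizer": {"device": "cpu", "pin_memory": False},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_clipping": 1.0,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
}

with open("zero3_offload.json", "w") as f:  # placeholder filename
    json.dump(ds_config, f, indent=2)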

LuJunru avatar Jul 25 '23 02:07 LuJunru

@LuJunru Hi, does this mean you have successfully finetuned a 33B-parameter model using ZeRO stage 3 + offloaded optimizer & params on 8 * A100 40G + 600G CPU RAM? I used 8 * A100 80G + 1T RAM, but still encountered CPU RAM OOM (exitcode: -9). Would you mind sharing your environment configuration, such as the versions of deepspeed, flash-attn, and cuda? Also, did you use bf16? Thank you very much!

s1ghhh avatar Jul 25 '23 03:07 s1ghhh

@s1ghhh Sure. Here are some configs:

deepspeed: 0.9.2
torch: 2.0.1 (flash attention is built into it)
cuda: V11.3.109

I used about 800G of CPU RAM with batch size 8 and gradient accumulation 2, and received a memory pressure warning. Reducing the batch size will help. I guess you could run with batch size 8 under 1T of RAM.

LuJunru avatar Jul 25 '23 03:07 LuJunru

@LuJunru Many thanks! Would you mind sharing your DeepSpeed script, please? I have tried other scripts from this issue and DeepSpeed's official default script, but I am hoping to rule out any issues related to the DeepSpeed configuration script. Thank you again for your willingness to share. In any case, I will make an effort to try it out and publish the results.

s1ghhh avatar Jul 25 '23 03:07 s1ghhh

@s1ghhh I'm afraid I can't right now. We hope to release it next month.

LuJunru avatar Jul 25 '23 03:07 LuJunru

@LuJunru I understand your situation. Thanks again.

s1ghhh avatar Jul 25 '23 03:07 s1ghhh

@LuJunru thanks for sharing the information. My code got stuck here (as shown below): it moves the whole model to the GPU during initialization, so training hasn't even started. I don't really understand why it behaves this way... By the way, could you let me know which Hugging Face checkpoint you are using? Is it huggyllama/llama-30b?

# zero.Init() handles device placement of model
if not self.dont_change_device:
    self.module.to(self.device)

memray avatar Jul 25 '23 06:07 memray

@memray I used to run into similar issues. In my case, it was caused by the environment variable CUDA_LAUNCH_BLOCKING=1; not sure about your case. I fine-tuned Vicuna 33B.

LuJunru avatar Jul 25 '23 06:07 LuJunru

@LuJunru Thanks! But it didn't work out for me :( One last thing to confirm: are you doing full-model tuning or LoRA?

memray avatar Jul 25 '23 07:07 memray