
Suspected memory leak during ZeRO-3 training; OOM eventually after several checkpoints

Open leiwen83 opened this issue 2 years ago • 10 comments

Hi,

I am using ZeRO-3 with the latest DeepSpeed 0.9.2 to train on 8 GPUs in one node, and from the memory stats I can see host memory grow a lot after each checkpoint save.
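
For reference, this is roughly how I watch the growth around each save (a minimal sketch; psutil and the placement of the save call are my own additions, not part of the actual training script):

import os
import psutil  # assumed to be installed; only used to read host RSS

def log_rss(tag):
    # Print the resident set size of the current process in GB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    print(f"[{tag}] host RSS: {rss_gb:.1f} GB")

# Around each checkpoint save in the training loop:
log_rss("before save")
# engine.save_16bit_model(output_dir)  # or whatever triggers the checkpoint
log_rss("after save")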

Looking at the code, I found one place that looks like a suspected leak: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L3257. In _zero3_consolidated_16bit_state_dict, a local state_dict variable is allocated to collect the parameters on rank 0, and it is returned to the parent function save_16bit_model, which uses self.checkpoint_engine.save to write it to disk.

Since state_dict is a local variable, is it appropriate to pass the gathered params out to the parent function this way? Could that cause a memory leak?
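
To make the question concrete, the call path I am referring to looks roughly like this (paraphrased from my reading of engine.py, not the actual source; the explicit cleanup at the end is only a guess at a mitigation):

import gc
import torch.distributed as dist

def save_16bit_model_sketch(engine, path):
    # _zero3_consolidated_16bit_state_dict gathers the full 16-bit params
    # into an ordinary dict that only rank 0 fills in.
    state_dict = engine._zero3_consolidated_16bit_state_dict()
    if dist.get_rank() == 0:
        engine.checkpoint_engine.save(state_dict, path)
    # The dict only goes out of scope when this function returns; if host
    # memory keeps climbing, explicitly dropping it and collecting could
    # help rule out a lingering reference.
    del state_dict
    gc.collect()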

Thx, Lei

leiwen83 avatar May 21 '23 08:05 leiwen83

Same problem here.

lw3259111 avatar May 22 '23 05:05 lw3259111

Hi @leiwen83, @lw3259111, thank you for reporting this issue. Do you have a small example that reproduces it?

ShijieZZZZ avatar May 26 '23 23:05 ShijieZZZZ

@ShijieZZZZ you can reproduce it by fine-tuning a 65B LLaMA model with FastChat (https://github.com/lm-sys/FastChat). Hardware: 8x A800 80GB GPUs, 2TB of host memory.

deepspeed --num_gpus=8 --master_port=20001 fastchat/train/train_mem.py \
    --model_name_or_path {model_path} \
    --data_path {data_file} \
    --bf16 True \
    --output_dir {output_dir} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 12 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --logging_steps 1 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --deepspeed "offload_opt_param.json" \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

ZeRO-3 config:

{
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": "auto",
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": false
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 5,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

lw3259111 avatar May 29 '23 03:05 lw3259111

@leiwen83 @lw3259111 Did you figure out how to solve this? I ran into the same issue.

aneet-javis avatar Dec 15 '23 07:12 aneet-javis