[BUG] Parameter memory is not freed with ZeRO3 + parameter offload in PyTorch 1.9
Describe the bug
When training LLaMA 13B (https://github.com/ymcui/Chinese-LLaMA-Alpaca), parameter memory is not freed when using the ZeRO3 + parameter offload strategy on PyTorch 1.9, but it is freed on PyTorch 1.13 with the same DeepSpeed configuration. The fix discussed in https://github.com/microsoft/DeepSpeed/issues/3002 does not resolve this bug.
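For context, the symptom can also be observed outside the full training script with a minimal sketch like the one below (toy model, inline config; the file name check_zero3_free.py is a placeholder, and ds_status / NOT_AVAILABLE are DeepSpeed-internal parameter states used here only for inspection):

# check_zero3_free.py -- launch with: deepspeed --num_gpus 1 check_zero3_free.py
import torch
import torch.nn as nn
import deepspeed

# Toy stand-in for the 13B model; the only question is whether ZeRO3
# re-partitions (frees) the gathered full parameters after the forward pass.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
optimizer = torch.optim.Adam(model.parameters(), lr=2.34e-6)

ds_config = {
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

engine, _, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config=ds_config)

x = torch.randn(1, 1024, dtype=torch.half, device=engine.device)
print("GPU memory before forward:", torch.cuda.memory_allocated())
out = engine(x)
print("GPU memory after forward :", torch.cuda.memory_allocated())
# ds_status is the state DeepSpeed attaches to ZeRO3-managed parameters; after
# forward they are expected to be back in the partitioned (NOT_AVAILABLE) state.
for name, p in engine.module.named_parameters():
    print(name, p.ds_status)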
To Reproduce
DeepSpeed 0.9.2 + PyTorch 1.9 + peft 0.3 + transformers 4.28.1. ds_config:
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_batch_size": 1,
  "train_micro_batch_size_per_gpu": 1,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  }
}
Run the LLaMA 13B model.
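The "auto" entries in this config are resolved by the HuggingFace Trainer integration. A minimal, self-contained sketch of that wiring is below (toy GPT-2 instead of LLaMA; toy_repro.py and the toy sizes are placeholders, and the actual reproduction uses run_clm_pt_with_peft.py as shown further down):

# toy_repro.py -- launch with: deepspeed --num_gpus 1 toy_repro.py
import torch
from torch.utils.data import Dataset
from transformers import GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments

class ToyDataset(Dataset):
    # Random token ids; just enough to drive a few optimizer steps.
    def __init__(self, n=16, seq_len=32, vocab=100):
        self.data = torch.randint(0, vocab, (n, seq_len))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, i):
        ids = self.data[i]
        return {"input_ids": ids, "labels": ids.clone()}

model = GPT2LMHeadModel(GPT2Config(n_layer=2, n_embd=64, n_head=2, vocab_size=100))

args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    max_steps=5,
    learning_rate=2.34e-6,
    fp16=True,                          # fills "fp16.enabled": "auto"
    deepspeed="deepspeed_config.json",  # the ZeRO3 + offload config above
)
Trainer(model=model, args=args, train_dataset=ToyDataset()).train()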
Expected behavior
Parameter memory should be freed on PyTorch 1.9 in the same way it is on PyTorch 1.13.
[Screenshot: PyTorch 1.13, DeepSpeed 0.9.2]
[Screenshot: PyTorch 1.9, DeepSpeed 0.9.2]
ds_report output
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.9.0a0+c3d40fd
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.2, unknown, unknown
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.3
Screenshots
When post_forward_hook is called, parameters are freed in PyTorch 1.13 but cannot be freed in PyTorch 1.9.
[Screenshot: PyTorch 1.13]
[Screenshot: PyTorch 1.9]
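To get the same comparison without screenshots, a forward hook that reports allocated GPU memory after each top-level block can be attached. This is only a sketch (report_allocated and attach_memory_hooks are made-up names, not part of run_clm_pt_with_peft.py or DeepSpeed):

import torch

def report_allocated(module, inputs, output):
    # Note: depending on when this hook is registered relative to DeepSpeed's
    # own post-forward hook, the reading is taken just before or just after
    # ZeRO3 re-partitions this block's parameters.
    mb = torch.cuda.memory_allocated() / 2**20
    print(f"{module.__class__.__name__}: {mb:.1f} MiB allocated after forward")

def attach_memory_hooks(model):
    # Hook only the direct children (e.g. the transformer blocks) to keep the
    # output readable; keep the handles so the hooks can be removed later.
    return [child.register_forward_hook(report_allocated)
            for child in model.children()]

# usage (hypothetical): handles = attach_memory_hooks(model)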
System info (please complete the following information):
- OS: [e.g. Ubuntu 18.04]
- GPU count and types [e.g. one machine with 1x A100]
- Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
- Python version 3.8
- Any other relevant info about your setup
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
deepspeed launcher
Docker context
Are you using a specific docker image that you can share?
NGC 22.07 (PyTorch 1.13) and NGC 21.06 (PyTorch 1.9)
Additional context
Hi @Andy666G, can you provide a script that reproduces this error?
Sure, here is a script @jomayeri. The pretrained model is vicuna-13b, and any ".txt" file can serve as the dataset. I have provided the DeepSpeed config above.
# export CUDA_VISIBLE_DEVICES=0
lora_rank=8
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
#modules_to_save="embed_tokens,lm_head"
# modules_to_save="lm_head"
modules_to_save=""
lora_dropout=0.1
pretrained_model="models/vicuna-13b-all-v1.1/"
chinese_tokenizer_path="models/vicuna-13b-all-v1.1/tokenizer.model"
dataset_dir="/doc"
data_cache="$PWD/cache"
per_device_batch_size=1 # 1024 ,from https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E8%AE%AD%E7%BB%83%E7%BB%86%E8%8A%82
training_steps=7000 # 6000, from https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E8%AE%AD%E7%BB%83%E7%BB%86%E8%8A%82
lr=2.34e-06 # from https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E8%AE%AD%E7%BB%83%E7%BB%86%E8%8A%82
gradient_accumulation_steps=1
output_dir="output"
max_train_samples=${per_device_batch_size}
max_eval_samples=${per_device_batch_size}
#TODO: deepspeed
deepspeed --include localhost:0 --master_port 12688 scripts/run_clm_pt_with_peft.py \
--model_name_or_path ${pretrained_model} \
--tokenizer_name_or_path ${chinese_tokenizer_path} \
--dataset_dir ${dataset_dir} \
--data_cache_dir $data_cache \
--validation_split_percentage 0.001 \
--per_device_train_batch_size ${per_device_batch_size} \
--per_device_eval_batch_size ${per_device_batch_size} \
--do_train \
--debug_mode \
--torch_dtype float16 \
--seed $RANDOM \
--max_steps ${training_steps} \
--lr_scheduler_type cosine \
--learning_rate ${lr} \
--warmup_ratio 0.05 \
--weight_decay 0.01 \
--logging_strategy steps \
--logging_steps 10 \
--save_strategy steps \
--save_total_limit 3 \
--save_steps 1000 \
--gradient_accumulation_steps ${gradient_accumulation_steps} \
--preprocessing_num_workers 8 \
--block_size 512 \
--output_dir ${output_dir} \
--ddp_timeout 30000 \
--logging_first_step True \
--lora_rank ${lora_rank} \
--trainable ${lora_trainable} \
--lora_dropout ${lora_dropout} \
--deepspeed deepspeed_config.json \
--fp16 \
    --overwrite_output_dir
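For reference, the LoRA-related flags above correspond roughly to a peft setup like the following. This is a sketch of the equivalent configuration, not the actual code in run_clm_pt_with_peft.py (in particular, lora_alpha is an assumption, since the launch script does not expose it):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Path taken from the launch script above.
base = AutoModelForCausalLM.from_pretrained("models/vicuna-13b-all-v1.1/")
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                           # --lora_rank
    lora_dropout=0.1,              # --lora_dropout
    lora_alpha=32,                 # assumption, not set by the script flags
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "down_proj", "up_proj"],  # --trainable
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()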
I had the same problem and was very confused.
@Andy666G Sorry, I cannot reproduce this issue. In the description you say "when calling post_forward_hook"; is this a hook you added? Have you raised the issue with PyTorch?
Closing for now, please reopen if needed.