[BUG] Parameter memory is not freed with ZeRO3 + parameter offload in PyTorch 1.9
Describe the bug
When training LLaMA 13B (https://github.com/ymcui/Chinese-LLaMA-Alpaca), parameter memory is not freed when using the ZeRO3 + parameter offload strategy on PyTorch 1.9, but it is freed on PyTorch 1.13 with the same DeepSpeed configuration. The fix discussed in https://github.com/microsoft/DeepSpeed/issues/3002 does not resolve this bug.
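For context, the symptom can also be observed outside the full training script with a minimal sketch like the one below (toy model, inline config; the file name check_zero3_free.py is a placeholder, and ds_status / NOT_AVAILABLE are DeepSpeed-internal parameter states used here only for inspection):

# check_zero3_free.py -- launch with: deepspeed --num_gpus 1 check_zero3_free.py
import torch
import torch.nn as nn
import deepspeed

# Toy stand-in for the 13B model; the only question is whether ZeRO3
# re-partitions (frees) the gathered full parameters after the forward pass.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
optimizer = torch.optim.Adam(model.parameters(), lr=2.34e-6)

ds_config = {
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

engine, _, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config=ds_config)

x = torch.randn(1, 1024, dtype=torch.half, device=engine.device)
print("GPU memory before forward:", torch.cuda.memory_allocated())
out = engine(x)
print("GPU memory after forward :", torch.cuda.memory_allocated())
# ds_status is the state DeepSpeed attaches to ZeRO3-managed parameters; after
# forward they are expected to be back in the partitioned (NOT_AVAILABLE) state.
for name, p in engine.module.named_parameters():
    print(name, p.ds_status)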
To Reproduce
DeepSpeed 0.9.2 + PyTorch 1.9 + peft 0.3 + transformers 4.28.1. ds_config:
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_batch_size": 1,
  "train_micro_batch_size_per_gpu": 1,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  }
}
Run the LLaMA 13B model.
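The "auto" entries in this config are resolved by the HuggingFace Trainer integration. A minimal, self-contained sketch of that wiring is below (toy GPT-2 instead of LLaMA; toy_repro.py and the toy sizes are placeholders, and the actual reproduction uses run_clm_pt_with_peft.py as shown further down):

# toy_repro.py -- launch with: deepspeed --num_gpus 1 toy_repro.py
import torch
from torch.utils.data import Dataset
from transformers import GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments

class ToyDataset(Dataset):
    # Random token ids; just enough to drive a few optimizer steps.
    def __init__(self, n=16, seq_len=32, vocab=100):
        self.data = torch.randint(0, vocab, (n, seq_len))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, i):
        ids = self.data[i]
        return {"input_ids": ids, "labels": ids.clone()}

model = GPT2LMHeadModel(GPT2Config(n_layer=2, n_embd=64, n_head=2, vocab_size=100))

args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    max_steps=5,
    learning_rate=2.34e-6,
    fp16=True,                          # fills "fp16.enabled": "auto"
    deepspeed="deepspeed_config.json",  # the ZeRO3 + offload config above
)
Trainer(model=model, args=args, train_dataset=ToyDataset()).train()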
Expected behavior
Parameter memory should be freed on PyTorch 1.9 in the same way it is on PyTorch 1.13.
[Screenshot: PyTorch 1.13, DeepSpeed 0.9.2]
[Screenshot: PyTorch 1.9, DeepSpeed 0.9.2]
ds_report output
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.9.0a0+c3d40fd
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.2, unknown, unknown
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.3
Screenshots
When post_forward_hook is called, parameters are freed in PyTorch 1.13 but cannot be freed in PyTorch 1.9.
[Screenshot: PyTorch 1.13]
[Screenshot: PyTorch 1.9]
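To get the same comparison without screenshots, a forward hook that reports allocated GPU memory after each top-level block can be attached. This is only a sketch (report_allocated and attach_memory_hooks are made-up names, not part of run_clm_pt_with_peft.py or DeepSpeed):

import torch

def report_allocated(module, inputs, output):
    # Note: depending on when this hook is registered relative to DeepSpeed's
    # own post-forward hook, the reading is taken just before or just after
    # ZeRO3 re-partitions this block's parameters.
    mb = torch.cuda.memory_allocated() / 2**20
    print(f"{module.__class__.__name__}: {mb:.1f} MiB allocated after forward")

def attach_memory_hooks(model):
    # Hook only the direct children (e.g. the transformer blocks) to keep the
    # output readable; keep the handles so the hooks can be removed later.
    return [child.register_forward_hook(report_allocated)
            for child in model.children()]

# usage (hypothetical): handles = attach_memory_hooks(model)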
System info (please complete the following information):
- OS: [e.g. Ubuntu 18.04]
- GPU count and types [e.g. one machine with 1x A100]
- Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
- Python version 3.8
- Any other relevant info about your setup
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
deepspeed launcher
Docker context
Are you using a specific docker image that you can share?
NGC 22.07 (PyTorch 1.13) and NGC 21.06 (PyTorch 1.9)
Additional context
Hi @Andy666G, can you provide a script that reproduces this error?
Sure, here is a script @jomayeri. The pretrained model is vicuna-13b, and any ".txt" file can serve as the dataset. I have provided the DeepSpeed config above.
# export CUDA_VISIBLE_DEVICES=0
lora_rank=8
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
#modules_to_save="embed_tokens,lm_head"
# modules_to_save="lm_head"
modules_to_save=""
lora_dropout=0.1
pretrained_model="models/vicuna-13b-all-v1.1/"
chinese_tokenizer_path="models/vicuna-13b-all-v1.1/tokenizer.model"
dataset_dir="/doc"
data_cache="$PWD/cache"
per_device_batch_size=1 # 1024 ,from https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E8%AE%AD%E7%BB%83%E7%BB%86%E8%8A%82
training_steps=7000 # 6000, from https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E8%AE%AD%E7%BB%83%E7%BB%86%E8%8A%82
lr=2.34e-06 # from https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E8%AE%AD%E7%BB%83%E7%BB%86%E8%8A%82
gradient_accumulation_steps=1
output_dir="output"
max_train_samples=${per_device_batch_size}
max_eval_samples=${per_device_batch_size}
#TODO: deepspeed
deepspeed --include localhost:0 --master_port 12688 scripts/run_clm_pt_with_peft.py \
--model_name_or_path ${pretrained_model} \
--tokenizer_name_or_path ${chinese_tokenizer_path} \
--dataset_dir ${dataset_dir} \
--data_cache_dir $data_cache \
--validation_split_percentage 0.001 \
--per_device_train_batch_size ${per_device_batch_size} \
--per_device_eval_batch_size ${per_device_batch_size} \
--do_train \
--debug_mode \
--torch_dtype float16 \
--seed $RANDOM \
--max_steps ${training_steps} \
--lr_scheduler_type cosine \
--learning_rate ${lr} \
--warmup_ratio 0.05 \
--weight_decay 0.01 \
--logging_strategy steps \
--logging_steps 10 \
--save_strategy steps \
--save_total_limit 3 \
--save_steps 1000 \
--gradient_accumulation_steps ${gradient_accumulation_steps} \
--preprocessing_num_workers 8 \
--block_size 512 \
--output_dir ${output_dir} \
--ddp_timeout 30000 \
--logging_first_step True \
--lora_rank ${lora_rank} \
--trainable ${lora_trainable} \
--lora_dropout ${lora_dropout} \
--deepspeed deepspeed_config.json \
--fp16 \
    --overwrite_output_dir
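For reference, the LoRA-related flags above correspond roughly to a peft setup like the following. This is a sketch of the equivalent configuration, not the actual code in run_clm_pt_with_peft.py (in particular, lora_alpha is an assumption, since the launch script does not expose it):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Path taken from the launch script above.
base = AutoModelForCausalLM.from_pretrained("models/vicuna-13b-all-v1.1/")
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                           # --lora_rank
    lora_dropout=0.1,              # --lora_dropout
    lora_alpha=32,                 # assumption, not set by the script flags
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "down_proj", "up_proj"],  # --trainable
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()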
I had the same problem and was very confused.
@Andy666G Sorry, I cannot reproduce this issue. In the description you say "when calling post_forward_hook"; is this a hook you added? Have you raised the issue with PyTorch?
Closing for now, please reopen if needed.