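PRO training hits OOM
Hello, I get an OOM when running the PRO training code. I am using 80 GB A800 GPUs to train a 13B model, which in principle should fit. I set the batch size to 1 and block_size to 100 and it still runs out of memory, and I cannot tell where the problem is. train_hh.sh: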
export OMP_NUM_THREADS=16
root_dir=..
#stage 23
id=$1
data_path=$2
ranking_len=$3
mkdir -p $root_dir/logs/$id/$ranking_len
# --main_process_port 29534 \
CUDA_VISIBLE_DEVICES=4,5,7 accelerate launch --num_processes 2 --config_file ds_config.yaml --main_process_port=29534 main.py \
--task hh \
--train_file_path $root_dir/data/${data_path} \
--validation_file_path $root_dir/data/hh_dev \
--validation_file_name sampled_dev.json \
--output_dir $root_dir/checkpoints/index_$id/stage_$ranking_len \
--log_path $root_dir/logs/$id/$ranking_len \
--index $id \
--seed 42 \
--temperature 1 \
--sft_weight 0.05 \
--num_train_epochs 2 \
--training_stage_num $ranking_len \
--block_size 100 \
--learning_rate 5e-6 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--model_name_or_path /mnt/data2/finLLM/models/tigerbot-13b-base \
--do_train \
--do_validation > $root_dir/logs/$id/$ranking_len/train_detail.log 2>&1
Log:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/utils/dataclasses.py:541: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/utils/dataclasses.py:541: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/utils/dataclasses.py:541: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
task : hh
do_train : True
do_validation : True
sft_weight : 0.05
index : exp001
seed : 42
temperature : 1.0
training_stage_num : 2
train_file_path : ../data/hh_train_len2
validation_file_path : ../data/hh_dev
validation_file_name : sampled_dev.json
model_name_or_path : /mnt/data2/finLLM/models/tigerbot-13b-base
per_device_train_batch_size : 1
per_device_eval_batch_size : 1
learning_rate : 5e-06
block_size : 100
num_train_epochs : 2
max_train_steps : None
gradient_accumulation_steps : 8
output_dir : ../checkpoints/index_exp001/stage_2
checkpointing_step : 600
log_path : ../logs/exp001/2
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards: 33%|███▎ | 1/3 [00:19<00:39, 19.66s/it]
Loading checkpoint shards: 33%|███▎ | 1/3 [00:21<00:42, 21.21s/it]
Loading checkpoint shards: 33%|███▎ | 1/3 [00:21<00:43, 21.95s/it]
Loading checkpoint shards: 67%|██████▋ | 2/3 [00:35<00:17, 17.49s/it]
Loading checkpoint shards: 67%|██████▋ | 2/3 [00:42<00:21, 21.40s/it]
Loading checkpoint shards: 67%|██████▋ | 2/3 [00:44<00:22, 22.51s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:46<00:00, 14.24s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:46<00:00, 15.34s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:57<00:00, 18.19s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:57<00:00, 19.11s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:58<00:00, 18.65s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:58<00:00, 19.57s/it]
[info] args:
Namespace(task='hh', do_train=True, do_validation=True, sft_weight=0.05, index='exp001', seed=42, temperature=1.0, training_stage_num=2, train_file_path='../data/hh_train_len2', validation_file_path='../data/hh_dev', validation_file_name='sampled_dev.json', model_name_or_path='/mnt/data2/finLLM/models/tigerbot-13b-base', per_device_train_batch_size=1, per_device_eval_batch_size=1, learning_rate=5e-06, block_size=100, num_train_epochs=2, max_train_steps=None, gradient_accumulation_steps=8, output_dir='../checkpoints/index_exp001/stage_2', checkpointing_step=600, log_path='../logs/exp001/2')
[info] args:
Namespace(task='hh', do_train=True, do_validation=True, sft_weight=0.05, index='exp001', seed=42, temperature=1.0, training_stage_num=2, train_file_path='../data/hh_train_len2', validation_file_path='../data/hh_dev', validation_file_name='sampled_dev.json', model_name_or_path='/mnt/data2/finLLM/models/tigerbot-13b-base', per_device_train_batch_size=1, per_device_eval_batch_size=1, learning_rate=5e-06, block_size=100, num_train_epochs=2, max_train_steps=None, gradient_accumulation_steps=8, output_dir='../checkpoints/index_exp001/stage_2', checkpointing_step=600, log_path='../logs/exp001/2')[info] args:
Namespace(task='hh', do_train=True, do_validation=True, sft_weight=0.05, index='exp001', seed=42, temperature=1.0, training_stage_num=2, train_file_path='../data/hh_train_len2', validation_file_path='../data/hh_dev', validation_file_name='sampled_dev.json', model_name_or_path='/mnt/data2/finLLM/models/tigerbot-13b-base', per_device_train_batch_size=1, per_device_eval_batch_size=1, learning_rate=5e-06, block_size=100, num_train_epochs=2, max_train_steps=None, gradient_accumulation_steps=8, output_dir='../checkpoints/index_exp001/stage_2', checkpointing_step=600, log_path='../logs/exp001/2')
[2024-03-20 02:13:02,161] [INFO] [logging.py:75:log_dist] [Rank -1] DeepSpeed info: version=0.8.1, git-hash=unknown, git-branch=unknown
[2024-03-20 02:13:02,271] [INFO] [logging.py:75:log_dist] [Rank -1] DeepSpeed info: version=0.8.1, git-hash=unknown, git-branch=unknown
[2024-03-20 02:13:02,299] [INFO] [logging.py:75:log_dist] [Rank -1] DeepSpeed info: version=0.8.1, git-hash=unknown, git-branch=unknown
[2024-03-20 02:14:17,786] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-03-20 02:14:17,787] [INFO] [logging.py:75:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2024-03-20 02:14:17,787] [INFO] [logging.py:75:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-03-20 02:14:17,831] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-03-20 02:14:17,831] [INFO] [utils.py:53:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2024-03-20 02:14:17,832] [INFO] [logging.py:75:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-03-20 02:14:17,832] [INFO] [stage_1_and_2.py:145:__init__] Reduce bucket size 500,000,000
[2024-03-20 02:14:17,832] [INFO] [stage_1_and_2.py:146:__init__] Allgather bucket size 500,000,000
[2024-03-20 02:14:17,832] [INFO] [stage_1_and_2.py:147:__init__] CPU Offload: False
[2024-03-20 02:14:17,832] [INFO] [stage_1_and_2.py:148:__init__] Round robin gradient partitioning: False
Using /home/zhengmingjie/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /home/zhengmingjie/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /home/zhengmingjie/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Emitting ninja build file /home/zhengmingjie/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.7705831527709961 seconds
Loading extension module utils...
Time to load utils op: 0.7298614978790283 seconds
Loading extension module utils...
Time to load utils op: 0.7109990119934082 seconds
Rank: 0 partition count [3] and sizes[(4435952640, False)]
Rank: 2 partition count [3] and sizes[(4435952640, False)]
Rank: 1 partition count [3] and sizes[(4435952640, False)]
[2024-03-20 02:15:34,752] [INFO] [utils.py:826:see_memory_usage] Before initializing optimizer states
[2024-03-20 02:15:34,753] [INFO] [utils.py:827:see_memory_usage] MA 41.35 GB Max_MA 49.61 GB CA 49.62 GB Max_CA 50 GB
[2024-03-20 02:15:34,754] [INFO] [utils.py:835:see_memory_usage] CPU Virtual Memory: used = 98.89 GB, percent = 9.8%
Traceback (most recent call last):
File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/main.py", line 45, in <module>
model = process_manager.train()
File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 182, in train
model, optimizer, dataset_length = self.init_prepare_train(
File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 165, in init_prepare_train
model, optimizer, _ = self.accelerator.prepare(
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/accelerator.py", line 1090, in prepare
result = self._prepare_deepspeed(*args)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/accelerator.py", line 1368, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/__init__.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 336, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1292, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1542, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 524, in __init__
self.initialize_optimizer_states()
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 649, in initialize_optimizer_states
self.optimizer.step()
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/optimizer.py", line 280, in wrapper
out = func(*args, **kwargs)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
ret = func(self, *args, **kwargs)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/adamw.py", line 160, in step
self._init_group(
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/adamw.py", line 114, in _init_group
state["exp_avg"] = torch.zeros_like(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 0; 79.15 GiB total capacity; 57.88 GiB already allocated; 5.21 GiB free; 57.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/main.py", line 45, in <module>
model = process_manager.train()
File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 182, in train
model, optimizer, dataset_length = self.init_prepare_train(
File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 165, in init_prepare_train
model, optimizer, _ = self.accelerator.prepare(
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/accelerator.py", line 1090, in prepare
result = self._prepare_deepspeed(*args)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/accelerator.py", line 1368, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/__init__.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 336, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1292, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1542, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 524, in __init__
self.initialize_optimizer_states()
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 649, in initialize_optimizer_states
self.optimizer.step()
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/optimizer.py", line 280, in wrapper
out = func(*args, **kwargs)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
ret = func(self, *args, **kwargs)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/adamw.py", line 160, in step
self._init_group(
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/adamw.py", line 118, in _init_group
state["exp_avg_sq"] = torch.zeros_like(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 1; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.13 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/main.py", line 45, in <module>
model = process_manager.train()
File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 182, in train
model, optimizer, dataset_length = self.init_prepare_train(
File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 165, in init_prepare_train
model, optimizer, _ = self.accelerator.prepare(
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/accelerator.py", line 1090, in prepare
result = self._prepare_deepspeed(*args)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/accelerator.py", line 1368, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/__init__.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 336, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1292, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1542, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 524, in __init__
self.initialize_optimizer_states()
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 649, in initialize_optimizer_states
self.optimizer.step()
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/optimizer.py", line 280, in wrapper
out = func(*args, **kwargs)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
ret = func(self, *args, **kwargs)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/adamw.py", line 160, in step
self._init_group(
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/adamw.py", line 118, in _init_group
state["exp_avg_sq"] = torch.zeros_like(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 2; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.18 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 92834 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 92835 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 92833) of binary: /mnt/data2/miniconda3/envs/pro/bin/python
Traceback (most recent call last):
File "/mnt/data2/miniconda3/envs/pro/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/commands/launch.py", line 900, in launch_command
deepspeed_launcher(args)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/commands/launch.py", line 643, in deepspeed_launcher
distrib_run.run(args)
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-20_02:15:40
host : oem
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 92833)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
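Hello, I have also replied to you in #69. Judging from this log, the OOM occurs in the accelerator's prepare call, which is where DeepSpeed distributes the LLM to each GPU. I see that your command sets --num_processes 2, i.e. 2 GPUs are used to train the LLM, so the problem should not be in the training loop itself (and is therefore unrelated to batch_size and block_size). You could check whether the following helps:
- Write a bare script that only uses accelerator.prepare to initialize your LLM checkpoint, and see whether the OOM can be reproduced.
- If that script still hits OOM, a single GPU cannot hold the 13B model (possibly because ZeRO-2 keeps a full copy of the model parameters on every GPU, so a high-precision dtype can already cause OOM). In that case, try ZeRO-3 and a lower precision (you can pass the dtype directly to from_pretrained in the init of process_manager.py).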
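The numbers in the log above allow a rough sanity check of this. The sketch below is an illustrative estimate, not DeepSpeed's own accounting: it assumes that each rank holds the full bf16 model plus four fp32 buffers of one partition each (master weights, a gradient buffer, and Adam's exp_avg and exp_avg_sq). That four-buffer assumption is inferred from the reported allocations (41.35 GB, then 57.88 GiB, then 74.40 GiB, each roughly 16.53 GiB apart); the function name and dictionary keys are made up for illustration.

```python
GIB = 2**30


def zero2_init_footprint_gib(partition_elems: int, world_size: int) -> dict:
    """Rough per-GPU memory (GiB) at the end of ZeRO-2 optimizer initialization."""
    # Assumption: total parameter count is approximately partition size x number of ranks
    total_params = partition_elems * world_size
    bf16_model = total_params * 2 / GIB      # full bf16 copy of the model on every rank
    fp32_buffer = partition_elems * 4 / GIB  # one fp32 buffer covering this rank's partition
    return {
        "bf16_model": bf16_model,
        "fp32_buffer": fp32_buffer,
        # Assumption: master weights + gradient buffer + exp_avg + exp_avg_sq
        "total_with_4_buffers": bf16_model + 4 * fp32_buffer,
    }


# Partition size and rank count taken from the "partition count [3]" log lines
est = zero2_init_footprint_gib(4435952640, 3)
for name, value in est.items():
    print(f"{name}: {value:.2f} GiB")
```

Under these assumptions a single fp32 buffer matches the 16.53 GiB allocation in the error message, and the 3-GPU total comes out above the 79.15 GiB capacity of one card. Splitting the same parameter count evenly over 8 ranks gives roughly 50 GiB; the 8-GPU partition size is not in the log, so that even split is also an assumption.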
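Hello, thanks for the reply. I tried initializing the 13B model with accelerator.prepare as you suggested. The code is below (generated with GPT, so I am not sure it is correct):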
import os
import time

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer


def main():
    # Initialize accelerator
    accelerator = Accelerator()
    path = "/mnt/data2/finLLM/models/tigerbot-13b-base"
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(path)
    # Load model
    model = AutoModelForCausalLM.from_pretrained(path)
    # Prepare model with accelerator
    model, _ = accelerator.prepare(model, tokenizer)  # Removed the unnecessary unpacking
    # Free up CUDA memory
    torch.cuda.empty_cache()
    # Pause execution for 5 minutes
    print("Model loaded successfully. Pausing execution for 5 minutes.")
    time.sleep(300)  # 300 seconds = 5 minutes


if __name__ == "__main__":
    main()
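This does not OOM; the single A800 uses about 50 GB. The model precision in its config is: "torch_dtype": "bfloat16", which seems to be a fairly common setting? I do not think the cause is model size or a high-precision dtype. I previously trained this same model in someone else's open-source training project, also with ZeRO-2 and 16-bit precision, and never hit OOM. Also, could library versions matter? I ran into bugs while setting up this project and upgraded some libraries: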
Package Version
----------------------- -----------
absl-py 2.1.0
accelerate 0.17.1
aiohttp 3.9.3
aiosignal 1.3.1
annotated-types 0.6.0
async-timeout 4.0.3
attrs 23.2.0
certifi 2024.2.2
charset-normalizer 2.0.4
click 8.1.7
datasets 2.18.0
deepspeed 0.8.1
dill 0.3.6
evaluate 0.4.1
filelock 3.13.1
frozenlist 1.4.1
fsspec 2024.2.0
grpcio 1.62.1
hjson 3.1.0
huggingface-hub 0.21.4
idna 3.4
importlib_metadata 7.0.2
joblib 1.3.2
Markdown 3.6
MarkupSafe 2.1.5
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
multidict 6.0.5
multiprocess 0.70.14
ninja 1.11.1.1
nltk 3.8.1
numpy 1.22.2
packaging 24.0
pandas 2.0.3
peft 0.3.0
pillow 10.2.0
pip 23.3.1
protobuf 5.26.0
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 15.0.2
pyarrow-hotfix 0.6
pydantic 1.10.9
pydantic_core 2.16.3
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2023.12.25
requests 2.31.0
responses 0.18.0
rouge-score 0.1.2
scipy 1.11.1
sentencepiece 0.2.0
setuptools 68.2.2
six 1.16.0
tensorboard 2.16.2
tensorboard-data-server 0.7.2
tokenizers 0.13.3
torch 1.13.1
torchaudio 0.13.1
torchvision 0.14.1
tqdm 4.64.1
transformers 4.28.1
typing_extensions 4.9.0
tzdata 2024.1
urllib3 2.1.0
Werkzeug 3.0.1
wheel 0.41.2
xxhash 3.4.1
yarl 1.9.4
zipp 3.18.1
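Hello, thanks for the detailed information. The script you posted most likely does not use DeepSpeed at run time, and it pins a single GPU, so in effect it is no different from not using accelerate at all. It therefore does not fully match PRO's runtime environment (with DeepSpeed, prepare should require a dataloader to be passed in, which is why our code sets up a placeholder_dataloader). You could instead add the following code in PRO's process_manager.py, directly after accelerator.prepare: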
# wait_for_everyone() is a barrier and returns None, so call it directly rather than testing its result
accelerator.wait_for_everyone()
exit()
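Then check whether the OOM still occurs by the time execution reaches this point, and we can plan the next debugging step from there.
If the bf16 memory footprint is around 50 GB, it indeed should not OOM during the prepare stage. You might also try specifying torch_dtype=torch.bfloat16 directly in AutoModelForCausalLM.from_pretrained().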
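To put the 50 GB figure in context, here is a rough weights-only estimate for different dtypes. It assumes a parameter count of about 13.3B, inferred from the three 4,435,952,640-element partitions in the log, and it ignores activations, gradients, and optimizer state. `BYTES_PER_PARAM` and `weights_gib` are illustrative names, not part of PRO or transformers.

```python
GIB = 2**30

# Storage size of one parameter for each dtype, in bytes
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weights_gib(num_params: int, dtype: str) -> float:
    """Memory needed for the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


# Assumption: parameter count = 3 ranks x partition size reported in the log
NUM_PARAMS = 3 * 4_435_952_640

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weights_gib(NUM_PARAMS, dtype):.1f} GiB")
```

By this estimate the float32 weights alone come to just under 50 GiB and the bfloat16 weights to about 25 GiB, so a standalone load that occupies about 50 GB may indicate that the weights were loaded in float32 rather than bf16. That reading is an inference from the arithmetic, not something the log confirms.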
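I am not very familiar with training code, so thanks for the suggestions. Here is what I found: 1. Adding the code you provided directly in process_manager.py: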
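model, optimizer, _ = self.accelerator.prepare(
    self.model, optimizer, placeholder_dataloader
)
# wait_for_everyone() is a barrier and returns None, so it is called directly instead of inside an `if`
self.accelerator.wait_for_everyone()
print("[info] self.accelerator.wait_for_everyone() passed")
exit()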
Running on 3 GPUs, all 3 GPUs ran out of memory inside prepare():
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 1; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.02 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 0; 79.15 GiB total capacity; 74.40 GiB already allocated; 3.99 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 2; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.03 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2. Setting the precision manually: self.model = AutoModelForCausalLM.from_pretrained(self.model_path, config=self.model_config, torch_dtype=torch.bfloat16) All 3 GPUs still ran out of memory:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 2; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.03 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 1; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.02 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 0; 79.15 GiB total capacity; 74.40 GiB already allocated; 3.99 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
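The amount of memory requested and already allocated is the same as before the change, so precision can probably be ruled out.
3. Disabling do_validation and upgrading deepspeed
I also saw that #69 mentions disabling do_validation; with that disabled it still ran out of memory. I then switched DeepSpeed to ZeRO-3, which reported that the version was too old, so I upgraded deepspeed:
Found existing installation: deepspeed 0.8.1
Uninstalling deepspeed-0.8.1:
Successfully uninstalled deepspeed-0.8.1
Successfully installed deepspeed-0.14.0 pynvml-11.5.0
After the upgrade, initialization no longer fails, and it does not fail after switching back to ZeRO-2 either. So it looks like a deepspeed version problem. Why is that? However, it now runs out of memory in the train loop, and increasing to 8 GPUs did not help: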
0%| | 0/5026 [00:00<?, ?it/s]
Epoch 0 starts
Load training data from ../data/hh_train_len2/train.json
0%| | 1/5026 [01:13<102:21:28, 73.33s/it]Traceback (most recent call last):
File "/share/home/wenqingchen/finLLM/DAMO-ConvAI-main/PRO/train/main.py", line 45, in <module>
model = process_manager.train()
File "/share/home/wenqingchen/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 261, in train
self.compute_loss(model, batch, print_loss)
File "/share/home/wenqingchen/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 111, in compute_loss
self.accelerator.backward(total_loss)
File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/accelerator.py", line 1630, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
self.engine.backward(loss, **kwargs)
File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2051, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 592.00 MiB (GPU 4; 79.15 GiB total capacity; 76.09 GiB already allocated; 269.31 MiB free; 78.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
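For readers following along, the message above contains enough numbers to check the fragmentation hint it prints ("If reserved memory is >> allocated memory ..."). The snippet below is only a minimal sketch: the regex is written against the message format quoted in this thread (other PyTorch versions word it differently), and the helper names `summarize_oom` and `to_gib` are made up for illustration.

```python
import re

# Pattern for the OOM message format quoted in this thread (assumption:
# other PyTorch versions may phrase the message differently).
OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<request>[\d.]+) (?P<request_unit>[MG]iB) "
    r"\(GPU (?P<gpu>\d+); (?P<total>[\d.]+) GiB total capacity; "
    r"(?P<allocated>[\d.]+) GiB already allocated; "
    r"(?P<free>[\d.]+) (?P<free_unit>[MG]iB) free; "
    r"(?P<reserved>[\d.]+) GiB reserved in total by PyTorch\)"
)


def to_gib(value, unit):
    """Convert a number given in MiB or GiB to GiB."""
    return float(value) / 1024 if unit == "MiB" else float(value)


def summarize_oom(message):
    """Extract the memory figures from one OOM message (hypothetical helper)."""
    m = OOM_PATTERN.search(message)
    if m is None:
        raise ValueError("message does not match the expected OOM format")
    allocated = float(m["allocated"])
    reserved = float(m["reserved"])
    return {
        "gpu": int(m["gpu"]),
        "request_gib": to_gib(m["request"], m["request_unit"]),
        "allocated_gib": allocated,
        "reserved_gib": reserved,
        "free_gib": to_gib(m["free"], m["free_unit"]),
        # Memory held by the caching allocator but not backing live tensors.
        "cached_unused_gib": round(reserved - allocated, 2),
    }


message = (
    "CUDA out of memory. Tried to allocate 592.00 MiB (GPU 4; 79.15 GiB total "
    "capacity; 76.09 GiB already allocated; 269.31 MiB free; 78.25 GiB reserved "
    "in total by PyTorch)"
)
summary = summarize_oom(message)
print(summary)
```

Here reserved minus allocated is about 2 GiB against roughly 76 GiB of live allocations, so fragmentation is probably not the main problem and tuning max_split_size_mb is unlikely to be enough by itself; the card appears to be genuinely full.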
The zero3 config is as follows:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 8
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3 # switched to ZeRO-3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
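I originally had CPU offload enabled, but deepspeed raised an error insisting on its official optimizer instead of the torch optimizer used in the PRO code, and I don't know how to change that...
That said, with bs=1, block_size 512, and a 13B model, this configuration shouldn't need much GPU memory. Could some library version still be wrong?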
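As a rough sanity check on that question, the memory taken by model states alone (weights, gradients, optimizer state) can be estimated independently of batch size and block_size. The sketch below uses the per-parameter accounting from the ZeRO paper for mixed-precision Adam (2 + 2 + 12 bytes). It is an estimate under stated assumptions, not a measurement of PRO: the parameter count of 13e9 is nominal, the function name `model_state_gib` is made up, and activations, communication buffers, and the way DeepSpeed wraps the torch AdamW are all ignored.

```python
GIB = 1024 ** 3


def model_state_gib(num_params, num_gpus, zero_stage):
    """Estimate per-GPU model-state memory (GiB) for mixed-precision Adam.

    Accounting per parameter: 2 bytes of 16-bit weights, 2 bytes of 16-bit
    gradients, 12 bytes of fp32 optimizer state (master weights, momentum,
    variance). Activations and temporary buffers are NOT included.
    """
    if num_gpus < 1 or zero_stage not in (0, 1, 2, 3):
        raise ValueError("need num_gpus >= 1 and zero_stage in 0..3")
    weights = 2.0 * num_params
    grads = 2.0 * num_params
    optim = 12.0 * num_params
    if zero_stage >= 1:  # ZeRO-1 partitions the optimizer state
        optim /= num_gpus
    if zero_stage >= 2:  # ZeRO-2 also partitions the gradients
        grads /= num_gpus
    if zero_stage >= 3:  # ZeRO-3 also partitions the weights
        weights /= num_gpus
    return (weights + grads + optim) / GIB


NUM_PARAMS = 13e9  # assumption: nominal size of a "13B" model

for stage, gpus in [(2, 2), (2, 3), (2, 8), (3, 8)]:
    estimate = model_state_gib(NUM_PARAMS, gpus, stage)
    print(f"ZeRO-{stage}, {gpus} GPUs: ~{estimate:.1f} GiB per GPU for model states")
```

Under these assumptions, ZeRO-2 on 2 GPUs needs more than the 79.15 GiB a single card reports for model states alone, whatever the batch size, while 8 GPUs leave headroom on paper. If 8 GPUs still run out of memory, the remainder would have to come from activations, buffers, or how the optimizer state is actually held.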
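Thanks for the detailed description. Here are my replies point by point:
- The optimizer is defined at line 153 of process_manager.py. The versions we used are exactly those listed in requirements.txt in the top-level directory. If another version requires a different optimizer, you can change it right there, but I'm not sure whether that feature (for example, using a different optimizer) is supported by accelerate. (The AdamW we currently use does take up quite a lot of GPU memory.)
- The implementation does not directly use any advanced deepspeed features, so it should not be sensitive to package versions. As long as it runs after the upgrade, that is fine.
- Regarding the setting you mentioned (bs=1, block_size 512, 13B model): I currently have no 8-GPU machine available, so I can't try to reproduce it. You could set ranking_len=1 in the sh script and check whether training runs normally (this is equivalent to SFT). We used a simple rule of thumb before: bs=1, ranking_len=2 and bs=2, ranking_len=1 should need roughly the same resources.
- Sorry, my description of do_validation in #69 was not clear enough. The option itself has nothing to do with GPU memory. When it is enabled, the last of the visible GPUs is chosen to host the reward model used for validation. So, assuming you are on an 8-GPU machine, after disabling do_validation you also need to change --num_processes 7 in the sh script, for example to 8. Changing it only in ds_config.yaml has no effect, because a value given directly on the command line takes precedence. (Of course, after disabling do_validation you can also remove --num_processes 7 from the command and control the number of GPUs through ds_config.yaml.)
- As for prepare succeeding after you upgraded deepspeed, I don't know the reason either. I noticed that the tigerbot-13b-base you are using is based on transformers 4.31.0, and later versions of transformers did change the llama implementation. So you could try upgrading all packages to fairly recent versions. As mentioned above, the PRO code should not be sensitive to specific versions, as long as the packages are compatible with each other.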
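The two rules of thumb in this reply (one GPU set aside for the reward model when do_validation is on, and bs times ranking_len as the per-device workload) can be written down as a small sketch. Both function names are hypothetical and are not taken from the PRO code; the logic just restates the description above.

```python
def training_processes(visible_gpus, do_validation):
    """Value to pass as --num_processes.

    Assumption (from the reply above): with do_validation enabled, the last
    visible GPU is reserved for the reward model, so it cannot train.
    """
    reserved = 1 if do_validation else 0
    if visible_gpus - reserved < 1:
        raise ValueError("not enough visible GPUs for training")
    return visible_gpus - reserved


def sequences_per_device_step(per_device_batch_size, ranking_len):
    """Rough per-device workload: each sample carries ranking_len candidates."""
    return per_device_batch_size * ranking_len


# The original script: CUDA_VISIBLE_DEVICES=4,5,7 with --do_validation.
print(training_processes(visible_gpus=3, do_validation=True))
# An 8-GPU machine after removing --do_validation.
print(training_processes(visible_gpus=8, do_validation=False))
# bs=1, ranking_len=2 versus bs=2, ranking_len=1.
print(sequences_per_device_step(1, 2), sequences_per_device_step(2, 1))
```

This matches the original command, where three visible GPUs and --do_validation were paired with --num_processes 2.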
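Thank you for the follow-up and the suggestions.
1. I later noticed what do_validation actually does. After removing that argument, I set --num_processes to 8 so that the last GPU is used as well.
2. Reducing ranking_len
Following your suggestion, I changed it from 2 to 1 and training runs, with 77G/80G of GPU memory in use. That does not seem reasonable: when I previously pre-trained the same 13B model with block_size 512, per_device_train_batch_size could be set to 64.
After upgrading torch, the same program uses 66G/80G. Usage went down, but setting ranking_len to 2 still runs out of memory.
3. Upgrading libraries
Upgrading transformers did not help, so I upgraded every library, which did not help either.
4. Switching to a smaller model
With a 1.3B model, training runs.
I still can't find the cause. I'll have to see whether I can get access to more GPUs for debugging...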
hello,在 #69 中也已回复您。 经看这段log,OOM出现在accelerator的prepare处,此处是将LLM通过DeepSpeed分发到每张卡上。因看到您在command在设置num_process=2,即2张卡用于训练LLM,所以问题应不是出在训练过程上(因此与batch_size和block_size无关)。您可以尝试以下方法是否有效:
- 尝试用一个空白脚本,只包括使用accelerator.prepare来初始化您的LLM checkpoint,观察是否能复现OOM的报错
- 如果1仍会OOM,说明单卡装不下13B的模型(可能因为zero-2会在每张卡上都放置一份完整的模型参数,比如您设置的模型精度较高,即可能出现OOM),可以尝试改用zero-3和更低精度(可在process_manager.py的init中,直接在from_pretrained里添加dtype)
您好,感谢回复。 我尝试您说的用accelerator.prepare来初始化13B的模型,代码如下(用gpt生成的,不知是否有误):
import os import time os.environ["CUDA_VISIBLE_DEVICES"] = "0" import torch from accelerate import Accelerator from transformers import AutoModelForCausalLM, AutoTokenizer def main(): # Initialize accelerator accelerator = Accelerator() path = "/mnt/data2/finLLM/models/tigerbot-13b-base" # Load tokenizer tokenizer = AutoTokenizer.from_pretrained(path) # Load model model = AutoModelForCausalLM.from_pretrained(path) # Prepare model with accelerator model, _, = accelerator.prepare(model, tokenizer) # Removed the unnecessary unpacking # Free up CUDA memory torch.cuda.empty_cache() # Pause execution for 5 minutes print("Model loaded successfully. Pausing execution for 5 minutes.") time.sleep(300) # 300 seconds = 5 minutes if __name__ == "__main__": main()GPU不会爆OOM,单卡A800占大概50G,另外模型的精度如下: "torch_dtype": "bfloat16", 这个精度似乎是比较常见的?我觉得不是模型过大或者精度较高的原因,之前有在别人的开源训练项目上有用这个模型去训练,也是用zero-2+16位精度,不会出现OOM的问题。 另外,不知道库的版本会不会影响? 我在部署本项目时,有遇到bug,更新了部分库的版本:
Package Version ----------------------- ----------- absl-py 2.1.0 accelerate 0.17.1 aiohttp 3.9.3 aiosignal 1.3.1 annotated-types 0.6.0 async-timeout 4.0.3 attrs 23.2.0 certifi 2024.2.2 charset-normalizer 2.0.4 click 8.1.7 datasets 2.18.0 deepspeed 0.8.1 dill 0.3.6 evaluate 0.4.1 filelock 3.13.1 frozenlist 1.4.1 fsspec 2024.2.0 grpcio 1.62.1 hjson 3.1.0 huggingface-hub 0.21.4 idna 3.4 importlib_metadata 7.0.2 joblib 1.3.2 Markdown 3.6 MarkupSafe 2.1.5 mkl-fft 1.3.8 mkl-random 1.2.4 mkl-service 2.4.0 multidict 6.0.5 multiprocess 0.70.14 ninja 1.11.1.1 nltk 3.8.1 numpy 1.22.2 packaging 24.0 pandas 2.0.3 peft 0.3.0 pillow 10.2.0 pip 23.3.1 protobuf 5.26.0 psutil 5.9.8 py-cpuinfo 9.0.0 pyarrow 15.0.2 pyarrow-hotfix 0.6 pydantic 1.10.9 pydantic_core 2.16.3 python-dateutil 2.9.0.post0 pytz 2024.1 PyYAML 6.0.1 regex 2023.12.25 requests 2.31.0 responses 0.18.0 rouge-score 0.1.2 scipy 1.11.1 sentencepiece 0.2.0 setuptools 68.2.2 six 1.16.0 tensorboard 2.16.2 tensorboard-data-server 0.7.2 tokenizers 0.13.3 torch 1.13.1 torchaudio 0.13.1 torchvision 0.14.1 tqdm 4.64.1 transformers 4.28.1 typing_extensions 4.9.0 tzdata 2024.1 urllib3 2.1.0 Werkzeug 3.0.1 wheel 0.41.2 xxhash 3.4.1 yarl 1.9.4 zipp 3.18.1您好,感谢您提供的详细信息。您回复的这段代码在运行时应该没有使用DeepSpeed,且指定1张显卡,这样会与直接不使用accelerate在效果上没有区别,所以和PRO的运行环境还不完全一样(因为使用DeepSpeed的话,应该会要求prepare时必须传入dataloader,这也是我们在代码里设置了一个placeholder_dataloader的原因)。您可尝试在PRO的代码process_manager.py中,于accelerator.prepare后直接添加如下代码:
if accelerator. wait_for_everyone(): exit()并观察运行至此时是否还会OOM,据此再考虑下一步debug计划。 如果bf16的显存占用是50G左右,确实不应在prepare阶段就OOM。您或可考虑在AutoModelForCausalLM.from_pretrained()中直接指定torch_dtype=torch.bfloat16试一下?
对训练代码不太懂,感谢您的建议,尝试结果如下: 1、直接在process_manager.py中添加您提供的代码:
model, optimizer, _ = self.accelerator.prepare( self.model, optimizer, placeholder_dataloader ) if self.accelerator.wait_for_everyone(): print("[info] self.accelerator.wait_for_everyone() True") exit()在3个GPU上运行,结果3个GPU都在prepare()爆了
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 1; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.02 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 0; 79.15 GiB total capacity; 74.40 GiB already allocated; 3.99 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 2; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.03 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF2、手动设置精度: self.model = AutoModelForCausalLM.from_pretrained(self.model_path,config=self.model_config, torch_dtype=torch.bfloat16) 3张卡还是爆了:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 2; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.03 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 1; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.02 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 0; 79.15 GiB total capacity; 74.40 GiB already allocated; 3.99 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF超出的显存大小跟设置前一样,应该能排除精度这一因素。 3、关闭do_validation & 更新deepspeed 另外看到#69有提到说关闭do_validation,爆了,将deepspeed改为zero3,显示版本过低,更新deepspeed: Found existing installation: deepspeed 0.8.1 Uninstalling deepspeed-0.8.1: Successfully uninstalled deepspeed-0.8.1 Successfully installed deepspeed-0.14.0 pynvml-11.5.0 更新了之后初始化不会报错了,换回zero2也不会,看样子是deepspeed版本的问题,这是为什么? 但是在train loop爆了,将gpu增加到8张,还是爆了:
0%| | 0/5026 [00:00<?, ?it/s] Epoch 0 starts Load training data from ../data/hh_train_len2/train.json 0%| | 1/5026 [01:13<102:21:28, 73.33s/it]Traceback (most recent call last): File "/share/home/wenqingchen/finLLM/DAMO-ConvAI-main/PRO/train/main.py", line 45, in <module> model = process_manager.train() File "/share/home/wenqingchen/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 261, in train self.compute_loss(model, batch, print_loss) File "/share/home/wenqingchen/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 111, in compute_loss self.accelerator.backward(total_loss) File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/accelerator.py", line 1630, in backward self.deepspeed_engine_wrapped.backward(loss, **kwargs) File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 167, in backward self.engine.backward(loss, **kwargs) File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1976, in backward self.optimizer.backward(loss, retain_graph=retain_graph) File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2051, in backward self.loss_scaler.backward(loss.float(), retain_graph=retain_graph) File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward scaled_loss.backward(retain_graph=retain_graph) File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward torch.autograd.backward( File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward 
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 592.00 MiB (GPU 4; 79.15 GiB total capacity; 76.09 GiB already allocated; 269.31 MiB free; 78.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONFzero3配置如下:
compute_environment: LOCAL_MACHINE deepspeed_config: gradient_accumulation_steps: 8 gradient_clipping: 1.0 offload_optimizer_device: none offload_param_device: none zero3_init_flag: true zero3_save_16bit_model: true zero_stage: 3 # 改为使用Zero3 distributed_type: DEEPSPEED downcast_bf16: 'no' dynamo_backend: 'NO' fsdp_config: {} machine_rank: 0 main_training_function: main megatron_lm_config: {} mixed_precision: bf16 num_machines: 1 num_processes: 8 rdzv_backend: static same_network: true use_cpu: false原本是上了cpu offload的,但是deepspeed报错强制要求使用官方的优化器,而不是pro代码里的torch优化器,但我不知道怎么改... 不过话说回来,bs=1,block_size 512,模型是13B,这个配置对显存要求应该不高呀,是不是还有什么库的版本不对?
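Thanks for the detailed description. Here are my replies, point by point:
- The optimizer is defined at line 153 of process_manager.py. The versions we use are exactly the ones listed in requirements.txt in the top-level directory. If a different version requires changing the optimizer, you can change it right there, but I'm not sure whether that (for example, using a different optimizer) is supported by accelerate. (The AdamW we currently use is indeed fairly memory-hungry.)
- The implementation does not directly use any advanced DeepSpeed features, so it should not be sensitive to package versions; anything that runs successfully after upgrading is fine.
- Regarding the setup you mentioned (bs=1, block_size 512, 13B model): I currently have no 8-GPU machine available, so I can't try to reproduce it. You could set ranking_len=1 in the sh script and see whether training runs normally (this is equivalent to SFT). We used a simple rule of thumb before: for example, bs=1, ranking_len=2 and bs=2, ranking_len=1 should need about the same resources.
- Sorry, my description of do_validation in pro #69 was not clear enough. The option itself has nothing to do with GPU memory; when it is enabled, the last of the visible GPUs is chosen to host the reward model for validation. So, assuming you are on an 8-GPU machine, after turning off do_validation you also need to change --num_processes 7 in the sh script, for example to 8. Changing it only in ds_config.yaml has no effect, because a value given directly on the command line takes precedence (of course, after turning off do_validation you can also delete --num_processes 7 from the command and then control the number of GPUs through ds_config.yaml).
- As for prepare passing once deepspeed was upgraded, I don't really know the cause either. I noticed that the tigerbot-13b-base you use is based on transformers 4.31.0, and later transformers versions did change the llama implementation. So you might try upgrading all packages to fairly recent versions; as noted above, the PRO implementation should not be sensitive to specific versions, as long as the packages are compatible with each other.

Thank you for the follow-up and the suggestions.
1. I later noticed what do_validation actually does. After removing that flag, I set --num_processes to 8 so that the last GPU is used as well.
2. Reducing ranking_len
Following your suggestion I changed it from 2 to 1, and it runs, with 77G/80G of GPU memory in use. That doesn't seem reasonable: when I previously pretrained the same 13B model with block_size 512, per_device_train_batch_size could be set to 64.
After upgrading torch and running the same program, usage is 66G/80G, which is lower, but setting ranking_len to 2 still OOMs.
3. Upgrading libraries
Upgrading transformers did not help, so I updated every library; that did not help either.
4. Switching to a smaller model
With a small 1.3B model, it runs.
I still can't find the cause; next I'll see whether I can get more GPUs to debug with...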
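A rough way to sanity-check these numbers is the standard ZeRO accounting for mixed-precision Adam: 2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, and 12 for the fp32 master weights plus Adam momentum and variance. The sketch below applies that accounting per ZeRO stage. It is only an estimate under stated assumptions (exactly 13e9 parameters, no activations, communication buffers, or fragmentation), and the function name is made up for illustration.

```python
def zero_model_state_gb(n_params: float, n_gpus: int, stage: int) -> float:
    """Approximate per-GPU model-state memory in GB (1e9 bytes).

    Assumes mixed-precision training with Adam/AdamW. Activations,
    temporary buffers, and allocator fragmentation are NOT included.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    weights = 2.0 * n_params   # 16-bit weights
    grads = 2.0 * n_params     # 16-bit gradients
    optim = 12.0 * n_params    # fp32 master weights + momentum + variance
    if stage >= 1:             # ZeRO-1 shards optimizer state
        optim /= n_gpus
    if stage >= 2:             # ZeRO-2 also shards gradients
        grads /= n_gpus
    if stage >= 3:             # ZeRO-3 also shards the weights
        weights /= n_gpus
    return (weights + grads + optim) / 1e9


# Hypothetical setting mirroring this thread: about 13B parameters, 8 GPUs.
for stage in (0, 1, 2, 3):
    print(f"ZeRO-{stage}: {zero_model_state_gb(13e9, 8, stage):.2f} GB per GPU")
```

Under these assumptions, 8 GPUs at stage 2 need about 48.75 GB per GPU for model state alone, and stage 3 about 26 GB. With 3 GPUs at stage 2 the same estimate gives about 86.7 GB, which already exceeds an 80 GB card before any activations are counted.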
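Thank you as well for the active feedback on your side~
One thing I'm curious about myself: I was quite surprised that per_device_train_batch_size=64 can be set that high. Was that under a PEFT setup, or with a quantized model?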
Oh, that run used LoRA.
Got it. Still, being able to use 64 even with LoRA is surprising; maybe it was because you were using quite a few GPUs, haha.
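To illustrate why LoRA changes the memory picture so much: a rank-r adapter on a d_out x d_in weight adds only r * (d_in + d_out) trainable parameters, and only those need gradients and Adam state. The numbers below are hypothetical (hidden size 5120, 40 layers, rank 8, two adapted square projections per layer) and are not taken from the run discussed in this thread.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical 13B-class configuration; every value here is an assumption.
HIDDEN = 5120          # hidden size
LAYERS = 40            # number of transformer layers
RANK = 8               # LoRA rank
ADAPTED_PER_LAYER = 2  # e.g. two square attention projections per layer

trainable = LAYERS * ADAPTED_PER_LAYER * lora_param_count(HIDDEN, HIDDEN, RANK)

# fp32 master weights + Adam momentum + variance: 12 bytes per trainable parameter.
lora_optim_mb = 12 * trainable / 1e6
full_optim_gb = 12 * 13e9 / 1e9

print(f"trainable LoRA parameters: {trainable:,}")
print(f"optimizer state with LoRA: {lora_optim_mb:.1f} MB")
print(f"optimizer state for full fine-tuning of 13e9 params: {full_optim_gb:.0f} GB")
```

With optimizer state reduced from the order of a hundred GB to tens of MB, most of the card is left for activations, which is one plausible reason a much larger batch size fits.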
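That run was on 6 GPUs. Also, when you have a moment, could you check your inbox? I have a few questions I'd like to ask you and have sent you an email.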