[BUG] RuntimeError: CUDA out of memory
Describe the bug
I encountered a strange situation: when I try to finetune a 6B model on a single machine with 8x A100s, I get the following error:
```
RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 79.35 GiB total capacity; 9.78 GiB already allocated; 30.19 MiB free; 9.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
What is strange is that when I finetune this model with 4x A100s on the same machine, with the same config and data, there is no problem.
Could you please help me?
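(For reference, the `max_split_size_mb` hint in the error message refers to the PyTorch caching-allocator config, which can be set through the `PYTORCH_CUDA_ALLOC_CONF` environment variable before the first CUDA allocation. A minimal sketch, with the 128 MB value chosen purely for illustration; as it turns out below, fragmentation was not the problem in this case.)

```python
import os

# Must be set before the first CUDA allocation in this process (value is illustrative).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the allocator config
```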
This is the DeepSpeed config:
conf = {"train_micro_batch_size_per_gpu": 4,
"gradient_accumulation_steps": 2,
"optimizer": {
"type": "Adam",
"params": {
"lr": 1e-5,
# "lr": 1e-6,
"betas": [
0.9,
0.95
],
"eps": 1e-8,
"weight_decay": 5e-4
}
},
# "fp16": {
# "enabled": False,
# },
# "bf16": {
# "enabled": True,
# },
"fp16": {
"enabled": True
},
"zero_optimization": {
"stage": 1,
"offload_optimizer": {
"device": "cpu",
"pin_memory": True
},
"allgather_partitions": True,
"allgather_bucket_size": 2e8,
"overlap_comm": True,
"reduce_scatter": True,
"reduce_bucket_size": 2e8,
"contiguous_gradients": True
},
"steps_per_print": 10
}
System info (please complete the following information):
- GPU count and types: one machine with 8x A100s
- Python version: 3.7.16, PyTorch: 1.12.1, DeepSpeed: 0.8.2
You can use the nvidia-smi command to check the actual memory usage of your GPUs; maybe another program is using some GPU memory?
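If nvidia-smi is not convenient, the same check can be done from PyTorch itself. A minimal sketch (assumes a recent PyTorch where torch.cuda.mem_get_info is available, which is the case for 1.12.1):

```python
import torch

# Report free/total device memory as seen by the CUDA driver for every GPU.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")
```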
I checked the memory, and no other program is using the GPUs. When I use 4x A100s, the logs show that train.py uses about 60 GB of memory on every GPU, but when I use 8x A100s, the logs show that the maximum GPU memory usage is less than 1 GB. It seems that this error appears even before the model is loaded. In theory, using 8x A100s should need less GPU memory per device than using 4x A100s, so I'm very confused by this error.
Hi @hgtttttt, could you provide a simple repro script so we can see how the model is being initialized and test it locally? Thanks.
@hgtttttt could you also provide the full log from when the error occurs so we can examine the stack trace? Thanks.
@jomayeri Thanks for your help. This is the main function in my finetune.py:
```python
def main():
    args = set_args()
    logger = set_logger(args)
    config = ChatGLMConfig.from_pretrained(args.model_dir)
    config.pre_seq_len = args.pre_seq_len
    config.prefix_projection = args.prefix_projection
    model = ChatGLMForConditionalGeneration.from_pretrained(args.model_dir, config=config)
    tokenizer = ChatGLMTokenizer.from_pretrained(args.model_dir)
    model = model.half().cuda()
    if args.prefix_projection:
        model.gradient_checkpointing_enable()
    conf = {"train_micro_batch_size_per_gpu": args.train_batch_size,
            "gradient_accumulation_steps": args.gradient_accumulation_steps,
            "optimizer": {
                "type": "Adam",
                "params": {
                    "lr": 1e-5,
                    # "lr": 1e-6,
                    "betas": [0.9, 0.95],
                    "eps": 1e-8,
                    "weight_decay": 5e-4
                }
            },
            # "fp16": {
            #     "enabled": False,
            # },
            # "bf16": {
            #     "enabled": True,
            # },
            "fp16": {
                "enabled": True
            },
            "zero_optimization": {
                "stage": 1,
                "offload_optimizer": {
                    "device": "cpu",
                    "pin_memory": True
                },
                "allgather_partitions": True,
                "allgather_bucket_size": 2e8,
                "overlap_comm": True,
                "reduce_scatter": True,
                "reduce_bucket_size": 2e8,
                "contiguous_gradients": True
            },
            "steps_per_print": args.log_steps}
    # freeze everything except the prefix encoder
    for name, param in model.named_parameters():
        if not any(nd in name for nd in ["prefix_encoder"]):
            param.requires_grad = False
    print_trainable_parameters(model)
    for name, param in model.named_parameters():
        if param.requires_grad:
            print(name)
    logger.info("start to load model.")
    train_dataset = Seq2SeqDataSet(args.train_path, tokenizer, args.max_len, args.max_src_len, args.prompt_text)
    train_dataloader = DataLoader(train_dataset,
                                  batch_size=conf["train_micro_batch_size_per_gpu"],
                                  sampler=RandomSampler(train_dataset),
                                  collate_fn=coll_fn,
                                  drop_last=True,
                                  num_workers=0)
    model_engine, optimizer, _, _ = deepspeed.initialize(config=conf,
                                                         model=model,
                                                         model_parameters=model.parameters())
    logger.info("load successfully. start to train.")
    model_engine.train()
    global_step = 0
    for i_epoch in range(args.num_train_epochs):
        train_iter = iter(train_dataloader)
        logger.info(f"all global step: {len(train_dataloader) // conf['gradient_accumulation_steps']}")
        for step, batch in enumerate(train_iter):
            input_ids = batch["input_ids"].cuda()
            labels = batch["labels"].cuda()
            outputs = model_engine.forward(input_ids=input_ids, labels=labels, use_cache=False)
            loss = outputs[0]
            if conf["gradient_accumulation_steps"] > 1:
                loss = loss / conf["gradient_accumulation_steps"]
            model_engine.backward(loss)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            if (step + 1) % conf["gradient_accumulation_steps"] == 0:
                model_engine.step()
                global_step += 1
                logger.info(f"update weights: {global_step}")
                if global_step % args.log_steps == 0:
                    logger.info("loss:{}, global_step:{}".format(float(loss.item()), global_step))
        save_dir = f'{args.save_dir}/{args.exp}/{i_epoch}'
        model.save_pretrained(save_dir)
        copy(os.path.join(args.model_dir, "tokenizer_config.json"), os.path.join(save_dir, "tokenizer_config.json"))
        copy(os.path.join(args.model_dir, "ice_text.model"), os.path.join(save_dir, "ice_text.model"))
        logger.info(f"epoch {i_epoch} finished!")
```
This is the full error log:
```
WARNING: There was an error checking the latest version of pip.
Loading checkpoint shards: 100%|██████████| 8/8 [01:19<00:00, 9.91s/it]
(progress-bar output from the other 7 ranks omitted)
Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /work/glm_fine/chatglm-6b and are newly initialized: ['transformer.prefix_encoder.trans.0.bias', 'transformer.prefix_encoder.embedding.weight', 'transformer.prefix_encoder.trans.0.weight', 'transformer.prefix_encoder.trans.2.weight', 'transformer.prefix_encoder.trans.2.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(the same warning is printed once per rank)
Traceback (most recent call last):
  /work/glm_fine/finetuning_pt.py:196
  /usr/local/conda/envs/llm_fine_tune/lib/python3.8/site-packages/torch/nn/modules/module.py:689
      param_applied = fn(param)
  (the torch/nn/modules/module.py:689 frame repeats several more times before the rich traceback output is cut off)
```
This is the output of nvidia-smi:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01    Driver Version: 470.103.01    CUDA Version: 11.4   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:0E:00.0 Off |                    0 |
| N/A   30C    P0    63W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:13:00.0 Off |                    0 |
| N/A   30C    P0    65W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:4B:00.0 Off |                    0 |
| N/A   29C    P0    65W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:51:00.0 Off |                    0 |
| N/A   31C    P0    64W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:92:00.0 Off |                    0 |
| N/A   32C    P0    63W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:98:00.0 Off |                    0 |
| N/A   29C    P0    65W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:CB:00.0 Off |                    0 |
| N/A   29C    P0    65W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:D0:00.0 Off |                    0 |
| N/A   30C    P0    63W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
Thanks for your help!
Hi @hgtttttt, thanks for the code, but could you please provide a version that is runnable via a simple copy and paste? Also, it appears from the stack trace that the failure happens before deepspeed.initialize(), at the line model.half().cuda(). You don't need to manually convert the model to fp16; DeepSpeed will take care of that internally. Try removing that line and running again.
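For concreteness, a minimal sketch of the suggested change against the snippet above (variable names taken from that snippet, not a full repro):

```python
# Load the model in its original precision; do NOT call .half().cuda() here.
model = ChatGLMForConditionalGeneration.from_pretrained(args.model_dir, config=config)
# model = model.half().cuda()   # <- remove this line

# deepspeed.initialize() applies the fp16 setting from the config and places
# the model on the appropriate device for each rank.
model_engine, optimizer, _, _ = deepspeed.initialize(config=conf,
                                                     model=model,
                                                     model_parameters=model.parameters())
```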
@jomayeri Yes, it works! I deleted that line of code and the error disappeared. So maybe there is some conflict when I manually convert the model to fp16, which only shows up when I use 8x A100s. All in all, I'm extremely grateful for your assistance!
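(A possible explanation, not confirmed in this thread: without torch.cuda.set_device(local_rank) before the manual .cuda() call, every rank places the full fp16 model on GPU 0, so 8 ranks overflow a single 80 GB card while 4 ranks still fit; this would match the "30.19 MiB free" on GPU 0 in the error message while the failing process itself had only reserved 9.78 GiB. If the manual placement were kept, pinning each rank to its own device would look roughly like this sketch:)

```python
import os
import torch
import deepspeed

deepspeed.init_distributed()                         # set up the process group
local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # provided by the DeepSpeed launcher
torch.cuda.set_device(local_rank)                    # pin this rank to its own GPU
# model = model.half().cuda()                        # would now land on the right device
```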