🐛 Describe the bug
I ran BERT from Hugging Face with ZeRO, but got RuntimeError: CUDA error: an illegal memory access was encountered. I found that the problem seems to be caused by initial_scale in config.py.
Traceback (most recent call last):
  File "colossalai/run.py", line 463, in <module>
    train(args)
  File "colossalai/run.py", line 252, in train
    trainer(model,
  File "colossalai/run.py", line 127, in trainer
    engine.backward(loss)
  File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 163, in backward
    ret = self.optimizer.backward(loss)
  File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/colossalai/zero/sharded_optim/sharded_optim_v2.py", line 169, in backward
    self.model.backward(loss)
  File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/colossalai/zero/sharded_model/sharded_model_v2.py", line 233, in backward
    loss.backward()
  File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f9d1dfa2d62 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x1c5f3 (0x7f9d6164f5f3 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f9d61650002 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f9d1df8c314 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0x29adb9 (0x7f9de496cdb9 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xae0c91 (0x7f9de51b2c91 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f9de51b2f92 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x15893b (0x56473bab593b in /home/paulzhang/miniconda3/bin/python)
frame #8: + 0x193141 (0x56473baf0141 in /home/paulzhang/miniconda3/bin/python)
frame #9: + 0x15893b (0x56473bab593b in /home/paulzhang/miniconda3/bin/python)
frame #10: + 0x193141 (0x56473baf0141 in /home/paulzhang/miniconda3/bin/python)
frame #11: + 0x158415 (0x56473bab5415 in /home/paulzhang/miniconda3/bin/python)
frame #12: + 0x15893b (0x56473bab593b in /home/paulzhang/miniconda3/bin/python)
frame #13: + 0x193141 (0x56473baf0141 in /home/paulzhang/miniconda3/bin/python)
frame #14: + 0x1592ac (0x56473bab62ac in /home/paulzhang/miniconda3/bin/python)
frame #15: + 0x158e77 (0x56473bab5e77 in /home/paulzhang/miniconda3/bin/python)
frame #16: + 0x158e60 (0x56473bab5e60 in /home/paulzhang/miniconda3/bin/python)
frame #17: + 0x158e60 (0x56473bab5e60 in /home/paulzhang/miniconda3/bin/python)
frame #18: + 0x176057 (0x56473bad3057 in /home/paulzhang/miniconda3/bin/python)
frame #19: PyDict_SetItemString + 0x61 (0x56473baf43c1 in /home/paulzhang/miniconda3/bin/python)
frame #20: PyImport_Cleanup + 0x9d (0x56473bb32aad in /home/paulzhang/miniconda3/bin/python)
frame #21: Py_FinalizeEx + 0x79 (0x56473bb64a49 in /home/paulzhang/miniconda3/bin/python)
frame #22: Py_RunMain + 0x183 (0x56473bb66893 in /home/paulzhang/miniconda3/bin/python)
frame #23: Py_BytesMain + 0x39 (0x56473bb66ca9 in /home/paulzhang/miniconda3/bin/python)
frame #24: __libc_start_main + 0xf3 (0x7f9e409e50b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: + 0x1e21c7 (0x56473bb3f1c7 in /home/paulzhang/miniconda3/bin/python)
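For context, the initial_scale I mentioned is the starting value of the fp16 dynamic loss scale used by the ZeRO optimizer. A minimal sketch of the kind of config.py entry I mean (assuming the ColossalAI 0.1.x-style zero config dict; the values below are illustrative, not my exact settings):

from colossalai.zero.shard_utils import TensorShardStrategy

# Hypothetical config.py -- illustrative values only.
zero = dict(
    model_config=dict(
        shard_strategy=TensorShardStrategy(),   # same strategy as in the training code
    ),
    optimizer_config=dict(
        initial_scale=2 ** 5,   # starting value of the fp16 dynamic loss scale
    ),
)

Only initial_scale is relevant to the point above; the other keys are just shown for completeness.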
Environment
No response
Could you share your code with me?
This usually occurs because of CUDA out-of-memory.
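Since CUDA kernels run asynchronously, the line in the traceback may not be where the fault actually happened. A quick way to get a more precise traceback (standard PyTorch/CUDA debugging, not ColossalAI-specific) is to force synchronous kernel launches, for example:

import os

# Must be set before the first CUDA operation so kernels launch synchronously
# and the Python traceback points at the op that actually faulted.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

Equivalently, setting CUDA_LAUNCH_BLOCKING=1 in the shell before launching works; it slows training down, so only use it while debugging.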
@ver217 this is my code
import colossalai
import torch
import torch.nn as nn
from colossalai.nn.optimizer import HybridAdam
from colossalai.zero.init_ctx import ZeroInitContext
from colossalai.zero.shard_utils import TensorShardStrategy
from tqdm import tqdm
from transformers import BertConfig, BertForSequenceClassification


def trainer(train_dataloader, args, val_dataloader=None):
    start_epoch = 0
    # Build the model inside ZeroInitContext so parameters are sharded as they are created
    shard_strategy = TensorShardStrategy()
    with ZeroInitContext(target_device=torch.cuda.current_device(),
                         shard_strategy=shard_strategy,
                         shard_param=True):
        config = BertConfig.from_pretrained(args.model_name_or_path, num_labels=200)
        model = BertForSequenceClassification.from_pretrained(args.model_name_or_path, config=config)
    optimizer = HybridAdam(model.parameters(), weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()
    # Initialize the ColossalAI engine
    engine, train_dataloader, val_dataloader, _ = colossalai.initialize(model,
                                                                        optimizer,
                                                                        criterion,
                                                                        train_dataloader,
                                                                        val_dataloader,
                                                                        )
    for epoch in range(start_epoch, args.num_epochs):
        epoch_loss = 0
        train_iter = tqdm(
            train_dataloader, desc=f'Epoch:{epoch + 1}', total=len(train_dataloader))
        engine.train()
        torch.cuda.empty_cache()
        for step, inputs in enumerate(train_iter):
            labels = inputs['labels'].view(-1).to(args.device)
            inputs = {key: inputs[key].to(args.device)
                      for key in inputs.keys() if key not in ['labels']}
            output = engine(inputs['text_input_ids'], attention_mask=inputs['text_mask'])
            loss = engine.criterion(output.logits, labels)
            engine.backward(loss)
            engine.step()
            epoch_loss += loss
            train_iter.set_postfix_str(
                f'loss: {epoch_loss / (step + 1):.4f}')
This usually occurs because of CUDA out-of-memory.
Yes, after enabling ZeRO there seems to be a memory overflow: GPU memory keeps growing until it hits OOM.
Also, after turning on ZeRO, ColossalAI outputs inf during training.
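If it helps to pin down the growth, here is a small sketch for logging GPU memory per step with plain PyTorch counters (the helper name and where you call it are just for illustration):

import torch

def log_cuda_memory(step, every=50):
    # Print allocated / reserved / peak CUDA memory every `every` steps
    # to see whether usage really grows monotonically across steps.
    if step % every == 0:
        alloc = torch.cuda.memory_allocated() / 1024 ** 2
        reserved = torch.cuda.memory_reserved() / 1024 ** 2
        peak = torch.cuda.max_memory_allocated() / 1024 ** 2
        print(f'step {step}: allocated {alloc:.0f} MiB, '
              f'reserved {reserved:.0f} MiB, peak {peak:.0f} MiB')

Calling it right after engine.step() in the loop above should show whether allocated memory climbs every step or only the cached (reserved) memory does.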
I have the same problem, and I'm sure the GPU memory is sufficient.
It means that the CUDA version and the graphics card are not compatible; just replace one of them.
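Not sure that is the cause here, but a quick way to check what the installed PyTorch build actually sees (plain PyTorch calls, nothing ColossalAI-specific):

import torch

# Report the CUDA runtime PyTorch was built against and the visible GPU,
# to rule out an obvious toolkit / driver / hardware mismatch.
print('torch version   :', torch.__version__)
print('built with CUDA :', torch.version.cuda)
print('cuDNN version   :', torch.backends.cudnn.version())
print('device          :', torch.cuda.get_device_name(0))
print('compute capab.  :', torch.cuda.get_device_capability(0))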