🐛 Describe the bug
I ran BERT from Hugging Face with ZeRO, but got RuntimeError: CUDA error: an illegal memory access was encountered. I found that the problem seems to be caused by initial_scale in config.py.
Traceback (most recent call last):
  File "colossalai/run.py", line 463, in <module>
    train(args)
  File "colossalai/run.py", line 252, in train
    trainer(model,
  File "colossalai/run.py", line 127, in trainer
    engine.backward(loss)
  File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 163, in backward
    ret = self.optimizer.backward(loss)
  File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/colossalai/zero/sharded_optim/sharded_optim_v2.py", line 169, in backward
    self.model.backward(loss)
  File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/colossalai/zero/sharded_model/sharded_model_v2.py", line 233, in backward
    loss.backward()
  File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f9d1dfa2d62 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x1c5f3 (0x7f9d6164f5f3 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f9d61650002 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f9d1df8c314 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0x29adb9 (0x7f9de496cdb9 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xae0c91 (0x7f9de51b2c91 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f9de51b2f92 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x15893b (0x56473bab593b in /home/paulzhang/miniconda3/bin/python)
frame #8: + 0x193141 (0x56473baf0141 in /home/paulzhang/miniconda3/bin/python)
frame #9: + 0x15893b (0x56473bab593b in /home/paulzhang/miniconda3/bin/python)
frame #10: + 0x193141 (0x56473baf0141 in /home/paulzhang/miniconda3/bin/python)
frame #11: + 0x158415 (0x56473bab5415 in /home/paulzhang/miniconda3/bin/python)
frame #12: + 0x15893b (0x56473bab593b in /home/paulzhang/miniconda3/bin/python)
frame #13: + 0x193141 (0x56473baf0141 in /home/paulzhang/miniconda3/bin/python)
frame #14: + 0x1592ac (0x56473bab62ac in /home/paulzhang/miniconda3/bin/python)
frame #15: + 0x158e77 (0x56473bab5e77 in /home/paulzhang/miniconda3/bin/python)
frame #16: + 0x158e60 (0x56473bab5e60 in /home/paulzhang/miniconda3/bin/python)
frame #17: + 0x158e60 (0x56473bab5e60 in /home/paulzhang/miniconda3/bin/python)
frame #18: + 0x176057 (0x56473bad3057 in /home/paulzhang/miniconda3/bin/python)
frame #19: PyDict_SetItemString + 0x61 (0x56473baf43c1 in /home/paulzhang/miniconda3/bin/python)
frame #20: PyImport_Cleanup + 0x9d (0x56473bb32aad in /home/paulzhang/miniconda3/bin/python)
frame #21: Py_FinalizeEx + 0x79 (0x56473bb64a49 in /home/paulzhang/miniconda3/bin/python)
frame #22: Py_RunMain + 0x183 (0x56473bb66893 in /home/paulzhang/miniconda3/bin/python)
frame #23: Py_BytesMain + 0x39 (0x56473bb66ca9 in /home/paulzhang/miniconda3/bin/python)
frame #24: __libc_start_main + 0xf3 (0x7f9e409e50b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: + 0x1e21c7 (0x56473bb3f1c7 in /home/paulzhang/miniconda3/bin/python)
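For context, the initial_scale I mentioned is the starting value of the fp16 dynamic loss scale used by the ZeRO optimizer. A minimal sketch of the kind of config.py entry I mean (assuming the ColossalAI 0.1.x-style zero config dict; the values below are illustrative, not my exact settings):

from colossalai.zero.shard_utils import TensorShardStrategy

# Hypothetical config.py -- illustrative values only.
zero = dict(
    model_config=dict(
        shard_strategy=TensorShardStrategy(),   # same strategy as in the training code
    ),
    optimizer_config=dict(
        initial_scale=2 ** 5,   # starting value of the fp16 dynamic loss scale
    ),
)

Only initial_scale is relevant to the point above; the other keys are just shown for completeness.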
Environment
No response
Could you share your code with me?
This usually occurs because of CUDA out-of-memory.
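Since CUDA kernels run asynchronously, the line in the traceback may not be where the fault actually happened. A quick way to get a more precise traceback (standard PyTorch/CUDA debugging, not ColossalAI-specific) is to force synchronous kernel launches, for example:

import os

# Must be set before the first CUDA operation so kernels launch synchronously
# and the Python traceback points at the op that actually faulted.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

Equivalently, setting CUDA_LAUNCH_BLOCKING=1 in the shell before launching works; it slows training down, so only use it while debugging.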
@ver217 this is my code
import colossalai
import torch
import torch.nn as nn
from colossalai.nn.optimizer import HybridAdam
from colossalai.zero.init_ctx import ZeroInitContext
from colossalai.zero.shard_utils import TensorShardStrategy
from tqdm import tqdm
from transformers import BertConfig, BertForSequenceClassification


def trainer(train_dataloader, args, val_dataloader=None):
    start_epoch = 0
    # Build the model inside ZeroInitContext so parameters are sharded as they are created
    shard_strategy = TensorShardStrategy()
    with ZeroInitContext(target_device=torch.cuda.current_device(),
                         shard_strategy=shard_strategy,
                         shard_param=True):
        config = BertConfig.from_pretrained(args.model_name_or_path, num_labels=200)
        model = BertForSequenceClassification.from_pretrained(args.model_name_or_path, config=config)
    optimizer = HybridAdam(model.parameters(), weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()
    # Initialize the ColossalAI engine
    engine, train_dataloader, val_dataloader, _ = colossalai.initialize(model,
                                                                        optimizer,
                                                                        criterion,
                                                                        train_dataloader,
                                                                        val_dataloader,
                                                                        )
    for epoch in range(start_epoch, args.num_epochs):
        epoch_loss = 0
        train_iter = tqdm(
            train_dataloader, desc=f'Epoch:{epoch + 1}', total=len(train_dataloader))
        engine.train()
        torch.cuda.empty_cache()
        for step, inputs in enumerate(train_iter):
            labels = inputs['labels'].view(-1).to(args.device)
            inputs = {key: inputs[key].to(args.device)
                      for key in inputs.keys() if key not in ['labels']}
            output = engine(inputs['text_input_ids'], attention_mask=inputs['text_mask'])
            loss = engine.criterion(output.logits, labels)
            engine.backward(loss)
            engine.step()
            epoch_loss += loss
            train_iter.set_postfix_str(
                f'loss: {epoch_loss / (step + 1):.4f}')
This usually occurs because of CUDA out-of-memory.
Yes, after enabling ZeRO there seems to be a memory overflow: GPU memory keeps growing until it hits OOM.
Also, after turning on ZeRO, ColossalAI outputs inf during training.
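If it helps to pin down the growth, here is a small sketch for logging GPU memory per step with plain PyTorch counters (the helper name and where you call it are just for illustration):

import torch

def log_cuda_memory(step, every=50):
    # Print allocated / reserved / peak CUDA memory every `every` steps
    # to see whether usage really grows monotonically across steps.
    if step % every == 0:
        alloc = torch.cuda.memory_allocated() / 1024 ** 2
        reserved = torch.cuda.memory_reserved() / 1024 ** 2
        peak = torch.cuda.max_memory_allocated() / 1024 ** 2
        print(f'step {step}: allocated {alloc:.0f} MiB, '
              f'reserved {reserved:.0f} MiB, peak {peak:.0f} MiB')

Calling it right after engine.step() in the loop above should show whether allocated memory climbs every step or only the cached (reserved) memory does.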
I have the same problem, and I'm sure the GPU memory is sufficient.
It means that the CUDA version and the graphics card are not compatible; just replace one of them.
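Not sure that is the cause here, but a quick way to check what the installed PyTorch build actually sees (plain PyTorch calls, nothing ColossalAI-specific):

import torch

# Report the CUDA runtime PyTorch was built against and the visible GPU,
# to rule out an obvious toolkit / driver / hardware mismatch.
print('torch version   :', torch.__version__)
print('built with CUDA :', torch.version.cuda)
print('cuDNN version   :', torch.backends.cudnn.version())
print('device          :', torch.cuda.get_device_name(0))
print('compute capab.  :', torch.cuda.get_device_capability(0))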