[BUG] RuntimeError: CUDA out of memory
Describe the bug
I encountered a strange situation: when I try to finetune a 6B model on a single machine with 8x A100s, I get the following error:
```
RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 79.35 GiB total capacity; 9.78 GiB already allocated; 30.19 MiB free; 9.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
What is strange is that when I finetune this model with 4x A100s on the same machine, with the same config and data, there is no problem.
Could you please help me?
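(For reference, the `max_split_size_mb` hint in the error message refers to the PyTorch caching-allocator config, which can be set through the `PYTORCH_CUDA_ALLOC_CONF` environment variable before the first CUDA allocation. A minimal sketch, with the 128 MB value chosen purely for illustration; as it turns out below, fragmentation was not the problem in this case.)

```python
import os

# Must be set before the first CUDA allocation in this process (value is illustrative).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the allocator config
```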
This is the DeepSpeed config:
conf = {"train_micro_batch_size_per_gpu": 4,
"gradient_accumulation_steps": 2,
"optimizer": {
"type": "Adam",
"params": {
"lr": 1e-5,
# "lr": 1e-6,
"betas": [
0.9,
0.95
],
"eps": 1e-8,
"weight_decay": 5e-4
}
},
# "fp16": {
# "enabled": False,
# },
# "bf16": {
# "enabled": True,
# },
"fp16": {
"enabled": True
},
"zero_optimization": {
"stage": 1,
"offload_optimizer": {
"device": "cpu",
"pin_memory": True
},
"allgather_partitions": True,
"allgather_bucket_size": 2e8,
"overlap_comm": True,
"reduce_scatter": True,
"reduce_bucket_size": 2e8,
"contiguous_gradients": True
},
"steps_per_print": 10
}
System info (please complete the following information):
- GPU count and types: one machine with 8x A100s
- Python version: 3.7.16, PyTorch: 1.12.1, DeepSpeed: 0.8.2
You can use the nvidia-smi command to check the actual memory usage of your GPUs; maybe another program is using some GPU memory?
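If nvidia-smi is not convenient, the same check can be done from PyTorch itself. A minimal sketch (assumes a recent PyTorch where torch.cuda.mem_get_info is available, which is the case for 1.12.1):

```python
import torch

# Report free/total device memory as seen by the CUDA driver for every GPU.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")
```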
I checked the memory, and no other program is using the GPUs. When I use 4x A100s, the logs show that train.py uses about 60 GB of memory on every GPU, but when I use 8x A100s, the logs show that the maximum GPU memory usage is less than 1 GB. It seems that this error appears even before the model is loaded. In theory, using 8x A100s should need less GPU memory per device than using 4x A100s, so I'm very confused by this error.
Hi @hgtttttt, could you provide a simple repro script so we can see how the model is being initialized and test it locally? Thanks.
@hgtttttt could you also provide the full log from when the error occurs so we can examine the stack trace? Thanks.
@jomayeri Thanks for your help. This is the main function in my finetune.py:
```python
def main():
    args = set_args()
    logger = set_logger(args)
    config = ChatGLMConfig.from_pretrained(args.model_dir)
    config.pre_seq_len = args.pre_seq_len
    config.prefix_projection = args.prefix_projection
    model = ChatGLMForConditionalGeneration.from_pretrained(args.model_dir, config=config)
    tokenizer = ChatGLMTokenizer.from_pretrained(args.model_dir)
    model = model.half().cuda()
    if args.prefix_projection:
        model.gradient_checkpointing_enable()
    conf = {"train_micro_batch_size_per_gpu": args.train_batch_size,
            "gradient_accumulation_steps": args.gradient_accumulation_steps,
            "optimizer": {
                "type": "Adam",
                "params": {
                    "lr": 1e-5,
                    # "lr": 1e-6,
                    "betas": [0.9, 0.95],
                    "eps": 1e-8,
                    "weight_decay": 5e-4
                }
            },
            # "fp16": {
            #     "enabled": False,
            # },
            # "bf16": {
            #     "enabled": True,
            # },
            "fp16": {
                "enabled": True
            },
            "zero_optimization": {
                "stage": 1,
                "offload_optimizer": {
                    "device": "cpu",
                    "pin_memory": True
                },
                "allgather_partitions": True,
                "allgather_bucket_size": 2e8,
                "overlap_comm": True,
                "reduce_scatter": True,
                "reduce_bucket_size": 2e8,
                "contiguous_gradients": True
            },
            "steps_per_print": args.log_steps}
    # freeze everything except the prefix encoder
    for name, param in model.named_parameters():
        if not any(nd in name for nd in ["prefix_encoder"]):
            param.requires_grad = False
    print_trainable_parameters(model)
    for name, param in model.named_parameters():
        if param.requires_grad:
            print(name)
    logger.info("start to load model.")
    train_dataset = Seq2SeqDataSet(args.train_path, tokenizer, args.max_len, args.max_src_len, args.prompt_text)
    train_dataloader = DataLoader(train_dataset,
                                  batch_size=conf["train_micro_batch_size_per_gpu"],
                                  sampler=RandomSampler(train_dataset),
                                  collate_fn=coll_fn,
                                  drop_last=True,
                                  num_workers=0)
    model_engine, optimizer, _, _ = deepspeed.initialize(config=conf,
                                                         model=model,
                                                         model_parameters=model.parameters())
    logger.info("load successfully. start to train.")
    model_engine.train()
    global_step = 0
    for i_epoch in range(args.num_train_epochs):
        train_iter = iter(train_dataloader)
        logger.info(f"all global step: {len(train_dataloader) // conf['gradient_accumulation_steps']}")
        for step, batch in enumerate(train_iter):
            input_ids = batch["input_ids"].cuda()
            labels = batch["labels"].cuda()
            outputs = model_engine.forward(input_ids=input_ids, labels=labels, use_cache=False)
            loss = outputs[0]
            if conf["gradient_accumulation_steps"] > 1:
                loss = loss / conf["gradient_accumulation_steps"]
            model_engine.backward(loss)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            if (step + 1) % conf["gradient_accumulation_steps"] == 0:
                model_engine.step()
                global_step += 1
                logger.info(f"update weights: {global_step}")
                if global_step % args.log_steps == 0:
                    logger.info("loss:{}, global_step:{}".format(float(loss.item()), global_step))
        save_dir = f'{args.save_dir}/{args.exp}/{i_epoch}'
        model.save_pretrained(save_dir)
        copy(os.path.join(args.model_dir, "tokenizer_config.json"), os.path.join(save_dir, "tokenizer_config.json"))
        copy(os.path.join(args.model_dir, "ice_text.model"), os.path.join(save_dir, "ice_text.model"))
        logger.info(f"epoch {i_epoch} finished!")
```
This is the full error log:
```
WARNING: There was an error checking the latest version of pip.
Loading checkpoint shards: 100%|██████████| 8/8 [01:19<00:00, 9.91s/it]
(progress-bar output from the other 7 ranks omitted)
Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /work/glm_fine/chatglm-6b and are newly initialized: ['transformer.prefix_encoder.trans.0.bias', 'transformer.prefix_encoder.embedding.weight', 'transformer.prefix_encoder.trans.0.weight', 'transformer.prefix_encoder.trans.2.weight', 'transformer.prefix_encoder.trans.2.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(the same warning is printed once per rank)
Traceback (most recent call last):
  /work/glm_fine/finetuning_pt.py:196
  /usr/local/conda/envs/llm_fine_tune/lib/python3.8/site-packages/torch/nn/modules/module.py:689
      param_applied = fn(param)
  (the torch/nn/modules/module.py:689 frame repeats several more times before the rich traceback output is cut off)
```
This is the output of nvidia-smi:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01    Driver Version: 470.103.01    CUDA Version: 11.4   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:0E:00.0 Off |                    0 |
| N/A   30C    P0    63W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:13:00.0 Off |                    0 |
| N/A   30C    P0    65W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:4B:00.0 Off |                    0 |
| N/A   29C    P0    65W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:51:00.0 Off |                    0 |
| N/A   31C    P0    64W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:92:00.0 Off |                    0 |
| N/A   32C    P0    63W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:98:00.0 Off |                    0 |
| N/A   29C    P0    65W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:CB:00.0 Off |                    0 |
| N/A   29C    P0    65W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:D0:00.0 Off |                    0 |
| N/A   30C    P0    63W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
Thanks for your help!
Hi @hgtttttt, thanks for the code, but could you please provide a version that is runnable via a simple copy and paste? Also, it appears from the stack trace that the failure happens before deepspeed.initialize(), at the line model.half().cuda(). You don't need to manually convert the model to fp16; DeepSpeed will take care of that internally. Try removing that line and running again.
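For concreteness, a minimal sketch of the suggested change against the snippet above (variable names taken from that snippet, not a full repro):

```python
# Load the model in its original precision; do NOT call .half().cuda() here.
model = ChatGLMForConditionalGeneration.from_pretrained(args.model_dir, config=config)
# model = model.half().cuda()   # <- remove this line

# deepspeed.initialize() applies the fp16 setting from the config and places
# the model on the appropriate device for each rank.
model_engine, optimizer, _, _ = deepspeed.initialize(config=conf,
                                                     model=model,
                                                     model_parameters=model.parameters())
```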
@jomayeri Yes, it works! I deleted that line of code and the error disappeared. So maybe there is some conflict when I manually convert the model to fp16, which only shows up when I use 8x A100s. All in all, I'm extremely grateful for your assistance!
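(A possible explanation, not confirmed in this thread: without torch.cuda.set_device(local_rank) before the manual .cuda() call, every rank places the full fp16 model on GPU 0, so 8 ranks overflow a single 80 GB card while 4 ranks still fit; this would match the "30.19 MiB free" on GPU 0 in the error message while the failing process itself had only reserved 9.78 GiB. If the manual placement were kept, pinning each rank to its own device would look roughly like this sketch:)

```python
import os
import torch
import deepspeed

deepspeed.init_distributed()                         # set up the process group
local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # provided by the DeepSpeed launcher
torch.cuda.set_device(local_rank)                    # pin this rank to its own GPU
# model = model.half().cuda()                        # would now land on the right device
```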