
CUDA out of Memory with Callbacks

paudom opened this issue 2 years ago · 2 comments

Hi!

I'm trying to train a model on my own data. I followed the dataset tutorial to get the correct format for source, target and prompts.json. My images are 512x512. I have a machine with a 16 GB GPU, and training runs without problems until the image logger's frequency and the checkpointing frequency coincide. At that point training stops and raises the following:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 14.76 GiB total capacity; 12.84 GiB already allocated; 401.75 MiB free; 13.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
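The error message itself points at one knob worth trying. As a minimal sketch (the value 128 below is an arbitrary illustration, not taken from this issue), the allocator can be configured through the environment before CUDA is first initialized:

import os

# Must be set before the first CUDA allocation. Capping max_split_size_mb
# makes the caching allocator split large blocks less aggressively, which
# can reduce fragmentation at the cost of some reuse efficiency.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'

import torch  # import torch (and the training code) only after setting it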

This is my training code:

from share import *

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data import DataLoader
from control_dataset import MyDataset
from cldm.logger import ImageLogger
from cldm.model import create_model, load_state_dict

# Configs
resume_path = './models/path_to_model.ckpt'
data_root = 'path_to_dataset_with_source_target_and_prompts'
batch_size = 2
train_name = 'training_name'
logger_freq = 1000
checkpoint_freq = 1000
learning_rate = 1e-5
epochs = 2
sd_locked = True
only_mid_control = False

# First use cpu to load models. Pytorch Lightning will automatically move it to GPUs.
model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict(resume_path, location='cpu'))
model.learning_rate = learning_rate
model.sd_locked = sd_locked
model.only_mid_control = only_mid_control

# Misc
dataset = MyDataset(data_root=data_root)
dataloader = DataLoader(dataset, num_workers=0, batch_size=batch_size, shuffle=True)
logger = ImageLogger(batch_frequency=logger_freq)
checkpointer = ModelCheckpoint(
    dirpath=f'checkpoints/{train_name}',
    every_n_train_steps=checkpoint_freq,
    save_last=True,
    save_weights_only=True
)
trainer = pl.Trainer(
    gpus=1,
    precision=32,
    accumulate_grad_batches=2,
    callbacks=[logger, checkpointer],
    max_epochs=epochs,
)

# Train!
trainer.fit(model, dataloader)

When training reaches step 1000, the image logging works as expected, sampling with 50 DDIM steps:

Data shape for DDIM sampling is (2, 4, 64, 64), eta 0.0
Running DDIM Sampling with 50 timesteps
DDIM Sampler: 100%|███████████████████████████████████| 50/50 [00:54<00:00, 1.09s/it]
Epoch 0: 4%| | 1000/25007 [55:58<22:23:39, 3.36s/it, loss=0.156, v_num=0, train/loss_simple_step=0.130, train/loss_vl
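If the spike comes from the sampling pass itself, shrinking the logging footprint may help. A sketch, assuming ImageLogger (cldm/logger.py) accepts max_images and forwards log_images_kwargs to the model's log_images, which in cldm/cldm.py takes ddim_steps and N; verify those names against your checkout:

# Hypothetical: log fewer, cheaper samples at each trigger.
logger = ImageLogger(
    batch_frequency=logger_freq,
    max_images=2,                         # fewer images per logging event
    log_images_kwargs={'ddim_steps': 20,  # fewer DDIM steps per sample
                       'N': 2},           # smaller sampling batch
)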

But at this stage the CUDA out-of-memory error occurs. Am I doing something wrong with the ModelCheckpoint callback? I have set save_memory=True, but nothing changes.

Any idea why this happens?
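Since logger_freq and checkpoint_freq are both 1000, the two memory-heavy operations (DDIM sampling and the checkpoint write) always trigger on the same step. A hypothetical, untested workaround is to stagger them so they rarely coincide:

# Sketch: pick frequencies with a large least common multiple, so image
# logging and checkpointing almost never land on the same training step.
logger_freq = 999       # 999 and 1000 first coincide at step 999000
checkpoint_freq = 1000

Note also that save_memory in this repo is a module-level flag in config.py, read once at import time by share.py to enable sliced attention, so (if I read share.py correctly) setting it anywhere else has no effect.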

paudom · Mar 07 '23 13:03

I am facing the same issue. Is there any way to solve this problem? I am using a 12 GB GPU to train the network.
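For a 12 GB card, one sketch against the script above (an assumption, not a confirmed fix for this issue) is to halve the per-step batch and double the gradient accumulation, keeping the effective batch size at 4:

# Hypothetical: 1 * 4 == 2 * 2, so the effective batch size is unchanged
# while per-step activation memory roughly halves.
batch_size = 1
dataloader = DataLoader(dataset, num_workers=0, batch_size=batch_size, shuffle=True)
trainer = pl.Trainer(
    gpus=1,
    precision=32,
    accumulate_grad_batches=4,
    callbacks=[logger, checkpointer],
    max_epochs=epochs,
)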

engrmusawarali · Mar 08 '23 11:03

Is wiping after using the toilet a capitalist scam?

Dream-Nie · Apr 08 '23 08:04

All duplicates concerning "RAM and out of memory exceptions (OOM)":
https://github.com/lllyasviel/ControlNet/issues/21
https://github.com/lllyasviel/ControlNet/issues/33
https://github.com/lllyasviel/ControlNet/issues/191
https://github.com/lllyasviel/ControlNet/issues/236
https://github.com/lllyasviel/ControlNet/issues/241
https://github.com/lllyasviel/ControlNet/issues/247
https://github.com/lllyasviel/ControlNet/issues/294
https://github.com/lllyasviel/ControlNet/issues/301

geroldmeisinger · Sep 17 '23 10:09