CUDA Out of Memory with Callbacks
Hi!
I'm trying to train a model using my own data. I followed the data-preparation tutorial to produce the correct format for source, target, and prompts.json. The images I use are 512x512. I have a machine with a 16GB GPU, and training starts without problems until the image logger's logging frequency and the checkpointing frequency coincide. At that point training stops and raises the following:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 14.76 GiB total capacity; 12.84 GiB already allocated; 401.75 MiB free; 13.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
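Following the error message's own hint, one thing worth trying (an assumption on my part, not a verified fix for this issue) is configuring the allocator's max_split_size_mb before any CUDA allocation happens, e.g. at the very top of the training script:

```python
import os

# Must be set before torch allocates any CUDA memory, so place it
# before the first CUDA call in the script. The 128 MiB value is a
# guess; the error message only suggests trying max_split_size_mb
# to reduce fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```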
This is my training code:
from share import *
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data import DataLoader
from control_dataset import MyDataset
from cldm.logger import ImageLogger
from cldm.model import create_model, load_state_dict
# Configs
resume_path = './models/path_to_model.ckpt'
data_root = 'path_to_dataset_with_source_target_and_prompts'
batch_size = 2
train_name = 'training_name'
logger_freq = 1000
checkpoint_freq = 1000
learning_rate = 1e-5
epochs = 2
sd_locked = True
only_mid_control = False
# First use cpu to load models. Pytorch Lightning will automatically move it to GPUs.
model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict(resume_path, location='cpu'))
model.learning_rate = learning_rate
model.sd_locked = sd_locked
model.only_mid_control = only_mid_control
# Misc
dataset = MyDataset(data_root=data_root)
dataloader = DataLoader(dataset, num_workers=0, batch_size=batch_size, shuffle=True)
logger = ImageLogger(batch_frequency=logger_freq)
checkpointer = ModelCheckpoint(
dirpath=f'checkpoints/{train_name}',
every_n_train_steps=checkpoint_freq,
save_last=True,
save_weights_only=True
)
trainer = pl.Trainer(
gpus=1,
precision=32,
accumulate_grad_batches=2,
callbacks=[logger, checkpointer],
max_epochs=epochs,
)
# Train!
trainer.fit(model, dataloader)
When training reaches step 1000, the image logging works as expected, running DDIM sampling with 50 timesteps:
Data shape for DDIM sampling is (2, 4, 64, 64), eta 0.0
Running DDIM Sampling with 50 timesteps
DDIM Sampler: 100%|███████████████████████████████████| 50/50 [00:54<00:00, 1.09s/it]
Epoch 0: 4%| | 1000/25007 [55:58<22:23:39, 3.36s/it, loss=0.156, v_num=0, train/loss_simple_step=0.130, train/loss_vl
But at this stage the CUDA out of memory error occurs. Am I doing something wrong with the ModelCheckpoint callback?
I have set save_memory=True, but nothing changes.
Any idea why this happens?
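One detail worth checking (just a hypothesis on my part): with logger_freq = 1000 and checkpoint_freq = 1000, both callbacks fire on exactly the same steps, so the DDIM sampling memory and the checkpoint-saving memory peak at the same time. The steps where both fire are the multiples of the least common multiple of the two frequencies, so staggering the frequencies pushes the first collision far into the run:

```python
from math import gcd

def first_collision(logger_freq: int, checkpoint_freq: int) -> int:
    """Step at which the image logger and checkpointer first fire together
    (the least common multiple of the two frequencies)."""
    return logger_freq * checkpoint_freq // gcd(logger_freq, checkpoint_freq)

print(first_collision(1000, 1000))  # both fire at step 1000
print(first_collision(1000, 1001))  # staggered: no collision until step 1001000
```

This does not reduce the peak memory of either callback on its own, but it would avoid paying both peaks in the same step.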
I am facing the same issue. Is there any way to solve this problem? I am using a 12GB GPU to train the network.
All duplicates concerning "RAM and out of memory exceptions (OOM)":
https://github.com/lllyasviel/ControlNet/issues/21
https://github.com/lllyasviel/ControlNet/issues/33
https://github.com/lllyasviel/ControlNet/issues/191
https://github.com/lllyasviel/ControlNet/issues/236
https://github.com/lllyasviel/ControlNet/issues/241
https://github.com/lllyasviel/ControlNet/issues/247
https://github.com/lllyasviel/ControlNet/issues/294
https://github.com/lllyasviel/ControlNet/issues/301