
Always getting a CUDA Out Of Memory error when training LoRA, despite adjusting the batch_size


Here are my logs:

```
STARTING JOB WITH CONFIG:
adaptive_loss_weight: null
allow_tf32: true
backup_every: 1000
batch_size: 4
bucketeer_random_ratio: 0.05
captions_getter: null
checkpoint_extension: safetensors
checkpoint_path: output
clip_image_model_name: openai/clip-vit-large-patch14
clip_text_model_name: laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
dataset_filters: null
dist_file_subfolder: ''
dtype: null
effnet_checkpoint_path: models/effnet_encoder.safetensors
ema_beta: null
ema_iters: null
ema_start_iters: null
experiment_id: stage_c_3b_lora
generator_checkpoint_path: models/stage_c_bf16.safetensors
grad_accum_steps: 4
image_size: 768
lora_checkpoint_path: null
lr: 0.0001
model_version: 3.6B
module_filters:
  - .attn
multi_aspect_ratio:
  - 1/1
  - 1/2
  - 1/3
  - 2/3
  - 3/4
  - 1/5
  - 2/5
  - 3/5
  - 4/5
  - 1/6
  - 5/6
  - 9/16
output_path: output
previewer_checkpoint_path: models/previewer.safetensors
rank: 4
save_every: 100
train_tokens:
  - - '[fernando]'
    - ^dog
training: true
updates: 10000
use_fsdp: false
wandb_entity: quocanh34
wandb_project: StableCascade
warmup_updates: 1
webdataset_path: file:data/fernando.tar
```

```
INFO:
adaptive_loss: null
ema_loss: null
iter: 0
total_steps: 0
train_tokens: null
wandb_run_id: 7spfifem
```


```
['transforms', 'clip_preprocess', 'gdf', 'sampling_configs', 'effnet_preprocess']
Training with batch size 4 (1/GPU)
['dataset', 'dataloader', 'iterator']
DATA:
dataloader: DataLoader
dataset: WebDataset
iterator: Bucketeer
training: NoneType
```


```
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
/usr/local/lib/python3.10/dist-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:07<00:00,  3.88s/it]
Updating tokens: [(49408, '[fernando]')]
LoRA training 128 layers
['tokenizer', 'text_model', 'generator', 'effnet', 'previewer', 'lora']
MODELS:
effnet: EfficientNetEncoder - trainable params 0
generator: StageC - trainable params 3592249360
generator_ema: NoneType - Not a nn.Module
image_model: CLIPVisionModelWithProjection - trainable params 0
lora: ModuleDict - trainable params 3147008
previewer: Previewer - trainable params 0
text_model: CLIPTextModelWithProjection - trainable params 1280
tokenizer: CLIPTokenizerFast - Not a nn.Module
training: NoneType - Not a nn.Module
```
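For scale, a rough back-of-envelope sketch (my own estimate, not something the repo prints): even though LoRA leaves only ~3.1M parameters trainable, the whole frozen 3.6B Stage C generator still has to sit in GPU memory, and its bf16 weights alone claim a large slice of a 24 GB card before activations at batch size 4 / 768 px are counted.

```python
# Back-of-envelope estimate only; assumes the Stage C weights stay resident in bf16 (2 bytes/param).
generator_params = 3_592_249_360   # "generator: StageC - trainable params 3592249360" from the MODELS summary
lora_params = 3_147_008            # "lora: ModuleDict - trainable params 3147008"

weights_gib = generator_params * 2 / 1024**3
print(f"Stage C weights alone: ~{weights_gib:.1f} GiB")                    # ~6.7 GiB
print(f"LoRA share of parameters: {lora_params / generator_params:.2%}")  # ~0.09%
# CLIP/EfficientNet/previewer weights, activations at 768 px, LoRA gradients and optimizer
# state all come on top of this, which is what pushes a 24 GB card over the edge.
```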


```
['lora']
OPTIMIZERS:
generator: NoneType
lora: AdamW
training: NoneType
```


```
[]
SCHEDULERS:
lora: GradualWarmupScheduler
training: NoneType
```


```
['transforms', 'clip_preprocess', 'gdf', 'sampling_configs', 'effnet_preprocess']
['transforms', 'clip_preprocess', 'gdf', 'sampling_configs', 'effnet_preprocess']
EXTRAS:
clip_preprocess: Compose(
    Resize(size=224, interpolation=bicubic, max_size=None, antialias=warn)
    CenterCrop(size=(224, 224))
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)
effnet_preprocess: Compose(
    Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))
)
gdf: <gdf.GDF object at 0x7f40dd905d80>
sampling_configs: {'cfg': 5, 'sampler': <gdf.samplers.DDPMSampler object at 0x7f40dd907430>, 'shift': 1, 'timesteps': 20}
training: None
transforms: Compose(
    ToTensor()
    Resize(size=768, interpolation=bilinear, max_size=None, antialias=True)
    SmartCrop(
      (saliency_model): MicroResNet(
        (downsampler): Sequential(
          (0): ReflectionPad2d((4, 4, 4, 4))
          (1): Conv2d(3, 8, kernel_size=(9, 9), stride=(4, 4))
          (2): InstanceNorm2d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
          (3): ReLU()
          (4): ReflectionPad2d((1, 1, 1, 1))
          (5): Conv2d(8, 16, kernel_size=(3, 3), stride=(2, 2))
          (6): InstanceNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
          (7): ReLU()
          (8): ReflectionPad2d((1, 1, 1, 1))
          (9): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2))
          (10): InstanceNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
          (11): ReLU()
        )
        (residual): Sequential(
          (0): ResBlock(
            (resblock): Sequential(
              (0): ReflectionPad2d((1, 1, 1, 1))
              (1): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
              (2): InstanceNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
              (3): ReLU()
              (4): ReflectionPad2d((1, 1, 1, 1))
              (5): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
              (6): InstanceNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
            )
          )
          (1): Conv2d(32, 64, kernel_size=(1, 1), stride=(1, 1), groups=32, bias=False)
          (2): ResBlock(
            (resblock): Sequential(
              (0): ReflectionPad2d((1, 1, 1, 1))
              (1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
              (2): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
              (3): ReLU()
              (4): ReflectionPad2d((1, 1, 1, 1))
              (5): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
              (6): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
            )
          )
        )
        (segmentator): Sequential(
          (0): ReflectionPad2d((1, 1, 1, 1))
          (1): Conv2d(64, 16, kernel_size=(3, 3), stride=(1, 1))
          (2): InstanceNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
          (3): ReLU()
          (4): Upsample2d()
          (5): ReflectionPad2d((4, 4, 4, 4))
          (6): Conv2d(16, 1, kernel_size=(9, 9), stride=(1, 1))
          (7): Sigmoid()
        )
      )
    )
)
```


```
TRAINING STARTING...
STARTING AT STEP: 1/40000
  0%|          | 0/40000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/StableCascade/train/train_c_lora.py", line 332, in <module>
    warpcore(single_gpu=True)
  File "/workspace/StableCascade/./core/__init__.py", line 360, in __call__
    self.train(data, extras, models, optimizers, schedulers)
  File "/workspace/StableCascade/./train/base.py", line 254, in train
    loss, loss_adjusted = self.forward_pass(data, extras, models)
  File "/workspace/StableCascade/train/train_c_lora.py", line 275, in forward_pass
    pred = models.generator(noised, noise_cond, **conditions)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/StableCascade/./modules/stage_c.py", line 244, in forward
    level_outputs = self._down_encode(x, r_embed, clip, cnet)
  File "/workspace/StableCascade/./modules/stage_c.py", line 186, in _down_encode
    x = block(x, clip)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/StableCascade/./modules/common.py", line 85, in forward
    x = x + self.attention(self.norm(x), kv, self_attn=self.self_attn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/StableCascade/./modules/common.py", line 23, in forward
    x = self.attn(x, kv, kv, need_weights=False)[0]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 1243, in forward
    self.in_proj_weight, self.in_proj_bias,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/utils/parametrize.py", line 369, in get_parametrized
    return parametrization()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/utils/parametrize.py", line 266, in forward
    x = self[0](self.original)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/StableCascade/./modules/lora.py", line 20, in forward
    return original_weights + lora_weights
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 48.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 47.19 MiB is free. Process 1577312 has 23.59 GiB memory in use. Of the allocated memory 23.01 GiB is allocated by PyTorch, and 124.59 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
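The allocator hint at the end of the message can be tried, but it only addresses fragmentation; it cannot reclaim memory the 3.6B model genuinely needs. A minimal sketch of what that would look like (the 256 MiB split size is an arbitrary example value, not something from the repo):

```python
# Sketch of the PYTORCH_CUDA_ALLOC_CONF hint the error message suggests.
# Shell equivalent: PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:256 python train/train_c_lora.py <config>
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256"  # must be set before CUDA is initialized

import torch
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.2f} GiB total")  # matches the 23.65 GiB reported above
```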

apluka34 commented on Feb 16, 2024

Per this: https://github.com/Stability-AI/StableCascade/issues/26

You likely need to downsize to the 1B model. Update your config:

```yaml
model_version: 1B
```

and

```yaml
generator_checkpoint_path: models/stage_c_lite_bf16.safetensors
```
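If it helps to sanity-check before relaunching, here is a small sketch (my own, assuming both checkpoints have already been downloaded to models/) that sums the tensor sizes in a checkpoint, so you can compare the weight footprint of stage_c_lite_bf16.safetensors against stage_c_bf16.safetensors:

```python
# Hedged sanity check, not part of the training scripts: report how much memory a
# checkpoint's weights would occupy once loaded, by summing tensor sizes on CPU.
from safetensors import safe_open

def checkpoint_gib(path: str) -> float:
    total = 0
    with safe_open(path, framework="pt", device="cpu") as f:
        for name in f.keys():
            t = f.get_tensor(name)
            total += t.numel() * t.element_size()
    return total / 1024**3

for ckpt in ["models/stage_c_bf16.safetensors", "models/stage_c_lite_bf16.safetensors"]:
    print(ckpt, f"~{checkpoint_gib(ckpt):.1f} GiB of weights")
```

The lite checkpoint should leave several more GiB free for activations at 768 px, which is usually what brings a 24 GB card back under the limit.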

asutermo commented on Feb 16, 2024

Thanks mate

apluka34 commented on Feb 16, 2024