Request: reduce memory usage for text2img
Is it possible to keep memory usage from spiking to ~15 GB when doing text2img? I'm currently following this guide and using the default cat prompt with leejet's q4_k flux schnell model; I see the same behaviour with his q2_k model. The guide's link to the VAE safetensors is inaccessible for me since I'm not part of flux-dev, so I used the official black-forest-labs VAE weights instead.
Memory spikes up to ~15 GB before settling at around 4-6 GB.
Using the --vae-tiling flag lowers the spike to 12.95 GB, but I'm not aware of any other options to further reduce memory consumption.
For Metal with q2_k, I still see the 12.95 GB memory spike in Activity Monitor.
[DEBUG] ggml_extend.hpp:1029 - clip params backend buffer size = 235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:1029 - t5 params backend buffer size = 9083.77 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:1029 - flux params backend buffer size = 3732.51 MB(VRAM) (776 tensors)
[DEBUG] ggml_extend.hpp:1029 - vae params backend buffer size = 94.57 MB(VRAM) (138 tensors)
...
[INFO ] stable-diffusion.cpp:486 - total params memory size = 13145.92MB (VRAM 3827.08MB, RAM 9318.83MB): clip 9318.83MB(RAM), unet 3732.51MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
Similarly for CPU with q2_k:
[DEBUG] ggml_extend.hpp:1029 - clip params backend buffer size = 235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:1029 - t5 params backend buffer size = 9083.77 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:1029 - flux params backend buffer size = 3732.51 MB(RAM) (776 tensors)
[DEBUG] ggml_extend.hpp:1029 - vae params backend buffer size = 94.57 MB(RAM) (138 tensors)
...
[INFO ] stable-diffusion.cpp:486 - total params memory size = 13145.92MB (VRAM 0.00MB, RAM 13145.92MB): clip 9318.83MB(RAM), unet 3732.51MB(RAM), vae 94.57MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
It would be great if memory usage topped out under 8 GB, thanks!
EDIT: More information:
- I'm on commit 8847114abfd900898e78d0257f5f9086f2473601
  Date: Sun Aug 25 22:39:39 2024 +0800
  fix: fix issue when applying lora
- I built stable-diffusion.cpp with:
  cmake -G Ninja -DSD_METAL=ON .. && cmake --build .
  (Looks like Release is the default build type.)
- I ran with the sample guide command:
  ./bin/sd --vae ~/work/models/stable-diffusion/diffusion_pytorch_model.safetensors --clip_l ~/work/models/stable-diffusion/clip_l.safetensors --t5xxl ~/work/models/stable-diffusion/t5xxl_fp16.safetensors -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v --diffusion-model ~/work/models/stable-diffusion/flux1-schnell-q4_k.gguf
- My machine: Mac Studio M2 Ultra with 24 CPU cores and 64 GB unified RAM on Sonoma 14.6.1 (23G93).
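For the 12.95 GB --vae-tiling runs mentioned at the top, the only change was appending that flag to the same command, roughly:
  ./bin/sd --vae ~/work/models/stable-diffusion/diffusion_pytorch_model.safetensors --clip_l ~/work/models/stable-diffusion/clip_l.safetensors --t5xxl ~/work/models/stable-diffusion/t5xxl_fp16.safetensors -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v --diffusion-model ~/work/models/stable-diffusion/flux1-schnell-q4_k.gguf --vae-tiling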
I used a q8 version of the clip that @Green-Sky uploaded on Hugging Face. I'm not sure how this affects quality, but it lowered RAM usage when loading the model.
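If anyone wants to try the same, it should just be a matter of pointing the text-encoder flags at the quantized files instead of the fp16 safetensors. A sketch with hypothetical filenames, assuming sd.cpp accepts quantized gguf files for these flags (the t5xxl encoder is the ~9 GB RAM buffer in the logs above, so that is where a q8 quant would save the most):
  ./bin/sd --vae ~/work/models/stable-diffusion/diffusion_pytorch_model.safetensors --clip_l ~/work/models/stable-diffusion/clip_l_q8_0.gguf --t5xxl ~/work/models/stable-diffusion/t5xxl_q8_0.gguf -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v --diffusion-model ~/work/models/stable-diffusion/flux1-schnell-q4_k.gguf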
Since the text encoders are running on the CPU, the actual VRAM used in the log you posted is less than 4 GB.
Sorry, I meant to clarify: is it possible to reduce CPU RAM usage? VRAM usage is definitely under 4 GB, but RAM usage is quite high. Even with video memory sharing some of the load, RAM usage is still over 8 GB.
I think there are a couple of things here.
- The model is loaded into RAM and then copied from RAM to VRAM, so for a moment it is held twice in memory on the device.
- sd.cpp uses im2col to convert convolutions into matmul computations. This is very space-inefficient: for a K×K kernel, the im2col buffer holds roughly K² copies of each input element (about 9× the activation for a 3×3 kernel). Without looking at the actual code, I have read that it can result in 80% more (compute) memory usage.
Since the text encoders are running on the CPU, the actual VRAM used in the log you posted is less than 4 GB.
I want to know: can I set this up myself so that CLIP and T5 run in VRAM?