Add Vulkan backend
issue: https://github.com/leejet/stable-diffusion.cpp/issues/256
It looks like they're making some changes to Vulkan shader generation in the ggml repo, and it's currently broken. I'll keep an eye on it and update the PR accordingly.
Related issue: https://github.com/ggerganov/llama.cpp/issues/5356
(I'm new to this, so I might have made some mistakes. I would be grateful for any guidance or feedback.)
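For anyone who wants to try this out, a rough build sketch (this assumes the Vulkan toggle is a CMake option named SD_VULKAN and that the Vulkan SDK with glslc is installed; adjust if the PR names it differently):
# clone with the ggml submodule, then configure with the Vulkan option (SD_VULKAN is an assumption)
git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp && mkdir build && cd build
cmake .. -DSD_VULKAN=ON
cmake --build . --config Release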
Hey, nice to see someone working on this. I'd like to get this to work. There are probably some ops that need to be supported by Vulkan upstream, right? I can help with that.
@0cc4m Thanks for offering help.
Currently the hpp file generated by ggml_vk_generate_shaders.py does not have symbols like mul_mat_vec_id_q3_k_f32_len, div_f32_len, etc.
Also, some symbols were renamed, e.g. dequant_q5_k_len is referenced in ggml/src/ggml-vulkan.cpp but the header file has dequant_q5_K_len.
I'm assuming these issues will be solved by your work in llama.cpp? Please correct me if I'm wrong.
Also, let me know if I can help with anything.
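To make the mismatch concrete, one quick way to check which symbols the generator actually emits (the generated header name ggml-vulkan-shaders.hpp and the script location are assumptions; they may differ in the ggml revision used here):
# regenerate the shader header (requires glslc in PATH), then look for the symbols in question
python3 ggml_vk_generate_shaders.py
grep -c "dequant_q5_k_len" ggml-vulkan-shaders.hpp   # lowercase variant referenced by ggml-vulkan.cpp
grep -c "dequant_q5_K_len" ggml-vulkan-shaders.hpp   # uppercase-K variant the header actually contains
grep -c "mul_mat_vec_id_q3_k_f32_len" ggml-vulkan-shaders.hpp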
@0cc4m Thanks for offering help.
Currently the hpp file generated by ggml_vk_generate_shaders.py does not have symbols like mul_mat_vec_id_q3_k_f32_len, div_f32_len, etc. Also, some symbols were renamed, e.g. dequant_q5_k_len is referenced in ggml/src/ggml-vulkan.cpp but the header file has dequant_q5_K_len. I'm assuming these issues will be solved by your work in llama.cpp? Please correct me if I'm wrong.
Also, let me know if I can help with anything.
It is working in Llama.cpp. I'll take a look at the status in ggml, maybe that needs an update.
I manually wired up Vulkan and compiled SD.cpp with the latest ggml, patched with llama.cpp's Vulkan changes. It runs and loads a model, but their Vulkan shaders do not implement CONCAT, so it fails.
./sd -m ~/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors --prompt "score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash" -W 1024 -H 1024 -v
Option:
n_threads: 8
mode: txt2img
model_path: /home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors
wtype: unspecified
vae_path:
taesd_path:
esrgan_path:
controlnet_path:
embeddings_path:
stacked_id_embeddings_path:
input_id_images_path:
style ratio: 20.00
normzalize input image : false
output_path: output.png
init_img:
control_image:
clip on cpu: false
controlnet cpu: false
vae decoder on cpu:false
strength(control): 0.90
prompt: score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash
negative_prompt:
min_cfg: 1.00
cfg_scale: 7.00
clip_skip: -1
width: 1024
height: 1024
sample_method: euler_a
schedule: default
sample_steps: 20
strength(img2img): 0.75
rng: cuda
seed: 42
batch_count: 1
vae_tiling: false
upscale_repeats: 1
System Info:
BLAS = 1
SSE3 = 1
AVX = 1
AVX2 = 1
AVX512 = 1
AVX512_VBMI = 1
AVX512_VNNI = 1
FMA = 1
NEON = 0
ARM_FMA = 0
F16C = 1
FP16_VA = 0
WASM_SIMD = 0
VSX = 0
[DEBUG] stable-diffusion.cpp:158 - Using Vulkan backend
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: NVIDIA RTX A4000 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32
[INFO ] stable-diffusion.cpp:178 - loading model from '/home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors'
[INFO ] model.cpp:737 - load /home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors using safetensors format
[DEBUG] model.cpp:803 - init from '/home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors'
[INFO ] stable-diffusion.cpp:201 - Stable Diffusion XL
[INFO ] stable-diffusion.cpp:207 - Stable Diffusion weight type: f16
[DEBUG] stable-diffusion.cpp:208 - ggml tensor size = 400 bytes
[WARN ] stable-diffusion.cpp:213 - !!!It looks like you are using SDXL model. If you find that the generated images are completely black, try specifying SDXL VAE FP16 Fix with the --vae parameter. You can find it here: https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/blob/main/sdxl_vae.safetensors
[DEBUG] ggml_extend.hpp:884 - clip params backend buffer size = 1564.36 MB(VRAM) (713 tensors)
[DEBUG] ggml_extend.hpp:884 - unet params backend buffer size = 4900.07 MB(VRAM) (1680 tensors)
[DEBUG] ggml_extend.hpp:884 - vae params backend buffer size = 94.47 MB(VRAM) (140 tensors)
[DEBUG] stable-diffusion.cpp:309 - loading vocab
[DEBUG] clip.hpp:164 - vocab size: 49408
[DEBUG] clip.hpp:175 - trigger word img already in vocab
[DEBUG] stable-diffusion.cpp:329 - loading weights
[DEBUG] model.cpp:1380 - loading tensors from /home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors
[INFO ] stable-diffusion.cpp:413 - total params memory size = 6558.89MB (VRAM 6558.89MB, RAM 0.00MB): clip 1564.36MB(VRAM), unet 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:432 - loading model from '/home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors' completed, taking 4.34s
[INFO ] stable-diffusion.cpp:449 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:482 - finished loaded file
[DEBUG] stable-diffusion.cpp:1452 - txt2img 1024x1024
[DEBUG] stable-diffusion.cpp:1207 - prompt after extract and remove lora: "score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash"
[INFO ] stable-diffusion.cpp:565 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1212 - apply_loras completed, taking 0.00s
[DEBUG] clip.hpp:1312 - parse 'score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash' to [['score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash', 1], ]
[DEBUG] clip.hpp:1152 - token length: 77
[DEBUG] ggml_extend.hpp:838 - clip compute buffer size: 2.56 MB(VRAM)
ggml_vulkan: Error: Missing op: CONCAT
GGML_ASSERT: /home/david/Desktop/Dev/ggml/stable-diffusion.cpp/ggml/src/ggml-vulkan.cpp:5533: false
Aborted (core dumped)
After adding CONCAT in the relevant place (probably not the proper fix?), it gets a little further but still fails here:
ggml_backend_vk_graph_compute: error: op not supported (view) (UNARY)
GGML_ASSERT: /home/david/Desktop/Dev/ggml/stable-diffusion.cpp/ggml/src/ggml-vulkan.cpp:6227: ok
At this point it's beyond my knowledge/skill.
@Cloudwalk9 Thank you for trying it, I can add the missing ops. Can you upload your progress to a branch that I can access?
@0cc4m Done, but it's pretty crude. I updated the submodule to point to my fork of ggml with the imported Vulkan stuff, and I also had to fix some headers. https://github.com/Cloudwalk9/stable-diffusion.cpp
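For reference, pointing the ggml submodule at a fork looks roughly like this (the submodule path ggml is taken from the repo layout; the fork URL and branch name are placeholders):
# switch the existing ggml submodule checkout to a fork/branch
cd ggml
git remote add fork https://github.com/<your-ggml-fork>
git fetch fork
git checkout fork/<vulkan-branch>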
@0cc4m They just synced the newer Vulkan shader code (split into individual files) from llama.cpp to upstream ggml, so you could probably target ggml directly, instead of my forked submodule.
@0cc4m They just synced the newer Vulkan shader code (split into individual files) from llama.cpp to upstream ggml, so you could probably target ggml directly, instead of my forked submodule.
Yeah, my WIP branch is here: https://github.com/0cc4m/ggml/tree/vulkan-stable-diffusion-ops
I implemented all the ops, but there's still some bug that makes the image not adhere to the prompt. I'll investigate that later.
@0cc4m They just synced the newer Vulkan shader code (split into individual files) from llama.cpp to upstream ggml, so you could probably target ggml directly, instead of my forked submodule.
Yeah, my WIP branch is here: https://github.com/0cc4m/ggml/tree/vulkan-stable-diffusion-ops
I implemented all the ops, but there's still some bug that makes the image not adhere to the prompt. I'll investigate that later.
Great work, thank you!
Some ops still appear to be missing when I try to use a LoRA (res-adapter):
lora.hpp:67 - finished loaded lora
lora.hpp:175 - (18 / 18) LoRA tensors applied successfully
ggml_extend.hpp:841 - lora compute buffer size: 112.85 MB(VRAM)
lora.hpp:175 - (18 / 18) LoRA tensors applied successfully
ggml_vulkan: Error: Missing op: ADD for f16 and f32 to f16
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-vulkan.cpp:4149: fatal error
A different error occurs when I try to use TAESD:
stable-diffusion.cpp:1398 - generating 1 latent images completed, taking 46.07s
stable-diffusion.cpp:1401 - decoding 1 latents
ggml_extend.hpp:841 - taesd compute buffer size: 480.00 MB(VRAM)
ggml_backend_vk_graph_compute: error: op not supported (view) (UNARY)
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-vulkan.cpp:6432: GGML_ASSERT(ok) failed
We're finally about to see Stable Diffusion where the only major dependency is your graphics driver...
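For context, the two failing paths above are hit with invocations along these lines (the --lora-model-dir and --taesd flags and the <lora:...> prompt syntax are assumed from sd.cpp's usage text; file names are placeholders):
# LoRA is referenced from the prompt and loaded from --lora-model-dir
./sd -m model.safetensors --lora-model-dir ./loras -p "a photo of a cat <lora:res-adapter:1>"
# TAESD replaces the VAE for fast latent decoding
./sd -m model.safetensors --taesd taesd.safetensors -p "a photo of a cat"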
@SkutteOleg Thank you, those should be easy to add. I fixed the first bug that caused issues, but I ran into another matmul bug that I have to find in the shader code. I hope I can find it soon.
LORA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.
LORA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.
It is amazing, actually. It's 2.5 times faster than CUDA12 on my end 😲 (perhaps due to lower memory usage, but I'm not sure).
LORA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.
It is amazing, actually. It's 2.5 times faster than CUDA12 on my end 😲 (perhaps due to lower memory usage, but I'm not sure).
On which hardware?
On which hardware?
NVIDIA GeForce GTX 1660 SUPER
EDIT: Also confirmed working reasonably fast on Steam Deck.
It's 2.5 times faster than CUDA12 on my end 😲 (perhaps due to lower memory usage, but I'm not sure)
I had time to do some further testing. Apparently I was comparing the speed to a previous build of sd.cpp. It turns out CUDA12 image generation speed also got faster after the ggml update. Even so, Vulkan is 20% faster. However, I was wrong about memory: it appears that Vulkan uses more memory, as I can no longer fit both llama.cpp and stable-diffusion.cpp on the GPU at the same time.
UPD: I was testing at 512x512 before. When trying 1024x1024, Vulkan is indeed 15% slower for me. Also, at 1024x1024 it produces broken outputs on my hardware.
LORA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.
Excellent work. It works fine for me, tested with an Intel Arc A580.
UPD: I was testing at 512x512 before. When trying 1024x1024, Vulkan is indeed 15% slower for me. Also, at 1024x1024 it produces broken outputs on my hardware.
This is a problem with a very large buffer that sd.cpp requests for VAE decoding (?). I cannot fix that on the Vulkan side, but I am throwing an exception now so that it crashes instead of just generating garbage output. Maybe @leejet can think of a solution? Vulkan has a restriction on how large VRAM buffers can be (usually 4GB), and 1024x1024 VAE decoding requests a buffer larger than that.
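For reference, the relevant device limits can be inspected with vulkaninfo (maxStorageBufferRange and maxMemoryAllocationSize are standard Vulkan properties; which one the backend hits first is an assumption on my part):
# both are commonly around 4 GiB on desktop drivers
vulkaninfo 2>/dev/null | grep -E "maxStorageBufferRange|maxMemoryAllocationSize"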
There should be VAE tiling available, or a fallback to CPU (not exposed as a CLI option, afaik).
This is a problem with a very large buffer that sd.cpp requests for VAE decoding (?). I cannot fix that on the Vulkan side, but I am throwing an exception now so that it crashes instead of just generating garbage output. Maybe @leejet can think of a solution? Vulkan has a restriction on how large VRAM buffers can be (usually 4GB), and 1024x1024 VAE decoding requests a buffer larger than that.
Shouldn't VAE tiling help with that? This occurs for me even with VAE tiling enabled.
Excellent work, well done. Pictures are generated at 384x384 on my Intel i5-1035G1.
Using the --vae-on-cpu option, it will do 512x512 images. I don't understand why the VAE should be such a problem; the compute buffer size is 1.6GB in RAM.
Tried the Vulkan repo from Skuttle:
vulkan sdcpp -> 2.12 it/s
cuda sdcpp -> 3.95 it/s
comfyui -> 1.27 it/s
NVIDIA GTX 1650 Ti mobile, Fedora 40
Nearly identical images, though why are some patches different between CUDA and Vulkan?
This is a problem with a very large buffer that sd.cpp requests for VAE decoding (?). I cannot fix that on the Vulkan side, but I am throwing an exception now so that it crashes instead of just generating garbage output. Maybe @leejet can think of a solution? Vulkan has a restriction on how large VRAM buffers can be (usually 4GB), and 1024x1024 VAE decoding requests a buffer larger than that.
Shouldn't VAE tiling help with that? This occurs for me even with VAE tiling enabled.
It should, and it does in my tests. I can generate 1024x1024 images with SDXL by using --vae-tiling or --vae-on-cpu.
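For anyone running into this, the two workarounds look like this on the command line (the flags match the options shown in the log earlier in the thread; the model file name is a placeholder):
# decode the VAE in tiles to keep individual buffers small
./sd -m sd_xl_base_1.0.safetensors -p "a photo of a cat" -W 1024 -H 1024 --vae-tiling
# or run the VAE decode on the CPU instead
./sd -m sd_xl_base_1.0.safetensors -p "a photo of a cat" -W 1024 -H 1024 --vae-on-cpu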
why are some patches different between CUDA and Vulkan?
There are slight differences in how the CUDA and Vulkan backends calculate; for example, the CUDA backend uses tensor cores for matrix multiplication, while the Vulkan backend (on Nvidia GPUs) uses the regular CUDA cores. That can change the results slightly. There might also be some minor differences in other operations that contribute to that, too.
I tried the img2img mode, but it immediately raises an error: ggml_vulkan: Error: Missing op: PAD
I tried the img2img mode, but it immediately raises an error: ggml_vulkan: Error: Missing op: PAD
Thank you for reporting that, I forgot to check img2img. It should work now.
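For completeness, the img2img path that was failing is exercised with something like this (the -M/--mode, -i/--init-img and --strength flags are assumed from sd.cpp's usage text; file names are placeholders):
# img2img: start from an init image instead of pure noise
./sd -M img2img -m model.safetensors -i input.png -p "a photo of a cat" --strength 0.75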
When trying to load any embedding, I get this missing Vulkan operator:
ggml_vulkan: Error: Missing op: CONCAT for f16 and f16 to f16
When trying to load any embedding, I get this missing Vulkan operator:
ggml_vulkan: Error: Missing op: CONCAT for f16 and f16 to f16
I can implement that, but it's odd considering that f16 CONCAT is not even implemented for CPU or CUDA. Do embeddings work with those?
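For reproducing this, embeddings are loaded from a directory and referenced by file name in the prompt, roughly like this (the --embd-dir flag is assumed from sd.cpp's usage text; names are placeholders):
# the embedding file my_embedding.safetensors is triggered by its name in the prompt
./sd -m model.safetensors --embd-dir ./embeddings -p "a portrait in the style of my_embedding"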