Add Vulkan backend
issue: https://github.com/leejet/stable-diffusion.cpp/issues/256
It looks like they're making some changes to Vulkan shader generation in the ggml repo, and it's currently broken. I'll keep an eye on it and update the PR accordingly.
Related issue: https://github.com/ggerganov/llama.cpp/issues/5356
(I'm new to this, so I might have made some mistakes. I would be grateful for any guidance or feedback.)
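For anyone who wants to try this out, a rough build sketch (this assumes the Vulkan toggle is a CMake option named SD_VULKAN and that the Vulkan SDK with glslc is installed; adjust if the PR names it differently):
# clone with the ggml submodule, then configure with the Vulkan option (SD_VULKAN is an assumption)
git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp && mkdir build && cd build
cmake .. -DSD_VULKAN=ON
cmake --build . --config Release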
Hey, nice to see someone working on this. I'd like to get this to work. There are probably some ops that need to be supported by Vulkan upstream, right? I can help with that.
@0cc4m Thanks for offering help.
Currently the hpp file generated by ggml_vk_generate_shaders.py does not have symbols like mul_mat_vec_id_q3_k_f32_len, div_f32_len, etc.
Also, some symbols were renamed, e.g. dequant_q5_k_len is referenced in ggml/src/ggml-vulkan.cpp but the header file has dequant_q5_K_len.
I'm assuming these issues will be solved by your work in llama.cpp? Please correct me if I'm wrong.
Also, let me know if I can help with anything.
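To make the mismatch concrete, one quick way to check which symbols the generator actually emits (the generated header name ggml-vulkan-shaders.hpp and the script location are assumptions; they may differ in the ggml revision used here):
# regenerate the shader header (requires glslc in PATH), then look for the symbols in question
python3 ggml_vk_generate_shaders.py
grep -c "dequant_q5_k_len" ggml-vulkan-shaders.hpp   # lowercase variant referenced by ggml-vulkan.cpp
grep -c "dequant_q5_K_len" ggml-vulkan-shaders.hpp   # uppercase-K variant the header actually contains
grep -c "mul_mat_vec_id_q3_k_f32_len" ggml-vulkan-shaders.hpp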
@0cc4m Thanks for offering help.
Currently the hpp file generated by ggml_vk_generate_shaders.py does not have symbols like mul_mat_vec_id_q3_k_f32_len, div_f32_len, etc. Also, some symbols were renamed, e.g. dequant_q5_k_len is referenced in ggml/src/ggml-vulkan.cpp but the header file has dequant_q5_K_len. I'm assuming these issues will be solved by your work in llama.cpp? Please correct me if I'm wrong.
Also, let me know if I can help with anything.
It is working in Llama.cpp. I'll take a look at the status in ggml, maybe that needs an update.
I manually wired up Vulkan and compiled SD.cpp with the latest ggml, patched with llama.cpp's Vulkan changes. It runs and loads a model, but their Vulkan shaders do not implement CONCAT, so it fails.
./sd -m ~/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors --prompt "score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash" -W 1024 -H 1024 -v
Option:
n_threads: 8
mode: txt2img
model_path: /home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors
wtype: unspecified
vae_path:
taesd_path:
esrgan_path:
controlnet_path:
embeddings_path:
stacked_id_embeddings_path:
input_id_images_path:
style ratio: 20.00
normzalize input image : false
output_path: output.png
init_img:
control_image:
clip on cpu: false
controlnet cpu: false
vae decoder on cpu:false
strength(control): 0.90
prompt: score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash
negative_prompt:
min_cfg: 1.00
cfg_scale: 7.00
clip_skip: -1
width: 1024
height: 1024
sample_method: euler_a
schedule: default
sample_steps: 20
strength(img2img): 0.75
rng: cuda
seed: 42
batch_count: 1
vae_tiling: false
upscale_repeats: 1
System Info:
BLAS = 1
SSE3 = 1
AVX = 1
AVX2 = 1
AVX512 = 1
AVX512_VBMI = 1
AVX512_VNNI = 1
FMA = 1
NEON = 0
ARM_FMA = 0
F16C = 1
FP16_VA = 0
WASM_SIMD = 0
VSX = 0
[DEBUG] stable-diffusion.cpp:158 - Using Vulkan backend
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: NVIDIA RTX A4000 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32
[INFO ] stable-diffusion.cpp:178 - loading model from '/home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors'
[INFO ] model.cpp:737 - load /home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors using safetensors format
[DEBUG] model.cpp:803 - init from '/home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors'
[INFO ] stable-diffusion.cpp:201 - Stable Diffusion XL
[INFO ] stable-diffusion.cpp:207 - Stable Diffusion weight type: f16
[DEBUG] stable-diffusion.cpp:208 - ggml tensor size = 400 bytes
[WARN ] stable-diffusion.cpp:213 - !!!It looks like you are using SDXL model. If you find that the generated images are completely black, try specifying SDXL VAE FP16 Fix with the --vae parameter. You can find it here: https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/blob/main/sdxl_vae.safetensors
[DEBUG] ggml_extend.hpp:884 - clip params backend buffer size = 1564.36 MB(VRAM) (713 tensors)
[DEBUG] ggml_extend.hpp:884 - unet params backend buffer size = 4900.07 MB(VRAM) (1680 tensors)
[DEBUG] ggml_extend.hpp:884 - vae params backend buffer size = 94.47 MB(VRAM) (140 tensors)
[DEBUG] stable-diffusion.cpp:309 - loading vocab
[DEBUG] clip.hpp:164 - vocab size: 49408
[DEBUG] clip.hpp:175 - trigger word img already in vocab
[DEBUG] stable-diffusion.cpp:329 - loading weights
[DEBUG] model.cpp:1380 - loading tensors from /home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors
[INFO ] stable-diffusion.cpp:413 - total params memory size = 6558.89MB (VRAM 6558.89MB, RAM 0.00MB): clip 1564.36MB(VRAM), unet 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:432 - loading model from '/home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors' completed, taking 4.34s
[INFO ] stable-diffusion.cpp:449 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:482 - finished loaded file
[DEBUG] stable-diffusion.cpp:1452 - txt2img 1024x1024
[DEBUG] stable-diffusion.cpp:1207 - prompt after extract and remove lora: "score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash"
[INFO ] stable-diffusion.cpp:565 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1212 - apply_loras completed, taking 0.00s
[DEBUG] clip.hpp:1312 - parse 'score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash' to [['score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash', 1], ]
[DEBUG] clip.hpp:1152 - token length: 77
[DEBUG] ggml_extend.hpp:838 - clip compute buffer size: 2.56 MB(VRAM)
ggml_vulkan: Error: Missing op: CONCAT
GGML_ASSERT: /home/david/Desktop/Dev/ggml/stable-diffusion.cpp/ggml/src/ggml-vulkan.cpp:5533: false
Aborted (core dumped)
After adding CONCAT in the relevant place (probably not the proper fix?), it gets a little further but still fails here:
ggml_backend_vk_graph_compute: error: op not supported (view) (UNARY)
GGML_ASSERT: /home/david/Desktop/Dev/ggml/stable-diffusion.cpp/ggml/src/ggml-vulkan.cpp:6227: ok
At this point it's beyond my knowledge/skill.
@Cloudwalk9 Thank you for trying it, I can add the missing ops. Can you upload your progress to a branch that I can access?
@0cc4m Done, but it's pretty crude. I updated the submodule to point to my fork of ggml with the imported Vulkan stuff, and I also had to fix some headers. https://github.com/Cloudwalk9/stable-diffusion.cpp
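For reference, pointing the ggml submodule at a fork looks roughly like this (the submodule path ggml is taken from the repo layout; the fork URL and branch name are placeholders):
# switch the existing ggml submodule checkout to a fork/branch
cd ggml
git remote add fork https://github.com/<your-ggml-fork>
git fetch fork
git checkout fork/<vulkan-branch>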
@0cc4m They just synced the newer Vulkan shader code (split into individual files) from llama.cpp to upstream ggml, so you could probably target ggml directly, instead of my forked submodule.
@0cc4m They just synced the newer Vulkan shader code (split into individual files) from llama.cpp to upstream ggml, so you could probably target ggml directly, instead of my forked submodule.
Yeah, my WIP branch is here: https://github.com/0cc4m/ggml/tree/vulkan-stable-diffusion-ops
I implemented all the ops, but there's still some bug that makes the image not adhere to the prompt. I'll investigate that later.
@0cc4m They just synced the newer Vulkan shader code (split into individual files) from llama.cpp to upstream ggml, so you could probably target ggml directly, instead of my forked submodule.
Yeah, my WIP branch is here: https://github.com/0cc4m/ggml/tree/vulkan-stable-diffusion-ops
I implemented all the ops, but there's still some bug that makes the image not adhere to the prompt. I'll investigate that later.
Great work, thank you!
Some ops still appear to be missing when I try to use a LoRA (res-adapter):
lora.hpp:67 - finished loaded lora
lora.hpp:175 - (18 / 18) LoRA tensors applied successfully
ggml_extend.hpp:841 - lora compute buffer size: 112.85 MB(VRAM)
lora.hpp:175 - (18 / 18) LoRA tensors applied successfully
ggml_vulkan: Error: Missing op: ADD for f16 and f32 to f16
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-vulkan.cpp:4149: fatal error
A different error occurs when I try to use TAESD:
stable-diffusion.cpp:1398 - generating 1 latent images completed, taking 46.07s
stable-diffusion.cpp:1401 - decoding 1 latents
ggml_extend.hpp:841 - taesd compute buffer size: 480.00 MB(VRAM)
ggml_backend_vk_graph_compute: error: op not supported (view) (UNARY)
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-vulkan.cpp:6432: GGML_ASSERT(ok) failed
We're finally about to see Stable Diffusion where the only major dependency is your graphics driver...
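For context, the two failing paths above are hit with invocations along these lines (the --lora-model-dir and --taesd flags and the <lora:...> prompt syntax are assumed from sd.cpp's usage text; file names are placeholders):
# LoRA is referenced from the prompt and loaded from --lora-model-dir
./sd -m model.safetensors --lora-model-dir ./loras -p "a photo of a cat <lora:res-adapter:1>"
# TAESD replaces the VAE for fast latent decoding
./sd -m model.safetensors --taesd taesd.safetensors -p "a photo of a cat"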
@SkutteOleg Thank you, those should be easy to add. I fixed the first bug that caused issues, but I ran into another matmul bug that I have to find in the shader code. I hope I can find it soon.
LORA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.
LORA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.
It is amazing, actually. It's 2.5 times faster than CUDA12 on my end 😲 (perhaps due to lower memory usage, but I'm not sure).
LORA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.
It is amazing, actually. It's 2.5 times faster than CUDA12 on my end 😲 (perhaps due to lower memory usage, but I'm not sure).
On which hardware?
On which hardware?
NVIDIA GeForce GTX 1660 SUPER
EDIT: Also confirmed working reasonably fast on Steam Deck.
It's 2.5 times faster than CUDA12 on my end 😲 (perhaps due to lower memory usage, but I'm not sure)
I had time to do some further testing. Apparently I was comparing the speed to a previous build of sd.cpp. It turns out CUDA12 image generation speed also got faster after the ggml update. Even so, Vulkan is 20% faster. However, I was wrong about memory: it appears that Vulkan uses more memory, as I can no longer fit both llama.cpp and stable-diffusion.cpp on the GPU at the same time.
UPD: I was testing at 512x512 before. When trying 1024x1024, Vulkan is indeed 15% slower for me. Also, at 1024x1024 it produces broken outputs on my hardware.
LORA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.
Excellent work. It works fine for me, tested with an Intel Arc A580.
UPD: I was testing at 512x512 before. When trying 1024x1024, Vulkan is indeed 15% slower for me. Also, at 1024x1024 it produces broken outputs on my hardware.
This is a problem with a very large buffer that sd.cpp requests for VAE decoding (?). I cannot fix that on the Vulkan side, but I am throwing an exception now so that it crashes instead of just generating garbage output. Maybe @leejet can think of a solution? Vulkan has a restriction on how large VRAM buffers can be (usually 4GB), and 1024x1024 VAE decoding requests a buffer larger than that.
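For reference, the relevant device limits can be inspected with vulkaninfo (maxStorageBufferRange and maxMemoryAllocationSize are standard Vulkan properties; which one the backend hits first is an assumption on my part):
# both are commonly around 4 GiB on desktop drivers
vulkaninfo 2>/dev/null | grep -E "maxStorageBufferRange|maxMemoryAllocationSize"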
There should be VAE tiling available, or a fallback to CPU (not exposed as a CLI option, afaik).
This is a problem with a very large buffer that sd.cpp requests for VAE decoding (?). I cannot fix that on the Vulkan side, but I am throwing an exception now so that it crashes instead of just generating garbage output. Maybe @leejet can think of a solution? Vulkan has a restriction on how large VRAM buffers can be (usually 4GB), and 1024x1024 VAE decoding requests a buffer larger than that.
Shouldn't VAE tiling help with that? This occurs for me even with VAE tiling enabled.
Excellent work, well done. Pictures are generated at 384x384 on my Intel i5-1035G1.
Using the --vae-on-cpu option, it will do 512x512 images. I don't understand why the VAE should be such a problem; the compute buffer size is 1.6GB in RAM.
Tried the Vulkan repo from Skuttle:
vulkan sdcpp -> 2.12 it/s
cuda sdcpp -> 3.95 it/s
comfyui -> 1.27 it/s
NVIDIA GTX 1650 Ti mobile, Fedora 40
Nearly identical images, though why are some patches different between CUDA and Vulkan?
This is a problem with a very large buffer that sd.cpp requests for VAE decoding (?). I cannot fix that on the Vulkan side, but I am throwing an exception now so that it crashes instead of just generating garbage output. Maybe @leejet can think of a solution? Vulkan has a restriction on how large VRAM buffers can be (usually 4GB), and 1024x1024 VAE decoding requests a buffer larger than that.
Shouldn't VAE tiling help with that? This occurs for me even with VAE tiling enabled.
It should, and it does in my tests. I can generate 1024x1024 images with SDXL by using --vae-tiling or --vae-on-cpu.
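For anyone running into this, the two workarounds look like this on the command line (the flags match the options shown in the log earlier in the thread; the model file name is a placeholder):
# decode the VAE in tiles to keep individual buffers small
./sd -m sd_xl_base_1.0.safetensors -p "a photo of a cat" -W 1024 -H 1024 --vae-tiling
# or run the VAE decode on the CPU instead
./sd -m sd_xl_base_1.0.safetensors -p "a photo of a cat" -W 1024 -H 1024 --vae-on-cpu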
why are some patches different between CUDA and Vulkan?
There are slight differences in how the CUDA and Vulkan backends calculate; for example, the CUDA backend uses tensor cores for matrix multiplication, while the Vulkan backend (on Nvidia GPUs) uses the regular CUDA cores. That can change the results slightly. There might also be some minor differences in other operations that contribute to that, too.
I tried the img2img mode, but it immediately raises an error: ggml_vulkan: Error: Missing op: PAD
I tried the img2img mode, but it immediately raises an error: ggml_vulkan: Error: Missing op: PAD
Thank you for reporting that, I forgot to check img2img. It should work now.
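For completeness, the img2img path that was failing is exercised with something like this (the -M/--mode, -i/--init-img and --strength flags are assumed from sd.cpp's usage text; file names are placeholders):
# img2img: start from an init image instead of pure noise
./sd -M img2img -m model.safetensors -i input.png -p "a photo of a cat" --strength 0.75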
When trying to load any embedding, I get this missing Vulkan operator:
ggml_vulkan: Error: Missing op: CONCAT for f16 and f16 to f16
When trying to load any embedding, I get this missing Vulkan operator:
ggml_vulkan: Error: Missing op: CONCAT for f16 and f16 to f16
I can implement that, but it's odd considering that f16 CONCAT is not even implemented for CPU or CUDA. Do embeddings work with those?
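For reproducing this, embeddings are loaded from a directory and referenced by file name in the prompt, roughly like this (the --embd-dir flag is assumed from sd.cpp's usage text; names are placeholders):
# the embedding file my_embedding.safetensors is triggered by its name in the prompt
./sd -m model.safetensors --embd-dir ./embeddings -p "a portrait in the style of my_embedding"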