stable-diffusion.cpp [Bug] Slow on gfx1150 with both Vulkan and ROCm builds

Git commit

985aedda32bfd3c3e39d0f6d702483d2ad22a870

Operating System & Version

Arch

GGML backends

Vulkan, HIP

Command-line arguments used

./bin/sd --diffusion-model /data/comfyui/models/diffusion_models/z_image_turbo_bf16.safetensors --vae /data/comfyui/models/vae/ae.safetensors --llm /data/comfyui/models/text_encoders/qwen_3_4b.safetensors -p "A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic" --cfg-scale 1.0 -v --offload-to-cpu -H 512 -W 512 --rng cpu --steps 5 --rng cpu --seed 1061061743296960 --scheduler simple

Steps to reproduce

Run the command above with either the Vulkan or the ROCm build of sd on gfx1150.

What you expected to happen

Get a little over 2s/it, as in ComfyUI (which uses ROCm).

What actually happened

Get 12s/it with Vulkan and 14s/it with ROCm

Logs / error messages / stack trace

System Info: SSE3 = 1 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | VSX = 0 |
SDCliParams { mode: img_gen, output_path: "output.png", verbose: true, color: false, canny_preprocess: false, preview_method: none, preview_interval: 1, preview_path: "preview.png", preview_fps: 16, taesd_preview: false, preview_noisy: false }
SDContextParams { n_threads: 12, model_path: "", clip_l_path: "", clip_g_path: "", clip_vision_path: "", t5xxl_path: "", llm_path: "/data/comfyui/models/text_encoders/qwen_3_4b.safetensors", llm_vision_path: "", diffusion_model_path: "/data/comfyui/models/diffusion_models/z_image_turbo_bf16.safetensors", high_noise_diffusion_model_path: "", vae_path: "/data/comfyui/models/vae/ae.safetensors", taesd_path: "", esrgan_path: "", control_net_path: "", embedding_dir: "", wtype: NONE, tensor_type_rules: "", lora_model_dir: "", photo_maker_path: "", rng_type: cpu, sampler_rng_type: NONE, flow_shift: INF offload_params_to_cpu: true, control_net_cpu: false, clip_on_cpu: false, vae_on_cpu: false, diffusion_flash_attn: false, diffusion_conv_direct: false, vae_conv_direct: false, chroma_use_dit_mask: true, chroma_use_t5_mask: false, chroma_t5_mask_pad: 1, prediction: NONE, lora_apply_mode: auto, vae_tiling_params: { 0, 0, 0, 0.5, 0, 0 }, force_sdxl_vae_conv_scale: false }
SDGenerationParams { prompt: "A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic", negative_prompt: "", clip_skip: -1, width: 512, height: 512, batch_count: 1, init_image_path: "", end_image_path: "", mask_image_path: "", control_image_path: "", ref_image_paths: [], control_video_path: "", auto_resize_ref_image: true, increase_ref_index: false, pm_id_images_dir: "", pm_id_embed_path: "", pm_style_strength: 20, skip_layers: [7, 8, 9], sample_params: (txt_cfg: 1.00, img_cfg: 1.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: simple, sample_method: NONE, sample_steps: 5, eta: 0.00, shifted_timestep: 0), high_noise_skip_layers: [7, 8, 9], high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: 0.00, shifted_timestep: 0), easycache_option: "", easycache: disabled (threshold=0, start=0, end=0), moe_boundary: 0.875, video_frames: 1, fps: 16, vace_strength: 1, strength: 0.75, control_strength: 0.9, seed: 1061061743296960, upscale_repeats: 1, }
[DEBUG] stable-diffusion.cpp:167 - Using Vulkan backend
[DEBUG] ggml_extend.hpp:66 - ggml_vulkan: Found 1 Vulkan devices:
[DEBUG] ggml_extend.hpp:66 - ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV GFX1150) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
[INFO ] stable-diffusion.cpp:234 - loading diffusion model from '/data/comfyui/models/diffusion_models/z_image_turbo_bf16.safetensors'
[INFO ] model.cpp:385 - load /data/comfyui/models/diffusion_models/z_image_turbo_bf16.safetensors using safetensors format
[DEBUG] model.cpp:515 - init from '/data/comfyui/models/diffusion_models/z_image_turbo_bf16.safetensors', prefix = 'model.diffusion_model.'
[INFO ] stable-diffusion.cpp:281 - loading llm from '/data/comfyui/models/text_encoders/qwen_3_4b.safetensors'
[INFO ] model.cpp:385 - load /data/comfyui/models/text_encoders/qwen_3_4b.safetensors using safetensors format
[DEBUG] model.cpp:515 - init from '/data/comfyui/models/text_encoders/qwen_3_4b.safetensors', prefix = 'text_encoders.llm.'
[INFO ] stable-diffusion.cpp:295 - loading vae from '/data/comfyui/models/vae/ae.safetensors'
[INFO ] model.cpp:385 - load /data/comfyui/models/vae/ae.safetensors using safetensors format
[DEBUG] model.cpp:515 - init from '/data/comfyui/models/vae/ae.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:318 - Version: Z-Image
[INFO ] stable-diffusion.cpp:346 - Weight type stat: f32: 1095
[INFO ] stable-diffusion.cpp:347 - Conditioner weight type stat: f32: 398
[INFO ] stable-diffusion.cpp:348 - Diffusion model weight type stat: f32: 453
[INFO ] stable-diffusion.cpp:349 - VAE weight type stat: f32: 244
[DEBUG] stable-diffusion.cpp:351 - ggml tensor size = 400 bytes
[DEBUG] llm.hpp:285 - merges size 151387
[DEBUG] llm.hpp:317 - vocab size: 151665
[DEBUG] ggml_extend.hpp:1873 - qwen3 params backend buffer size = 7672.62 MB(RAM) (398 tensors)
[DEBUG] ggml_extend.hpp:1873 - z_image params backend buffer size = 23479.11 MB(RAM) (453 tensors)
[DEBUG] ggml_extend.hpp:1873 - vae params backend buffer size = 94.57 MB(RAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:683 - loading weights
[DEBUG] model.cpp:1363 - using 12 threads for model loading
[DEBUG] model.cpp:1385 - loading tensors from /data/comfyui/models/diffusion_models/z_image_turbo_bf16.safetensors
|====================> | 453/1095 - 109.84it/s
[DEBUG] model.cpp:1385 - loading tensors from /data/comfyui/models/text_encoders/qwen_3_4b.safetensors
|======================================> | 851/1095 - 116.51it/s
[DEBUG] model.cpp:1385 - loading tensors from /data/comfyui/models/vae/ae.safetensors
|==================================================| 1095/1095 - 145.92it/s
[INFO ] model.cpp:1588 - loading tensors completed, taking 7.50s (process: 0.00s, read: 5.71s, memcpy: 0.00s, convert: 1.05s, copy_to_backend: 0.00s)
[DEBUG] stable-diffusion.cpp:710 - finished loaded file
[INFO ] stable-diffusion.cpp:767 - total params memory size = 31246.31MB (VRAM 31246.31MB, RAM 0.00MB): text_encoders 7672.62MB(VRAM), diffusion_model 23479.11MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:850 - running in FLOW mode
[DEBUG] stable-diffusion.cpp:3146 - generate_image 512x512
[INFO ] stable-diffusion.cpp:3177 - sampling using Euler method
[INFO ] denoiser.hpp:388 - get_sigmas with Simple scheduler
[INFO ] stable-diffusion.cpp:3290 - TXT2IMG
[DEBUG] conditioner.hpp:1701 - parse '<|im_start|>user A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic<|im_end|> <|im_start|>assistant ' to [['<|im_start|>user ', 1], ['A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic', 1], ['<|im_end|> <|im_start|>assistant ', 1], ]
[DEBUG] llm.hpp:259 - split prompt "<|im_start|>user " to tokens ["<|im_start|>", "user", "Ċ", ]
[DEBUG] llm.hpp:259 - split prompt "A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic" to tokens ["A", "Ġcinematic", ",", "Ġmelanch", "olic", "Ġphotograph", "Ġof", "Ġa", "Ġsolitary", "Ġhood", "ed", "Ġfigure", "Ġwalking", "Ġthrough", "Ġa", "Ġsprawling", ",", "Ġrain", "-s", "lick", "ed", "Ġmet", "ropolis", "Ġat", "Ġnight", ".", "ĠThe", "Ġcity", "Ġlights", "Ġare", "Ġa", "Ġchaotic", "Ġblur", "Ġof", "Ġneon", "Ġorange", "Ġand", "Ġcool", "Ġblue", ",", "Ġreflecting", "Ġon", "Ġthe", "Ġwet", "Ġasphalt", ".", "ĠThe", "Ġscene", "Ġev", "okes", "Ġa", "Ġsense", "Ġof", "Ġbeing", "Ġa", "Ġsingle", "Ġcomponent", "Ġin", "Ġa", "Ġvast", "Ġmachine", ".", "ĠSuper", "im", "posed", "Ġover", "Ġthe", "Ġimage", "Ġin", "Ġa", "Ġsleek", ",", "Ġmodern", ",", "Ġslightly", "Ġglitch", "ed", "Ġfont", "Ġis", "Ġthe", "Ġphilosophical", "Ġquote", ":", "Ġ'", "THE", "ĠCITY", "ĠIS", "ĠA", "ĠC", "IR", "CU", "IT", "ĠBOARD", ",", "ĠAND", "ĠI", "ĠAM", "ĠA", "ĠBRO", "KEN", "ĠTRANS", "IST", "OR", ".'", "Ġ--", "Ġmo", "ody", ",", "Ġatmospheric", ",", "Ġprofound", ",", "Ġdark", "Ġacademic", ]
[DEBUG] llm.hpp:259 - split prompt "<|im_end|> <|im_start|>assistant " to tokens ["<|im_end|>", "Ċ", "<|im_start|>", "assistant", "Ċ", ]
[INFO ] ggml_extend.hpp:1786 - qwen3 offload params (7672.62 MB, 398 tensors) to runtime backend (Vulkan0), taking 1.07s
[DEBUG] ggml_extend.hpp:1688 - qwen3 compute buffer size: 13.34 MB(VRAM)
[DEBUG] conditioner.hpp:1896 - computing condition graph completed, taking 1638 ms
[INFO ] stable-diffusion.cpp:2921 - get_learned_condition completed, taking 1640 ms
[INFO ] stable-diffusion.cpp:3032 - generating image: 1/1 - seed 1061061743296960
[INFO ] ggml_extend.hpp:1786 - z_image offload params (23479.11 MB, 453 tensors) to runtime backend (Vulkan0), taking 2.67s
[DEBUG] ggml_extend.hpp:1688 - z_image compute buffer size: 255.60 MB(VRAM)
|==================================================| 5/5 - 12.12s/it
[INFO ] stable-diffusion.cpp:3074 - sampling completed, taking 60.58s
[INFO ] stable-diffusion.cpp:3085 - generating 1 latent images completed, taking 60.66s
[INFO ] stable-diffusion.cpp:3088 - decoding 1 latents
[INFO ] ggml_extend.hpp:1786 - vae offload params ( 94.57 MB, 138 tensors) to runtime backend (Vulkan0), taking 0.02s
[DEBUG] ggml_extend.hpp:1688 - vae compute buffer size: 2112.25 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2291 - computing vae decode graph completed, taking 9.97s
[INFO ] stable-diffusion.cpp:3098 - latent 1 decoded, taking 10.00s
[INFO ] stable-diffusion.cpp:3102 - decode_first_stage completed, taking 10.00s
[INFO ] stable-diffusion.cpp:3398 - generate_image completed in 72.30s
save result PNG image to 'output.png' (success)
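
To see where the 72.30s goes, the stage timings in the log above can be pulled out with a few lines of Python. This is only a rough sketch: the regex is an assumption based on the log lines in this report (not a documented sd output format), and the function names `stage_seconds` and `stage_shares` are made up for illustration.

```python
import re

# Stage-timing lines copied from the Vulkan log above (512x512, 5 steps).
LOG = """\
[INFO ] stable-diffusion.cpp:2921 - get_learned_condition completed, taking 1640 ms
[INFO ] stable-diffusion.cpp:3074 - sampling completed, taking 60.58s
[INFO ] stable-diffusion.cpp:3102 - decode_first_stage completed, taking 10.00s
[INFO ] stable-diffusion.cpp:3398 - generate_image completed in 72.30s
"""

# Assumed pattern: "- <stage> completed, taking <n>s|<n> ms" or "- <stage> completed in <n>s".
PATTERN = re.compile(r"- (\w+) completed(?:, taking| in) ([\d.]+) ?(ms|s)\b")


def stage_seconds(log: str) -> dict:
    """Map each stage name found in the log to its duration in seconds."""
    stages = {}
    for name, value, unit in PATTERN.findall(log):
        stages[name] = float(value) / 1000 if unit == "ms" else float(value)
    return stages


def stage_shares(stages: dict, total_key: str = "generate_image") -> dict:
    """Return each stage's fraction of the total run time."""
    total = stages[total_key]
    return {name: secs / total for name, secs in stages.items() if name != total_key}


if __name__ == "__main__":
    stages = stage_seconds(LOG)
    for name, share in stage_shares(stages).items():
        print(f"{name}: {stages[name]:.2f}s ({share:.0%} of total)")
```

For this run, sampling accounts for roughly 84% of the total and VAE decode for roughly 14%, so the per-iteration speed is the part worth chasing.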

Additional context / environment details

No response

Dec 05 '25 06:12 bitgamma

Vulkan is far slower on gfx1102 with a self-compiled build, but ROCm runs at normal speeds (comparable to ComfyUI, Invoke, etc.). I'm on whatever ROCm and driver versions Manjaro is currently pushing.

Dec 08 '25 04:12 NoThanksGoAway

Update: I retested with 96c3e64 and the results are very different. They put the ROCm backend ahead, just as noted by @NoThanksGoAway.

ROCm: ~4s/it
Vulkan: ~9s/it

I also tried 1024x1024:

ROCm: 24s/it
Vulkan: 42s/it (didn't finish, OOM)

If I had to guess, I'd say it is thanks to bfbb929 but I am not sure.

Adding --diffusion-fa seems to further close the gap with ComfyUI (1024x1024):

ROCm: ~13.5s/it
ComfyUI: ~9s/it

I wonder if this could be improved further.
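
To put the numbers from this thread side by side, here is a small Python sketch that turns the quoted s/it figures into ratios. The figures are the approximate ones reported above ("a little over 2s/it" is taken as 2.0, which is an assumption), and the names `COMPARISONS` and `ratio` are just for illustration.

```python
# Each entry: (label, slower s/it, faster s/it), using the approximate
# figures quoted in this thread. ComfyUI at 512x512 is assumed to be 2.0.
COMPARISONS = [
    ("ROCm 512x512: 985aedd -> 96c3e64", 14.0, 4.0),
    ("Vulkan 512x512: 985aedd -> 96c3e64", 12.0, 9.0),
    ("ROCm 512x512 (96c3e64) vs ComfyUI", 4.0, 2.0),
    ("ROCm 1024x1024: without -> with --diffusion-fa", 24.0, 13.5),
    ("ROCm 1024x1024 (--diffusion-fa) vs ComfyUI", 13.5, 9.0),
]


def ratio(slower_s_per_it: float, faster_s_per_it: float) -> float:
    """How many times longer one iteration takes in the slower setup."""
    return slower_s_per_it / faster_s_per_it


if __name__ == "__main__":
    for label, slower, faster in COMPARISONS:
        print(f"{label}: {ratio(slower, faster):.2f}x")
```

With these inputs the remaining gap to ComfyUI at 1024x1024 comes out at about 1.5x, down from about 7x at 512x512 in the original report.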

Dec 09 '25 09:12 bitgamma