qwen-image-edit slow as hell on ROCm, around 1/2 the speed compared to ComfyUI
Getting around 134s/it with sd.cpp, while with the same settings in ComfyUI I get 64s/it.
I know it's slow in general, but something seems wrong here.
For comparison, plain qwen-image: 39s/it with sd.cpp vs 24s/it with ComfyUI. Also slower, but edit is a horror...
Maybe it's my old iron (GCN5), but maybe someone else could compare on different GPUs (CDNA, RDNA).
Actually I'm using a Q3_K_S model. Larger models don't fit into my 16GB VRAM; they only work in ComfyUI with partial loading, which is not possible with sd.cpp.
Did you enable --diffusion-fa?
No, since flash attention doesn't run on my GPU; with FA I guess it would need around 1000s/it. Same as in ComfyUI: no flash attention, no sage attention. Using the default sub-quad attention option is fine; with my actual ComfyUI workflow it's even faster.
There is something going on with the input image scaling: by default, ComfyUI scales the input image to about 1024x1024, which is slow at inference and results in weird output. https://github.com/comfyanonymous/ComfyUI/pull/10239/commits/82cece459473cd09244abaf1ecb9e130d13ddf83
Someone wrote a replacement QwenImageEdit text-encode node where I can set the scaling width to match the empty latent image resolution. https://huggingface.co/Phr00t/Qwen-Image-Edit-Rapid-AIO/blob/main/fixed-textencode-node/nodes_qwen.py
Let's say the input is 832x1248: the default node scales it toward 1024 while the latent stays 832x1248 = 64s/it, a weirdly cropped image, and sometimes Qwen hallucinates something in the free space, like a new unknown person in a group photo.
New node, input 832x1248: the node scales to 832 and the latent stays 832x1248 = 45s/it, an exact uncropped edit.
The gained speed alone is nice, and there are reddit posts about this and about not using the VAE input but rather feeding the image latent in directly. https://www.reddit.com/r/StableDiffusion/comments/1muiozf/pay_attention_to_qwenimageedits_workflow_to/
Maybe something similar is going on here; I didn't look into the sd.cpp qwen-image-edit code, so I don't know how the image is scaled, or if it is scaled at all. But the formula to scale the input image should be: fit it in one dimension and let the other follow the aspect ratio, without padding.
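As a rough sketch of the two behaviours (my own illustration, not the actual node code; the function names are made up):

```python
import math

def scale_to_megapixel(w, h, target_area=1024 * 1024):
    # Roughly what the default ComfyUI node does: rescale so the total
    # pixel count is about 1024x1024, keeping the aspect ratio.
    s = math.sqrt(target_area / (w * h))
    return round(w * s), round(h * s)

def scale_to_fit(w, h, latent_w, latent_h):
    # Roughly what the replacement node does: fit the image to the latent
    # in one dimension and let the other follow the aspect ratio, no padding.
    s = min(latent_w / w, latent_h / h)
    return round(w * s), round(h * s)

print(scale_to_megapixel(832, 1248))       # ~(836, 1254): a few pixels off the 832x1248 latent
print(scale_to_fit(832, 1248, 832, 1248))  # (832, 1248): matches the latent exactly
```

Even a few pixels of mismatch between the reference conditioning and the latent seem to be enough to cause the cropping/hallucination artifacts.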
I hope someone understands what I mean; I tested the whole day in ComfyUI with different input resolutions. I know sd.cpp is not ComfyUI, but how the input + scaling + conditioning is managed is similar.
ComfyUI uses PyTorch's scaled_dot_product_attention by default, which automatically dispatches to Flash Attention internally when it is available. For more details, refer to: https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
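A minimal sketch of what that dispatch looks like (arbitrary shapes; the sdpa_kernel context manager requires a reasonably recent PyTorch):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Arbitrary example shapes: (batch, heads, sequence_length, head_dim).
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

# By default, PyTorch dispatches to the fastest backend available on the
# current GPU: Flash Attention, memory-efficient attention, or plain math.
out = F.scaled_dot_product_attention(q, k, v)

# Restricting the dispatch makes it visible: this raises an error if the
# Flash Attention backend is not supported on the current hardware.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```

So on hardware where Flash Attention is unavailable (like gfx900), ComfyUI silently falls back to another backend instead of failing.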
For the reference image used in Qwen Image Edit, sd.cpp automatically scales it to roughly match the size of the input image, with a maximum size of about 1024×1024. This helps avoid various issues such as truncation or incorrect padding.
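Roughly, as an illustration only (this is not the actual sd.cpp code, just a sketch of the behaviour described above):

```python
import math

def scale_reference(ref_w, ref_h, in_w, in_h, max_area=1024 * 1024):
    # Illustration only: scale the reference toward the input image size,
    # keeping the aspect ratio, and clamp the total area to roughly
    # 1024x1024 so the reference never exceeds ~1 megapixel.
    s = min(in_w / ref_w, in_h / ref_h)
    if (ref_w * s) * (ref_h * s) > max_area:
        s = math.sqrt(max_area / (ref_w * ref_h))
    return round(ref_w * s), round(ref_h * s)
```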
> No, since flash attention doesn't run on my GPU; with FA I guess it would need around 1000s/it.
What issues did you encounter when using sd.cpp with Flash Attention?
Black image, and speed slow as hell: it runs, but it takes forever to finish and the result is black. It's a GCN5 gfx900 thing, I guess, since there's no matmul acceleration and no WMMA.