qwen-image-edit slow as hell on ROCm, around 1/2 the speed compared to ComfyUI
Getting around 134s/it with sd.cpp, while with the same settings in ComfyUI I get 64s/it.
I know it's slow in general, but something seems wrong here.
For comparison, plain qwen-image: 39s/it with sd.cpp vs 24s/it with ComfyUI. Also slower, but edit is a horror...
Maybe it's my old iron (GCN5), but maybe someone else could compare on different GPUs (CDNA, RDNA).
Actually I'm using a Q3_K_S model. Larger models don't fit into my 16GB VRAM; they only work in ComfyUI with partial loading, which is not possible with sd.cpp.
Did you enable --diffusion-fa?
No, since flash attention doesn't run on my GPU; with FA I guess it would need around 1000s/it. Same as in ComfyUI: no flash attention, no sage attention. Using the default sub-quad attention option is fine; with my actual ComfyUI workflow it's even faster.
There is something going on with the input image scaling: by default, ComfyUI scales the input image to about 1024x1024, which is slow at inference and results in weird output. https://github.com/comfyanonymous/ComfyUI/pull/10239/commits/82cece459473cd09244abaf1ecb9e130d13ddf83
Someone wrote a replacement QwenImageEdit text-encode node where I can set the scaling width to match the empty latent image resolution. https://huggingface.co/Phr00t/Qwen-Image-Edit-Rapid-AIO/blob/main/fixed-textencode-node/nodes_qwen.py
Let's say the input is 832x1248: the default node scales it toward 1024 while the latent stays 832x1248 = 64s/it, a weirdly cropped image, and sometimes Qwen hallucinates something in the free space, like a new unknown person in a group photo.
New node, input 832x1248: the node scales to 832 and the latent stays 832x1248 = 45s/it, an exact uncropped edit.
The gained speed alone is nice, and there are reddit posts about this and about not using the VAE input but rather feeding the image latent in directly. https://www.reddit.com/r/StableDiffusion/comments/1muiozf/pay_attention_to_qwenimageedits_workflow_to/
Maybe something similar is going on here; I didn't look into the sd.cpp qwen-image-edit code, so I don't know how the image is scaled, or if it is scaled at all. But the formula to scale the input image should be: fit it in one dimension and let the other follow the aspect ratio, without padding.
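As a rough sketch of the two behaviours (my own illustration, not the actual node code; the function names are made up):

```python
import math

def scale_to_megapixel(w, h, target_area=1024 * 1024):
    # Roughly what the default ComfyUI node does: rescale so the total
    # pixel count is about 1024x1024, keeping the aspect ratio.
    s = math.sqrt(target_area / (w * h))
    return round(w * s), round(h * s)

def scale_to_fit(w, h, latent_w, latent_h):
    # Roughly what the replacement node does: fit the image to the latent
    # in one dimension and let the other follow the aspect ratio, no padding.
    s = min(latent_w / w, latent_h / h)
    return round(w * s), round(h * s)

print(scale_to_megapixel(832, 1248))       # ~(836, 1254): a few pixels off the 832x1248 latent
print(scale_to_fit(832, 1248, 832, 1248))  # (832, 1248): matches the latent exactly
```

Even a few pixels of mismatch between the reference conditioning and the latent seem to be enough to cause the cropping/hallucination artifacts.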
I hope someone understands what I mean; I tested the whole day in ComfyUI with different input resolutions. I know sd.cpp is not ComfyUI, but how the input + scaling + conditioning is managed is similar.
ComfyUI uses PyTorch's scaled_dot_product_attention by default, which automatically dispatches to Flash Attention internally when it is available. For more details, refer to: https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
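A minimal sketch of what that dispatch looks like (arbitrary shapes; the sdpa_kernel context manager requires a reasonably recent PyTorch):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Arbitrary example shapes: (batch, heads, sequence_length, head_dim).
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

# By default, PyTorch dispatches to the fastest backend available on the
# current GPU: Flash Attention, memory-efficient attention, or plain math.
out = F.scaled_dot_product_attention(q, k, v)

# Restricting the dispatch makes it visible: this raises an error if the
# Flash Attention backend is not supported on the current hardware.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```

So on hardware where Flash Attention is unavailable (like gfx900), ComfyUI silently falls back to another backend instead of failing.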
For the reference image used in Qwen Image Edit, sd.cpp automatically scales it to roughly match the size of the input image, with a maximum size of about 1024×1024. This helps avoid various issues such as truncation or incorrect padding.
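Roughly, as an illustration only (this is not the actual sd.cpp code, just a sketch of the behaviour described above):

```python
import math

def scale_reference(ref_w, ref_h, in_w, in_h, max_area=1024 * 1024):
    # Illustration only: scale the reference toward the input image size,
    # keeping the aspect ratio, and clamp the total area to roughly
    # 1024x1024 so the reference never exceeds ~1 megapixel.
    s = min(in_w / ref_w, in_h / ref_h)
    if (ref_w * s) * (ref_h * s) > max_area:
        s = math.sqrt(max_area / (ref_w * ref_h))
    return round(ref_w * s), round(ref_h * s)
```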
> No, since flash attention doesn't run on my GPU; with FA I guess it would need around 1000s/it.
What issues did you encounter when using sd.cpp with Flash Attention?
Black image, and speed slow as hell: it runs, but it takes forever to finish and the result is black. It's a GCN5 gfx900 thing, I guess, since there's no matmul acceleration and no WMMA.