
Performance Regression: v0.6.0 is significantly slower than v0.4.0 specifically when using z-image-turbo models

Open · austinrick opened this issue 1 month ago · 5 comments

Custom Node Testing

Your question

v0.6.0 is noticeably slower than v0.4.0 when running the KSampler (z-image model).

Test Methodology: To ensure a fair comparison, I conducted the tests using the official example workflow.
Workflow: Official Text-to-Image example (Templates > Image > Z-Image-Turbo Text to Image)
Model: z_image_turbo_bf16.safetensors
Environment: Identical hardware, dependencies, and settings for both versions

Please refer to the logs below:
ComfyUI version 0.6.0-Z_Image.log.txt
ComfyUI version 0.4.0-Z_Image.log.txt

Logs


Other

No response

austinrick · Dec 24 '25 12:12

Are you 100% sure you're using the exact same workflow setup in both tests? If you used a different sampler, it could slow things down, like euler in one and heun in the other. Some samplers are _2s under the hood (roughly 2x as slow per step, since they evaluate the model twice), and some only run the second evaluation for a step or two before switching to single-step.

But if they are identical workflows, then it's probably tied to https://github.com/comfyanonymous/ComfyUI/pull/11344 if it appears to only be affecting Z-Image-Turbo. You're on a 2060, so it might not like doing inference in certain formats. I kind of remember having issues with my 2080 being like half as fast at one of the formats.
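For context on the _2s point, here's a toy sketch (made-up model function, not ComfyUI's real sampler code) showing why a two-stage step costs roughly twice as much as a single-stage one like euler:

```python
# Toy sketch, not ComfyUI's actual sampler code: a two-stage ("_2s") step
# calls the model twice per step, a single-stage step calls it once.
def model(x, sigma):
    # stand-in for the expensive diffusion model call
    return x * 0.99

def euler_step(x, sigma, sigma_next):
    d = model(x, sigma)                    # 1 model evaluation per step
    return x + d * (sigma_next - sigma)

def two_stage_step(x, sigma, sigma_next):
    d1 = model(x, sigma)                   # 1st model evaluation
    x_pred = x + d1 * (sigma_next - sigma)
    d2 = model(x_pred, sigma_next)         # 2nd evaluation -> ~2x step cost
    return x + (d1 + d2) / 2 * (sigma_next - sigma)
```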

RandomGitUser321 · Dec 25 '25 07:12

> Are you 100% sure you're using the exact same workflow setup in both tests? If you used a different sampler, it could slow things down, like euler in one and heun in the other. Some samplers are _2s under the hood (roughly 2x as slow per step, since they evaluate the model twice), and some only run the second evaluation for a step or two before switching to single-step.
>
> But if they are identical workflows, then it's probably tied to #11344 if it appears to only be affecting Z-Image-Turbo. You're on a 2060, so it might not like doing inference in certain formats. I kind of remember having issues with my 2080 being like half as fast at one of the formats.

Hi RandomGitUser321,

Thank you for the reminder.

The workflow I’m using was opened from Templates > Image > Z-Image-Turbo Text to Image. In this workflow, the default sampler_name is res_multistep. When testing v0.4.0 and v0.6.0, I did not change any parameters. I used the same prompt set to generate images (not the default prompt; mine is slightly longer than the official example).

You might want to download ComfyUI_windows_portable_nvidia (v0.4.0) and the latest v0.6.0, then generate images using the same workflow (the official default one should also be fine). By testing and comparing the generation speed side by side, you’ll likely see what I’m referring to.

That said, different hardware environments may produce different results. I appreciate your reminder, and when a newer version is released, I’ll test again using the same workflow for comparison.

For now, I’m keeping v0.4.0 with the Z-Image-Turbo model for text-to-image creation, while using v0.6.0 mainly to test new features. At this point, however, v0.6.0 is too slow to be practical for text-to-image creation with the Z-Image-Turbo model in my workflow.

austinrick · Dec 25 '25 09:12

Can you post screenshots of your task manager GPU usage while inferencing, for both versions? You're already offloading in both cases, but it's possible that one of them is using more shared memory or something, due to the image size, which might mean more shuffling back and forth.
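As a side note, if you want to watch VRAM from Python while the workflow runs, a generic torch sketch like this works (it's not part of ComfyUI, and it won't show the Windows "shared GPU memory" counter, only what the driver and allocator report):

```python
import torch

# Generic torch sketch for watching VRAM during a run; Windows "shared GPU
# memory" (system RAM used as spillover) is not visible from here.
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()   # driver-level free/total bytes
    print(f"free: {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")
    print(f"allocated by this process: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
    print(f"reserved by the caching allocator: {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
```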

RandomGitUser321 · Dec 25 '25 22:12

> Can you post screenshots of your task manager GPU usage while inferencing, for both versions?

Hi RandomGitUser321,

The attached archive contains the complete logs and Task Manager GPU-usage screenshots for Z-Image-Turbo text-to-image runs 1 through 7 on v0.4.0 and v0.6.0.

From the “Prompt executed in ... seconds” values in the logs, you can see that on v0.6.0 the generation time grows longer with each successive run, whereas v0.4.0 maintains stable, consistent generation times.
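For anyone who wants to compare runs the same way, a small script along these lines can pull those timings out of the log files (just a generic sketch; the “Prompt executed in X seconds” line is the format ComfyUI prints in its logs):

```python
import re
import sys

# Pull "Prompt executed in 30.93 seconds" timings out of ComfyUI logs
# so successive runs can be compared side by side.
PATTERN = re.compile(r"Prompt executed in ([\d.]+) seconds")

def run_times(log_path):
    with open(log_path, encoding="utf-8", errors="ignore") as f:
        return [float(m.group(1)) for m in PATTERN.finditer(f.read())]

for path in sys.argv[1:]:
    times = run_times(path)
    print(path, " ".join(f"{t:.2f}s" for t in times))
```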

Below are the Task Manager GPU usage screenshots from the 7th run for v0.4.0 and v0.6.0 respectively. Since the execution time in v0.6.0 is significantly longer, the screenshots are split into Start and End images:

screenshots(task manager GPU usage)_and_log_v0.4.0 20251226-1003.zip screenshots(task manager GPU usage)_and_log_v0.6.0 20251226-0951.zip

ComfyUI v0.4.0 Run#7 with z-image-turbo (Prompt executed in 30.93 seconds)

Image

ComfyUI v0.6.0 Run#7 with z-image-turbo (Prompt executed in 57.48 seconds)

Start Image
End Image

austinrick · Dec 26 '25 02:12

Yeah it looks like it's using roughly the same amounts of VRAM and shared RAM. I'm leaning toward it being an inference precision issue related to the PR I linked earlier:

> But if they are identical workflows, then it's probably tied to #11344 if it appears to only be affecting Z-Image-Turbo. You're on a 2060, so it might not like doing inference in certain formats. I kind of remember having issues with my 2080 being like half as fast at one of the formats.

On v0.6.0, you might want to try this and test each of these options:

Image

I'm not positive it will work, but this would be your best option to see if one of them matches v0.4.0's performance.

Here's an example of testing with the various compute formats on my PC with a 7900xt:

Image

So in my case, s/it is around 15% slower using fp16 instead of bf16. I tested multiple times and the results were roughly the same for each compute format. Default is just bf16 in my case. Note this card's horrible fp32 performance.
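If anyone wants a rough standalone check of how their card handles each compute dtype, a plain torch matmul benchmark like the one below gives a ballpark (nothing ComfyUI-specific; real diffusion inference won't match these numbers exactly, but the relative ordering is usually telling):

```python
import time
import torch

# Rough standalone benchmark: time a big matmul in each compute dtype.
# Real diffusion inference differs, but the relative ordering usually
# reflects how well the card supports each format natively.
assert torch.cuda.is_available(), "meant to be run on the GPU in question"
for dtype in (torch.bfloat16, torch.float16, torch.float32):
    a = torch.randn(4096, 4096, device="cuda", dtype=dtype)
    b = torch.randn(4096, 4096, device="cuda", dtype=dtype)
    for _ in range(3):                      # warmup
        a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(20):
        a @ b
    torch.cuda.synchronize()
    ms = (time.perf_counter() - t0) / 20 * 1e3
    print(f"{dtype}: {ms:.2f} ms per 4096x4096 matmul")
```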

RandomGitUser321 · Dec 26 '25 04:12