Performance Regression: v0.6.0 is significantly slower than v0.4.0 specifically when using z-image-turbo models
Custom Node Testing
- [x] I have tried disabling custom nodes and the issue persists (see how to disable custom nodes if you need help)
Your question
v0.6.0 is noticeably slower than v0.4.0 when running the KSampler (z-image model).
Test Methodology: To ensure a fair comparison, I conducted the tests using the official example workflow:
- Workflow: Official Text-to-Image example (Templates > Image > Z-Image-Turbo Text to Image)
- Model: z_image_turbo_bf16.safetensors
- Environment: identical hardware, dependencies, and settings for both versions
Please refer to the log below: ComfyUI version 0.6.0-Z_Image.log.txt ComfyUI version 0.4.0-Z_Image.log.txt
Logs
Other
No response
Are you 100% sure you're using the exact same workflow setup in both tests? If you used a different sampler, it could slow things down, e.g. euler in one and heun in the other. Some samplers are _2s under the hood (~2x as slow per step), and some are _2s for only a step or two before switching to single-step.
But if they are identical workflows, then it's probably tied to https://github.com/comfyanonymous/ComfyUI/pull/11344 if it appears to only be affecting Z-Image-Turbo. You're on a 2060, so it might not like doing inference in certain formats. I kind of remember having issues with my 2080 being like half as fast at one of the formats.
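To illustrate the sampler point above: second-order ("2s"-style) samplers like heun call the model twice per step, while euler calls it once, so identical step counts can still mean ~2x the work. Here's a toy sketch (the `model` here is a stand-in lambda on a trivial ODE, not the ComfyUI denoiser API):

```python
# Toy sketch: count model evaluations per sampling run.
# Names are illustrative only, not ComfyUI internals.

def euler_step(model, x, t, dt, calls):
    """Euler: one model evaluation per step."""
    d = model(x, t); calls[0] += 1
    return x + d * dt

def heun_step(model, x, t, dt, calls):
    """Heun: two model evaluations per step (predictor + corrector)."""
    d1 = model(x, t); calls[0] += 1
    x_pred = x + d1 * dt
    d2 = model(x_pred, t + dt); calls[0] += 1
    return x + 0.5 * (d1 + d2) * dt

def run(step_fn, steps=8):
    calls = [0]
    x, t, dt = 1.0, 0.0, 0.125
    model = lambda x, t: -x  # toy ODE dx/dt = -x as a stand-in denoiser
    for _ in range(steps):
        x = step_fn(model, x, t, dt, calls)
        t += dt
    return calls[0]

print(run(euler_step))  # 8 model calls
print(run(heun_step))   # 16 model calls -> roughly 2x the work per image
```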
> Are you 100% sure you're using the exact same workflow setup in both tests? If you used a different sampler, it could slow things down, e.g. euler in one and heun in the other. Some samplers are _2s under the hood (~2x as slow per step), and some are _2s for only a step or two before switching to single-step.
>
> But if they are identical workflows, then it's probably tied to #11344 if it appears to only be affecting Z-Image-Turbo. You're on a 2060, so it might not like doing inference in certain formats. I kind of remember having issues with my 2080 being like half as fast at one of the formats.
Hi RandomGitUser321,
Thank you for the reminder.
The workflow I’m using was opened from Templates > Image > Z-Image-Turbo Text to Image. In this workflow, the default sampler_name is res_multistep. When testing v0.4.0 and v0.6.0, I did not change any parameters. I used the same prompt set to generate images in both (not the default one; it is slightly longer than the official example).
You might want to download ComfyUI_windows_portable_nvidia (v0.4.0) and the latest v0.6.0, then generate images using the same workflow (the official default one should also be fine). By testing and comparing the generation speed side by side, you’ll likely see what I’m referring to.
That said, different hardware environments may produce different results. I appreciate your reminder, and when a newer version is released, I’ll test again using the same workflow for comparison.
For now, I’m keeping v0.4.0 with the Z-Image-Turbo model for text-to-image creation, while using v0.6.0 mainly to test new features. At this point, however, v0.6.0 is not effective for text-to-image creation with the Z-Image-Turbo model in my workflow.
Can you post screenshots of your task manager GPU usage while inferencing, for both versions? You're already offloading in both cases, but it's possible that one of them is using more shared memory or something, due to the image size, which might mean more shuffling back and forth.
Hi RandomGitUser321,
The attached archive contains the complete logs and text-to-image Task Manager GPU usage screenshots for Z-Image-Turbo runs 1 through 7 on v0.4.0 and v0.6.0.
From the “Prompt executed in … seconds” values in the logs, you can see that on v0.6.0 the generation time gets slower and slower over repeated runs, whereas v0.4.0 maintains stable and consistent generation times.
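For anyone wanting to compare the attached logs the same way, a minimal sketch (assuming the standard "Prompt executed in N seconds" log line format) that pulls out the per-run times:

```python
# Extract "Prompt executed in N seconds" values from a ComfyUI log
# so per-run times of the two versions can be compared side by side.
import re

PAT = re.compile(r"Prompt executed in ([\d.]+) seconds")

def executed_seconds(log_text):
    """Return the list of per-run execution times found in a log."""
    return [float(m) for m in PAT.findall(log_text)]

sample = """\
Prompt executed in 30.93 seconds
Prompt executed in 57.48 seconds
"""
print(executed_seconds(sample))  # [30.93, 57.48]
```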
Below are the Task Manager GPU usage screenshots from the 7th run on v0.4.0 and v0.6.0 respectively. Since the execution time on v0.6.0 is significantly longer, its screenshots are split into Start and End images:
screenshots(task manager GPU usage)_and_log_v0.4.0 20251226-1003.zip screenshots(task manager GPU usage)_and_log_v0.6.0 20251226-0951.zip
ComfyUI v0.4.0 Run#7 with z-image-turbo (Prompt executed in 30.93 seconds)
ComfyUI v0.6.0 Run#7 with z-image-turbo (Prompt executed in 57.48 seconds)
Start
End
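Quick arithmetic on the run #7 numbers above, to put the gap in relative terms:

```python
# Relative slowdown of v0.6.0 vs v0.4.0 for run #7,
# using the "Prompt executed in ..." values from the logs.
v040, v060 = 30.93, 57.48  # seconds

slowdown = v060 / v040
print(f"v0.6.0 took {slowdown:.2f}x as long ({(slowdown - 1) * 100:.0f}% slower)")
```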
Yeah it looks like it's using roughly the same amounts of VRAM and shared RAM. I'm leaning toward it being an inference precision issue related to the PR I linked earlier:
> But if they are identical workflows, then it's probably tied to #11344 if it appears to only be affecting Z-Image-Turbo. You're on a 2060, so it might not like doing inference in certain formats. I kind of remember having issues with my 2080 being like half as fast at one of the formats.
On v0.6.0, you might want to try this and test each of these options:
I'm not positive it will work, but this would be your best option to see if one of them matches the performance of the v0.4.0 version.
Here's an example of testing with the various compute formats on my PC with a 7900xt:
So in my case, my s/it are around 15% slower by using fp16, instead of using bf16. I tested it multiple times and the results were all roughly the same for each compute format. Default is just bf16, in my case. Note this card's horrible fp32 performance.