[bug]: Flux Extremely Slow in Invoke Compared to ComfyUI and Forge
Is there an existing issue for this problem?
- [x] I have searched the existing issues
Operating system
Windows 11 23H2 22631.4751
GPU vendor
Nvidia (CUDA)
GPU model
RTX 2070 Super, driver version: 565.90
GPU VRAM
8GB
Version number
5.6.0
Browser
Google Chrome 132.0.6834.160, Invoke Client
Python dependencies
No response
What happened
Hello,
I’m experiencing an issue with Flux models running extremely slowly in Invoke. I’m using Invoke version 5.6.0, installed via Launcher. For comparison, in ComfyUI I achieve a generation speed of approximately 3.2–3.3 seconds per iteration at a resolution of 832×1216. However, using the same resolution in Invoke results in a staggering ~20 seconds per iteration, and I eventually encounter an OOM error. This is despite ComfyUI consuming significantly less VRAM, RAM, and swap. Even Stable Diffusion WebUI Forge runs at roughly the same speed as ComfyUI—although it uses noticeably more RAM, which sometimes leads to OOM errors (when I use multiple LoRAs).
I’m using the following models in ComfyUI:
- flux1-dev-Q8_0.gguf
- majicFlus 麦橘超然
- t5xxl_fp16.safetensors
- clip_l.safetensors
I attempted to import my existing text encoders into Invoke but was unsuccessful, so I loaded the equivalents provided by Invoke instead.
I've also tried adding `enable_partial_loading: true` to the invokeai.yaml file and experimenting with various combinations of `max_cache_ram_gb`, `max_cache_vram_gb`, and `device_working_mem_gb`, but nothing seems to make a fundamental difference. (Note: "Nvidia sysmem fallback" is also disabled.)
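For reference, the relevant part of my invokeai.yaml during those experiments looked roughly like this (the values are just examples of what I tried, not a recommendation):

```yaml
# invokeai.yaml (excerpt) – settings I experimented with; values shown are examples only
enable_partial_loading: true
max_cache_ram_gb: 28       # system RAM the model cache may use
max_cache_vram_gb: 5       # VRAM the model cache may use
device_working_mem_gb: 3   # example value; I varied this as well
```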
I really enjoy using Invoke—I periodically download it to see how it evolves—but I always end up returning to ComfyUI because even the SDXL models in Invoke run slightly slower, and I encounter OOM errors at resolutions where ComfyUI doesn’t even require tiling. I’d love to use Invoke because of its pleasant UI and fantastic inpainting capabilities, but unfortunately, optimization for non-high-end configurations still leaves much to be desired. If needed, I can provide additional information to help diagnose the issue.
Thank you for your attention.
What you expected to happen
I acknowledge that my configuration (8GB of VRAM and 32GB of RAM) isn't ideal for the demanding Flux models. Nevertheless, both ComfyUI and Forge handle these models at speeds that are acceptable to me. I would like to confirm that the issue isn't on my end, and I hope to see better optimization in Invoke. This is especially important since img2img (even at higher SDXL resolutions) and upscaling consistently lead to OOM errors in Invoke, whereas ComfyUI performs swiftly at even higher resolutions with lower resource usage. Additionally, with tiled VAE decoding, ComfyUI can handle resolutions that are extraordinarily high by Invoke's standards on a system with just 8GB of VRAM.
How to reproduce the problem
- Use an Nvidia GPU with 8GB of VRAM and 32GB of RAM
- Do a standard install of Invoke 5.6.0 via the Launcher
- Disable "Nvidia sysmem fallback" in the Nvidia driver settings
- Add `enable_partial_loading: true` to the invokeai.yaml file (see the snippet below)
- Generate an 832x1216 or 1024x1024 image to the canvas using flux1-dev-Q8_0.gguf or majicFlus 麦橘超然 (any other Flux model of the same size can be used) + t5xxl_fp16, clip_l, and the VAE
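For clarity, the only change made to invokeai.yaml for this repro is the single line below (a minimal sketch; everything else is left at its defaults):

```yaml
# invokeai.yaml – the one setting added for the repro
enable_partial_loading: true
```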
Additional context
No response
Discord username
No response
I conducted additional tests using SDXL models in Invoke (versions 5.6.0 and 5.6.1rc1), as Flux models failed to generate any output. Below are my findings compared to ComfyUI:
- At 832x1216 resolution, Invoke matches ComfyUI's generation speed (~1.90 it/s) and VRAM/RAM usage only when `enable_partial_loading: true` is disabled. Adding LoRAs under these conditions has minimal impact on speed.
- Enabling `enable_partial_loading: true` reduces speed to 1.3–1.4 it/s.
- Combining it with `max_cache_ram_gb: 28` improves speed to ~1.60 it/s.
- Adding `max_cache_vram_gb: 5` further increases speed to ~1.8 it/s (full combination shown below), but introduces OOM errors during img2img upscaling (even at x1.5). In contrast, ComfyUI handles x2.5 upscales (non-img2img, dedicated upscale models) without issues.
- With `enable_partial_loading: true` active, adding multiple LoRAs drops generation speed to 1 it/s. Experimentation with `max_cache_vram_gb` and `device_working_mem_gb` yielded no viable solutions, only OOM errors or unacceptably slow speeds.
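For reference, the fastest partial-loading combination from the tests above looks roughly like this; it still hits OOM during img2img upscaling, so I'm sharing it as a data point rather than a recommendation:

```yaml
# invokeai.yaml (excerpt) – fastest partial-loading combo in my SDXL tests (~1.8 it/s at 832x1216),
# but it still OOMs on img2img upscaling, even at x1.5
enable_partial_loading: true
max_cache_ram_gb: 28
max_cache_vram_gb: 5
```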
I absolutely love the incredible inpainting, outpainting, and all those nifty image editing features that Invoke offers. That said, for me, all this potential is a bit held back by Invoke’s pretty high demands. I’m not exactly a tech expert, and I don’t really understand all the “magic” the Comfy folks work behind the scenes—but for users with limited VRAM, ComfyUI is a real lifesaver. It even lets you crank out high-resolution images with Flux models (FP8/Q8_0 + t5xxl_fp16 + a few LoRAs)!
The only downside is that ComfyUI can’t quite match Invoke’s sleek, user-friendly interface or its mind-blowing inpainting capabilities. I really hope the InvokeAI team takes note and manages to optimize things to at least ComfyUI’s level—if not even better!
Thanks!
Has there not been any progress on this? I was excited to try Flux on my 3090, and it is unusably slow. I love InvokeAI's UI and really don't want to switch.
@ciriguaya Actually, I think the developers are not interested in optimizing and popularizing InvokeAI. There are a couple of similar issues here, and after months there's still no response from anyone. I've tried installing Invoke many times to track its progress. While I've always been pleasantly impressed by their frontend, their backend is unbelievably sluggish and outdated. Invoke consistently runs slower while consuming more resources, and its features pale in comparison to even base ComfyUI (not counting the huge number of third-party nodes from enthusiasts). The last time I tried Invoke, it still lacked support for different ESRGAN upscalers, something A1111 already had back in 2023... I've given up on returning to Invoke, as it's just a waste of time, and now I've fully switched to ComfyUI + this Krita plugin.
@Yaruze66 How did you even get it to run at 1 it/s? I have a 2070, and it takes 6 FULL MINUTES for a single iteration (not an image, an iteration!). I have 8GB of VRAM and 48GB of RAM. I tried Flux Q3, which is only 5GB in size. I, too, wasn't able to import my existing text encoders from ComfyUI and had to download them from the starter models. If I disable CUDA sysmem fallback, I simply get OOM, but that shows what the issue is: it's loading the text encoder into VRAM. Again, how did you even manage to get it working at all?
This is a real issue even with my 3060 12GB: generation takes around 5 minutes for me. I downloaded the recommended quantized model for Flux Kontext from the Models tab, and it's still extremely slow.