After updating to ComfyUI V0.3.59, some nodes such as VAE encode/decode and image upscaling have become extremely slow and laggy.
Custom Node Testing
- [ ] I have tried disabling custom nodes and the issue persists (see how to disable custom nodes if you need help)
Your question
After the update to V0.3.59, the VAE encode/decode and image upscaling nodes consume significantly more VRAM and run very slowly, whereas they worked normally in V0.3.57 and earlier. RTX 4080 with 64 GB of system RAM.
Thank you very much. I hope to get some help
Logs
Other
No response
I'm struggling with this as well. All of my current workflows are broken. I have a 4090 and can no longer process any SDXL images. VRAM usage spikes immediately to 24 GB and the VAE fails while trying to switch to tiled mode. I installed a second copy of ComfyUI with the ComfyUI Easy Installer from Pixorama and had the same problem. What I found is that removing the rgthree node pack makes things functional again, just none of my workflows work. Default simple ComfyUI workflows work properly.
Yes, especially VAE encoding and decoding, which also consume a lot of video memory and are 10 times slower than before
Same issue, so I added nodes like 'clean VRAM used', 'set reserved VRAM', and 'delay' to make the workflow run normally.
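For context, those "clean VRAM" style nodes are only a workaround, and most of them boil down to asking PyTorch to release its cached allocations between heavy steps. A minimal sketch of that idea (not any particular node pack's actual implementation):

import gc
import torch

def free_cached_vram():
    # Drop Python-side references first so the allocator can actually release blocks.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()   # return cached, unused blocks to the driver
        torch.cuda.ipc_collect()   # clean up CUDA IPC handles, if any were created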
I found the reason: it was caused by 'llama-cpp-python'. I just needed to delete the 'ComfyUI-JoyCaption' and 'ComfyUI-MiniCPM' plugins, because they use 'llama_cpp_install.py'. Deleting them restores normal behavior. It has nothing to do with ComfyUI itself.
Deleting these two plugins still didn't work. They still consume a lot of VRAM and are very slow. I have reverted back to version 0.3.56
By the way, I have 16G VRAM, and I have successfully tested it on pytorch version: 2.7.1+cu126 and pytorch version: 2.6+cu126.
I have an RTX 4080 with 16 GB VRAM + 64 GB of RAM. Python version: 3.12.3, pytorch version: 2.7.0+cu128.
Just like you. Previously, processing images of this size wouldn't take more than a second on a 5090. Now the only way I can fix the problem is to roll back to the previous version of ComfyUI.
Can you try after disabling custom nodes? This will help identify whether the issue should be addressed here or somewhere else.
I'm experiencing the same issue. After upgrading from v0.3.56 to v0.3.57, the VAE decode process has become extremely slow and the CUDA utilization is very low
I installed JoyCaption yesterday and now run into this issue; after disabling JoyCaption again it works as before. Besides the VAE, which makes VRAM usage explode, it also heavily impacts Ultralytics BBOX models. Very strange behavior. If anyone has an explanation it would be appreciated.
You can install other versions without using 'llama_cpp_install.py'.
This is a problem with ComfyUI itself; don't look for the problem elsewhere. It has happened before. Why else would V0.3.56 work normally? Don't you think so?
If that method doesn't work for you, you can try reinstalling ComfyUI. I also rolled back to 3.56 at first, but later I found the specific cause, and now I have no problem using 3.60.
What is the specific reason? Can you tell me? Even after reinstalling ComfyUI without any other nodes, if I update to any version higher than 3.56 it happens again.
Check your ComfyUI startup parameters.
A friend (4090) had the same issue and asked me for help. After investigating, I found that the problem might be related to cuDNN autotune. If the --fast parameter is present on the startup command line, it enables the benchmarking functionality and causes SDXL to allocate over 24 GB of VRAM when executing KSampler.
I was able to reproduce the issue on my own old build. This problem isn't limited to 30/40-series GPUs. Even 20-series cards (such as my Titan RTX) experience it.
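For anyone who wants to see the mechanism rather than take it on faith: the autotune feature simply sets torch.backends.cudnn.benchmark = True (see the comfy/ops.py snippet further down), which makes cuDNN profile several convolution algorithms, some with large workspaces, the first time it sees a new input shape. A toy, standalone illustration, not ComfyUI code, and exact numbers will vary by GPU and driver:

import torch

# This is the flag that ComfyUI's autotune feature turns on via --fast.
torch.backends.cudnn.benchmark = True

conv = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1).cuda()

# VAE decode / upscale workflows feed convolutions many distinct resolutions;
# with benchmark=True each new shape triggers a profiling pass that can
# allocate sizable cuDNN workspaces, which shows up as VRAM spikes.
for size in (512, 768, 1024):
    x = torch.randn(1, 4, size, size, device="cuda")
    conv(x)
    torch.cuda.synchronize()
    print(size, torch.cuda.max_memory_allocated() // 2**20, "MiB peak allocated")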
My normal startup parameters
py ComfyUI\main.py --windows-standalone-build --listen 0.0.0.0 --port 58188
pause
Logs
Checkpoint files will always be loaded safely.
Total VRAM 24576 MB, total RAM 130756 MB
pytorch version: 2.8.0+cu128
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA TITAN RTX : cudaMallocAsync
Using pytorch attention
Python version: 3.13.5 (tags/v3.13.5:6cb20a2, Jun 11 2025, 16:15:46) [MSC v.1943 64 bit (AMD64)]
ComfyUI version: 0.3.62
ComfyUI frontend version: 1.26.13
Reproduce the issue
py ComfyUI\main.py --fast --windows-standalone-build --listen 0.0.0.0 --port 58188
pause
Logs
Checkpoint files will always be loaded safely.
Total VRAM 24576 MB, total RAM 130756 MB
pytorch version: 2.8.0+cu128
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA TITAN RTX : cudaMallocAsync
Using pytorch attention
Python version: 3.13.5 (tags/v3.13.5:6cb20a2, Jun 11 2025, 16:15:46) [MSC v.1943 64 bit (AMD64)]
ComfyUI version: 0.3.62
ComfyUI frontend version: 1.26.13
Solution 1, remove --fast from startup parameters (Recommended)
Simply removing --fast will solve the issue, but you will lose the other optimizations. In practice, not much changes.
EDIT1: With both --fast and --use-sage-attention (the library needs to be compiled in a VS2022 environment), performance improves by about 10%~15%
py ComfyUI\main.py --fast --use-sage-attention --cuda-malloc --windows-standalone-build --listen 0.0.0.0 --port 58188
pause
EDIT2: With only --use-sage-attention, you lose about 3% performance
py ComfyUI\main.py --use-sage-attention --cuda-malloc --windows-standalone-build --listen 0.0.0.0 --port 58188
pause
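Note: --fast also accepts individual feature names (the PerformanceFeature values listed in Solution 2 below), so you may be able to keep fp16 accumulation while leaving autotune off; a later reply in this thread reports exactly that working on v0.3.72. Untested on my build, but it would look like:
py ComfyUI\main.py --fast fp16_accumulation --use-sage-attention --cuda-malloc --windows-standalone-build --listen 0.0.0.0 --port 58188
pause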
Solution 2, modify your ComfyUI
WARNING: This method may prevent you from upgrading ComfyUI via Git from the official channel in the future. First upgrade to 0.3.62 (latest).
Modify comfy/cli_args.py
https://github.com/comfyanonymous/ComfyUI/blob/master/comfy/cli_args.py
Comment out line 146.
class PerformanceFeature(enum.Enum):
    Fp16Accumulation = "fp16_accumulation"
    Fp8MatrixMultiplication = "fp8_matrix_mult"
    CublasOps = "cublas_ops"
    #AutoTune = "autotune"  # Disable autotune
Modify comfy/ops.py
https://github.com/comfyanonymous/ComfyUI/blob/master/comfy/ops.py
Comment out lines 55 and 56.
cast_to = comfy.model_management.cast_to #TODO: remove once no more references

#if torch.cuda.is_available() and torch.backends.cudnn.is_available() and PerformanceFeature.AutoTune in args.fast:
#    torch.backends.cudnn.benchmark = True

def cast_to_input(weight, input, non_blocking=False, copy=True):
    return comfy.model_management.cast_to(weight, input.dtype, input.device, non_blocking=non_blocking, copy=copy)
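If you'd rather not edit ComfyUI's own files, another option (an untested sketch, with a hypothetical folder name) is a tiny custom node package whose only job is to flip the flag back off. Custom nodes are imported after comfy/ops.py has already run, so the reset should take effect:

# ComfyUI/custom_nodes/disable_cudnn_autotune/__init__.py  (hypothetical package)
import torch

# comfy/ops.py sets this to True when --fast enables autotune; custom nodes
# load afterwards, so turning it back off here disables cuDNN benchmarking
# without touching ComfyUI's source.
if torch.cuda.is_available() and torch.backends.cudnn.is_available():
    torch.backends.cudnn.benchmark = False
    print("[disable_cudnn_autotune] cudnn benchmark disabled")

# ComfyUI looks for these mappings in every custom node package; empty dicts
# are fine because this package registers no nodes.
NODE_CLASS_MAPPINGS = {}
NODE_DISPLAY_NAME_MAPPINGS = {}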
This is something I've been experiencing myself. Ever since v0.3.57 something has changed and it causes erratic spikes in VRAM usage that slow everything down.
Checkpoint files will always be loaded safely.
Total VRAM 24576 MB, total RAM 65299 MB
pytorch version: 2.8.0+cu129
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3090 : cudaMallocAsync
Using sage attention
Python version: 3.13.6 (tags/v3.13.6:4e66535, Aug 6 2025, 14:36:00) [MSC v.1944 64 bit (AMD64)]
ComfyUI version: 0.3.64
I've run a test on a clean portable installation, no custom nodes, sage-attention installed, and running a basic SDXL t2i looks like this:
Constant jumps to maximum utilization that then drop off. Erratic VRAM allocation behavior that slows down the entire sampling process, and it's even worse in my usual workflow, which also does ControlNet and ESRGAN upscaling; the flow chokes on every step that requires loading a model.
After following @mirabarukaso's advice and removing the --fast flag, the problem was solved. 8 GB of VRAM was quickly allocated by the model and sampling went on perfectly smoothly. v0.3.56 was the last version unaffected by this issue, but at least now I know what the culprit was, so I can finally update in peace.
It works! @mirabarukaso Big Thanks!!!
Rolling back to 3.56 was the only thing that fixed this for me. Thanks thread!
Updated to v0.3.72 and re-added --fast fp16_accumulation to the launch parameters. The erratic VRAM allocation issue seems to be gone and the workflow runs as expected. Worth testing further to see if it's resolved.