[bug]: Models constantly unload after every generation
Is there an existing issue for this problem?
- [x] I have searched the existing issues
Operating system
Windows
GPU vendor
Nvidia (CUDA)
GPU model
No response
GPU VRAM
No response
Version number
5.15
Browser
Firefox
Python dependencies
No response
What happened
Models are immediately removed from RAM after every generation, including during batch generations or multiple iterations. So if I have a queue of 10 images to generate, the models are loaded and unloaded 10 times. I have tried setting lazy_offload to true with no effect. As far as I can tell, there is no setting or option to stop this from happening.
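For reference, this is roughly the edit I made in invokeai.yaml (a minimal sketch of my change; everything else is the stock config that shipped with a fresh install):

```yaml
# invokeai.yaml — default config plus the one setting I added
schema_version: 4.0.2
lazy_offload: true
```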
What you expected to happen
I would expect the model to stay in memory for at least a short time, or at the very least for the duration of a batch run. Reloading slows down the process of generating multiple images. Keeping models cached should also be the default behavior.
How to reproduce the problem
No response
Additional context
No response
Discord username
No response
What is your VRAM/RAM cache set at?
> What is your VRAM/RAM cache set at?
And where exactly do I find those details in InvokeAI? I can't find them in the settings page. The config file I am using is the default one that came with a fresh install.
I have noticed that if I use a model that fits entirely in VRAM (16 GB), it does not reload. If it spills into system RAM at all, it is immediately unloaded once image generation is finished, regardless of whether there are more images in the queue.
For some reason the installer had installed version 5.15. I have now updated to 6.0.1, but the behavior is the same.
> And where exactly do I find those details in InvokeAI? I can't find them in the settings page. The config file I am using is the default one that came with a fresh install.
Click the gear icon @ bottom left -> About -> copy the system info
Gear icon -> About -> there is no labelled "system info". There is an open section with what looks like a JSON file in it, containing "version", "dependencies", and "config", and beside that some details about the program itself. I am going to assume you mean the config section of that JSON. Looking through it, I noticed that ram and vram are both set to null, and max_cache_ram_gb and max_cache_vram_gb are both null as well. Here is a paste of the config:
"config": {
    "schema_version": "4.0.2",
    "legacy_models_yaml_path": null,
    "host": "127.0.0.1",
    "port": 9091,
    "allow_origins": [],
    "allow_credentials": true,
    "allow_methods": ["*"],
    "allow_headers": ["*"],
    "ssl_certfile": null,
    "ssl_keyfile": null,
    "log_tokenization": false,
    "patchmatch": true,
    "models_dir": "models",
    "convert_cache_dir": "models\\.convert_cache",
    "download_cache_dir": "models\\.download_cache",
    "legacy_conf_dir": "configs",
    "db_dir": "databases",
    "outputs_dir": "outputs",
    "custom_nodes_dir": "nodes",
    "style_presets_dir": "style_presets",
    "workflow_thumbnails_dir": "workflow_thumbnails",
    "log_handlers": ["console"],
    "log_format": "color",
    "log_level": "info",
    "log_sql": false,
    "log_level_network": "warning",
    "use_memory_db": false,
    "dev_reload": false,
    "profile_graphs": false,
    "profile_prefix": null,
    "profiles_dir": "profiles",
    "max_cache_ram_gb": null,
    "max_cache_vram_gb": null,
    "log_memory_usage": false,
    "device_working_mem_gb": 3,
    "enable_partial_loading": false,
    "keep_ram_copy_of_weights": true,
    "ram": null,
    "vram": null,
    "lazy_offload": true,
    "pytorch_cuda_alloc_conf": null,
    "device": "auto",
    "precision": "auto",
    "sequential_guidance": false,
    "attention_type": "auto",
    "attention_slice_size": "auto",
    "force_tiled_decode": false,
    "pil_compress_level": 1,
    "max_queue_size": 10000,
    "clear_queue_on_startup": false,
    "allow_nodes": null,
    "deny_nodes": null,
    "node_cache_size": 512,
    "hashing_algorithm": "blake3_single",
    "remote_api_tokens": null,
    "scan_models_on_startup": false
},
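If those null cache fields are what matters, I'm guessing they would go in invokeai.yaml along these lines (a sketch based only on the key names in the dump above; the values are just what I would try on a 16 GB card, not from any docs):

```yaml
# invokeai.yaml — guessed values for my hardware, not recommendations
max_cache_ram_gb: 32.0   # how much system RAM the model cache may use
max_cache_vram_gb: 14.0  # how much VRAM the model cache may use, leaving some working headroom on 16 GB
```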
This happens on my setup as well, with a 5090 and 128 GB of RAM.
FLUX Dev keeps reloading every model every time, and other models get reloaded constantly as well. Since the creator of the issue did not provide much info, let me send you more details.
I've seen a warning like this in the log, so it might be the culprit:
WARNING --> [MODEL CACHE] Failed to calculate model size for unexpected model type: <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>. The model will be treated as having size 0.
VRAM: [usage graph screenshot]
RAM: [usage graph screenshot]
Startup log:
Starting the InvokeAI browser-based UI..
[InvokeAI]::INFO --> Using torch device: NVIDIA GeForce RTX 5090
[InvokeAI]::INFO --> cuDNN version: 90701
[InvokeAI]::INFO --> Patchmatch initialized
[InvokeAI]::INFO --> InvokeAI version 6.8.0
[InvokeAI]::INFO --> Root directory = C:\invokeai
[InvokeAI]::INFO --> Initializing database at C:\invokeai\databases\invokeai.db
[ModelManagerService]::INFO --> [MODEL CACHE] Calculated model RAM cache size: 29534.56 MB. Heuristics applied: [1, 2].
[InvokeAI]::INFO --> Executing queue item 58105, session 194796df-fa5d-44b7-ba4c-21d4916e253c
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 26.13it/s]
[InvokeAI]::INFO --> Cleaned database (freed 1.17MB)
[InvokeAI]::INFO --> Invoke running on http://0.0.0.0:9090 (Press CTRL+C to quit)
[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'c9b3f5a7-fad2-4707-b0fb-e0fdc53f8859:text_encoder' (GlmModel) onto cuda device in 42.10s. Total model size: 16744.98MB, VRAM: 16744.98MB (100.0%)
[ModelManagerService]::WARNING --> [MODEL CACHE] Failed to calculate model size for unexpected model type: <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>. The model will be treated as having size 0.
[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'c9b3f5a7-fad2-4707-b0fb-e0fdc53f8859:tokenizer' (PreTrainedTokenizerFast) onto cuda device in 0.00s. Total model size: 0.00MB, VRAM: 0.00MB (0.0%)
[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'c9b3f5a7-fad2-4707-b0fb-e0fdc53f8859:text_encoder' (GlmModel) onto cuda device in 0.00s. Total model size: 16744.98MB, VRAM: 16744.98MB (100.0%)
[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'c9b3f5a7-fad2-4707-b0fb-e0fdc53f8859:tokenizer' (PreTrainedTokenizerFast) onto cuda device in 0.00s. Total model size: 0.00MB, VRAM: 0.00MB (0.0%)
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 28.58it/s]
[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'c9b3f5a7-fad2-4707-b0fb-e0fdc53f8859:transformer' (CogView4Transformer2DModel) onto cuda device in 30.14s. Total model size: 12148.13MB, VRAM: 12148.13MB (100.0%)
100%|██████████████████████████████████████████████████████████████████████████████████| 30/30 [00:26<00:00, 1.14it/s]
estimate_vae_working_memory_cogview4: 4613734400
[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'c9b3f5a7-fad2-4707-b0fb-e0fdc53f8859:vae' (AutoencoderKL) onto cuda device in 1.90s. Total model size: 774.58MB, VRAM: 774.58MB (100.0%)
[InvokeAI]::INFO --> Graph stats: 194796df-fa5d-44b7-ba4c-21d4916e253c
Node Calls Seconds VRAM Used
cogview4_model_loader 1 0.004s 0.000G
cogview4_text_encoder 2 43.973s 16.444G
string 1 0.001s 16.400G
integer 1 0.001s 16.400G
cogview4_denoise 1 56.637s 16.402G
core_metadata 1 0.001s 11.872G
cogview4_l2i 1 4.258s 17.385G
TOTAL GRAPH EXECUTION TIME: 104.875s
TOTAL GRAPH WALL TIME: 104.878s
RAM used by InvokeAI process: 14.63G (+13.769G)
RAM used to load models: 28.97G
VRAM in use: 12.630G
RAM cache statistics:
Model cache hits: 6
Model cache misses: 4
Models cached: 3
Models cleared from cache: 1
Cache high water mark: 28.22/0.00G
[InvokeAI]::INFO --> Executing queue item 58106, session e7b59abd-bd2a-4692-9cfe-87fcbf21b074
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 92.61it/s]
[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'c9b3f5a7-fad2-4707-b0fb-e0fdc53f8859:text_encoder' (GlmModel) onto cuda device in 5.73s. Total model size: 16744.98MB, VRAM: 16744.98MB (100.0%)
[ModelManagerService]::WARNING --> [MODEL CACHE] Failed to calculate model size for unexpected model type: <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>. The model will be treated as having size 0.
[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'c9b3f5a7-fad2-4707-b0fb-e0fdc53f8859:tokenizer' (PreTrainedTokenizerFast) onto cuda device in 0.00s. Total model size: 0.00MB, VRAM: 0.00MB (0.0%)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 129.44it/s]
[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'c9b3f5a7-fad2-4707-b0fb-e0fdc53f8859:transformer' (CogView4Transformer2DModel) onto cuda device in 4.09s. Total model size: 12148.13MB, VRAM: 12148.13MB (100.0%)
100%|██████████████████████████████████████████████████████████████████████████████████| 30/30 [00:24<00:00, 1.24it/s]
estimate_vae_working_memory_cogview4: 4613734400
[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model 'c9b3f5a7-fad2-4707-b0fb-e0fdc53f8859:vae' (AutoencoderKL) onto cuda device in 0.26s. Total model size: 774.58MB, VRAM: 774.58MB (100.0%)
[InvokeAI]::INFO --> Graph stats: e7b59abd-bd2a-4692-9cfe-87fcbf21b074
Node Calls Seconds VRAM Used
cogview4_model_loader 1 0.001s 12.630G
cogview4_text_encoder 2 7.673s 17.213G
string 1 0.000s 12.630G
integer 1 0.000s 17.161G
cogview4_denoise 1 28.656s 17.161G
core_metadata 1 0.001s 11.873G
cogview4_l2i 1 2.267s 17.386G
TOTAL GRAPH EXECUTION TIME: 38.598s
TOTAL GRAPH WALL TIME: 38.600s
RAM used by InvokeAI process: 14.66G (+0.030G)
RAM used to load models: 28.97G
VRAM in use: 12.631G
RAM cache statistics:
Model cache hits: 4
Model cache misses: 4
Models cached: 3
Models cleared from cache: 1
Cache high water mark: 28.22/0.00G
One more generation log, just in case:
[InvokeAI]::INFO --> Executing queue item 58207, session 614452a8-ddee-4463-9328-a74e8899279c
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 85.05it/s]
[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model '9299a25a-4111-489e-97f8-fcfd098ef0b1:text_encoder_2' (T5EncoderModel) onto cuda device in 3.14s. Total model size: 9083.39MB, VRAM: 9083.39MB (100.0%)
[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model '9299a25a-4111-489e-97f8-fcfd098ef0b1:tokenizer_2' (T5TokenizerFast) onto cuda device in 0.00s. Total model size: 0.03MB, VRAM: 0.00MB (0.0%)
[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model '3bad88b6-c43a-468d-907c-2ebf6b870366:text_encoder' (CLIPTextModel) onto cuda device in 0.06s. Total model size: 469.44MB, VRAM: 469.44MB (100.0%)
[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model '3bad88b6-c43a-468d-907c-2ebf6b870366:tokenizer' (CLIPTokenizer) onto cuda device in 0.00s. Total model size: 0.00MB, VRAM: 0.00MB (0.0%)
[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model '8a48478d-8209-4755-80e8-212be678a68e:transformer' (Flux) onto cuda device in 7.96s. Total model size: 22700.13MB, VRAM: 22700.13MB (100.0%)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:15<00:00, 1.93it/s]
estimate_vae_working_memory_flux: 4613734400
[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model '1b3e7358-9440-4f9b-8d15-462f1636fc1c:vae' (AutoEncoder) onto cuda device in 0.03s. Total model size: 159.87MB, VRAM: 159.87MB (100.0%)
[InvokeAI]::INFO --> Graph stats: 614452a8-ddee-4463-9328-a74e8899279c
Node Calls Seconds VRAM Used
flux_model_loader 1 0.000s 22.887G
string 1 0.001s 22.887G
flux_text_encoder 1 5.760s 22.887G
collect 1 0.001s 9.590G
integer 1 0.001s 9.590G
flux_denoise 1 24.588s 23.298G
core_metadata 1 0.001s 22.722G
flux_vae_decode 1 0.428s 25.015G
TOTAL GRAPH EXECUTION TIME: 30.780s
TOTAL GRAPH WALL TIME: 30.781s
RAM used by InvokeAI process: 25.21G (-0.000G)
RAM used to load models: 31.65G
VRAM in use: 22.887G
RAM cache statistics:
Model cache hits: 6
Model cache misses: 6
Models cached: 5
Models cleared from cache: 2
Cache high water mark: 22.78/0.00G
Info:
{
"version": "6.8.0",
"dependencies": {
"absl-py" : "2.3.1",
"accelerate" : "1.10.1",
"annotated-types" : "0.7.0",
"anyio" : "4.11.0",
"attrs" : "25.4.0",
"bidict" : "0.23.1",
"bitsandbytes" : "0.48.1",
"blake3" : "1.0.7",
"certifi" : "2022.12.7",
"cffi" : "2.0.0",
"charset-normalizer" : "2.1.1",
"click" : "8.3.0",
"colorama" : "0.4.6",
"coloredlogs" : "15.0.1",
"compel" : "2.1.1",
"contourpy" : "1.3.3",
"CUDA" : "12.8",
"cycler" : "0.12.1",
"Deprecated" : "1.2.18",
"diffusers" : "0.33.0",
"dnspython" : "2.8.0",
"dynamicprompts" : "0.31.0",
"einops" : "0.8.1",
"fastapi" : "0.118.2",
"fastapi-events" : "0.12.2",
"filelock" : "3.13.1",
"flatbuffers" : "25.9.23",
"fonttools" : "4.60.1",
"fsspec" : "2024.6.1",
"gguf" : "0.17.1",
"h11" : "0.16.0",
"httptools" : "0.6.4",
"huggingface-hub" : "0.35.3",
"humanfriendly" : "10.0",
"idna" : "3.4",
"importlib_metadata" : "7.1.0",
"InvokeAI" : "6.8.0",
"jax" : "0.7.1",
"jaxlib" : "0.7.1",
"Jinja2" : "3.1.4",
"kiwisolver" : "1.4.9",
"MarkupSafe" : "2.1.5",
"matplotlib" : "3.10.7",
"mediapipe" : "0.10.14",
"ml_dtypes" : "0.5.3",
"mpmath" : "1.3.0",
"networkx" : "3.3",
"numpy" : "1.26.3",
"onnx" : "1.16.1",
"onnxruntime" : "1.19.2",
"opencv-contrib-python": "4.11.0.86",
"opt_einsum" : "3.4.0",
"packaging" : "24.1",
"picklescan" : "0.0.31",
"pillow" : "11.0.0",
"prompt_toolkit" : "3.0.52",
"protobuf" : "4.25.8",
"psutil" : "7.1.0",
"pycparser" : "2.23",
"pydantic" : "2.11.10",
"pydantic-settings" : "2.11.0",
"pydantic_core" : "2.33.2",
"pyparsing" : "3.2.5",
"PyPatchMatch" : "1.0.2",
"pyreadline3" : "3.5.4",
"python-dateutil" : "2.9.0.post0",
"python-dotenv" : "1.1.1",
"python-engineio" : "4.12.3",
"python-multipart" : "0.0.20",
"python-socketio" : "5.14.1",
"PyWavelets" : "1.9.0",
"PyYAML" : "6.0.3",
"regex" : "2025.9.18",
"requests" : "2.28.1",
"safetensors" : "0.6.2",
"scipy" : "1.16.2",
"semver" : "3.0.4",
"sentencepiece" : "0.2.0",
"setuptools" : "70.2.0",
"simple-websocket" : "1.1.0",
"six" : "1.17.0",
"sniffio" : "1.3.1",
"sounddevice" : "0.5.2",
"spandrel" : "0.4.1",
"starlette" : "0.48.0",
"sympy" : "1.13.3",
"tokenizers" : "0.22.1",
"torch" : "2.7.1+cu128",
"torchsde" : "0.2.6",
"torchvision" : "0.22.1+cu128",
"tqdm" : "4.66.5",
"trampoline" : "0.1.2",
"transformers" : "4.57.0",
"typing-inspection" : "0.4.2",
"typing_extensions" : "4.12.2",
"urllib3" : "1.26.13",
"uvicorn" : "0.37.0",
"watchfiles" : "1.1.0",
"wcwidth" : "0.2.14",
"websockets" : "15.0.1",
"wrapt" : "1.17.3",
"wsproto" : "1.2.0",
"zipp" : "3.19.2"
},
"config": {
"schema_version": "4.0.2",
"legacy_models_yaml_path": null,
"host": "0.0.0.0",
"port": 9090,
"allow_origins": [],
"allow_credentials": true,
"allow_methods": ["*"],
"allow_headers": ["*"],
"ssl_certfile": null,
"ssl_keyfile": null,
"log_tokenization": false,
"patchmatch": true,
"models_dir": "models",
"convert_cache_dir": "models\\.convert_cache",
"download_cache_dir": "models\\.download_cache",
"legacy_conf_dir": "configs",
"db_dir": "databases",
"outputs_dir": "C:\\invokeai\\outputs",
"custom_nodes_dir": "nodes",
"style_presets_dir": "style_presets",
"workflow_thumbnails_dir": "workflow_thumbnails",
"log_handlers": ["console"],
"log_format": "color",
"log_level": "info",
"log_sql": false,
"log_level_network": "warning",
"use_memory_db": false,
"dev_reload": false,
"profile_graphs": false,
"profile_prefix": null,
"profiles_dir": "profiles",
"max_cache_ram_gb": null,
"max_cache_vram_gb": null,
"log_memory_usage": false,
"device_working_mem_gb": 3,
"enable_partial_loading": false,
"keep_ram_copy_of_weights": true,
"ram": 64,
"vram": null,
"lazy_offload": true,
"pytorch_cuda_alloc_conf": null,
"device": "auto",
"precision": "auto",
"sequential_guidance": false,
"attention_type": "auto",
"attention_slice_size": "auto",
"force_tiled_decode": false,
"pil_compress_level": 1,
"max_queue_size": 10000,
"clear_queue_on_startup": false,
"allow_nodes": null,
"deny_nodes": null,
"node_cache_size": 512,
"hashing_algorithm": "blake3_single",
"remote_api_tokens": null,
"scan_models_on_startup": false,
"unsafe_disable_picklescan": false
},
"set_config_fields": ["legacy_models_yaml_path", "host", "ram", "outputs_dir"]
}
The script I use to start InvokeAI:
@echo off
PUSHD "%~dp0"
setlocal
call .venv\Scripts\activate.bat
set INVOKEAI_ROOT=.
:start
echo Desired action:
echo 1. Generate images with the browser-based interface
echo 2. Open the developer console
echo 3. Command-line help
echo Q - Quit
echo.
echo To update, download and run the installer from https://github.com/invoke-ai/InvokeAI/releases/latest
echo.
set /P choice="Please enter 1-4, Q: [1] "
if not defined choice set choice=1
IF /I "%choice%" == "1" (
echo Starting the InvokeAI browser-based UI..
python .venv\Scripts\invokeai-web.exe %*
) ELSE IF /I "%choice%" == "2" (
echo Developer Console
echo Python command is:
where python
echo Python version is:
python --version
echo *************************
echo You are now in the system shell, with the local InvokeAI Python virtual environment activated,
echo so that you can troubleshoot this InvokeAI installation as necessary.
echo *************************
echo *** Type `exit` to quit this shell and deactivate the Python virtual environment ***
call cmd /k
) ELSE IF /I "%choice%" == "3" (
echo Displaying command line help...
python .venv\Scripts\invokeai-web.exe --help %*
pause
exit /b
) ELSE IF /I "%choice%" == "q" (
echo Goodbye!
goto ending
) ELSE (
echo Invalid selection
pause
exit /b
)
goto start
endlocal
pause
:ending
exit /b
ComfyUI works fine, with no spikes like these.
I'd be more than happy to provide whatever info is needed to fix this issue, or test some workarounds.
I tried adding this to my config:
ram: 64.0
vram: 31.0
max_cache_ram_gb: 64.0
max_cache_vram_gb: 31.0
And it solved the issue of models constantly being reloaded into RAM. The RAM usage graph is nice and flat now:
But the VRAM usage graph still looks the same. I guess I might not have enough VRAM after all, even though the log says "VRAM in use: 22.416G" and I have about 10 GB more than that.