
bug: Excessive RAM overhead in Cortex when loading a model

Open mtomas7 opened this issue 11 months ago • 4 comments

Jan version

0.5.15 Win11

Describe the Bug

At first it looked to me as if Cortex loads the model twice: Jan's memory usage got out of control, approaching double the size of the model. I loaded Mistral-Small-24B-Instruct-2501-Q8_0, which is 23.33GB, and after loading, memory usage went up by 38GB. That's 14.67GB of overhead!

Perhaps it loads the model with a very big context window that balloons the memory usage?
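For a rough sanity check, the KV cache for a model like this can be estimated from its attention shape. Below is a minimal sketch, assuming the architecture values published for Mistral-Small-24B-Instruct-2501 (40 layers, 8 KV heads, head dimension 128; verify against the GGUF metadata) and llama.cpp's default f16 cache:

# Rough KV cache estimate for a grouped-query-attention model.
# Architecture values are assumptions for Mistral-Small-24B-Instruct-2501;
# verify against the GGUF header (llama.block_count, llama.attention.head_count_kv).
N_LAYERS = 40
N_KV_HEADS = 8
HEAD_DIM = 128

def kv_cache_bytes(ctx_len: int, bytes_per_elem: float = 2.0) -> float:
    # 2x for the separate K and V tensors kept per layer; f16 = 2 bytes/element
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_elem * ctx_len

for ctx in (8_192, 32_768, 65_536):
    print(f"ctx={ctx:>6}: {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
# ctx=  8192: 1.25 GiB
# ctx= 32768: 5.00 GiB
# ctx= 65536: 10.00 GiB

By that estimate, a 64K context alone would add roughly 10 GiB on top of the 23.33GB of weights, before llama.cpp's compute buffers, which is in the ballpark of the overhead reported above.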

I used to run models on my laptop with LM Studio without issues, but today I tried to use them with Jan and they failed due to lack of memory. I then tried to load a small 3B model that usually takes 3GB of RAM in LM Studio and noticed that after loading it in Jan, my RAM usage increased by ~6GB. So a laptop with 16GB of RAM now cannot load a 7GB model :(

Many users with bootstrapped systems at home are clearly at a disadvantage with this memory leak. It also causes a "Model failed to load" error for users who believe they have enough RAM to run the model. In my experience, the industry rule of thumb for memory (VRAM + RAM) is model size + 1GB.

Steps to Reproduce

No response

Screenshots / Logs

No response

What is your OS?

  • [ ] MacOS
  • [x] Windows
  • [ ] Linux

mtomas7 avatar Feb 24 '25 17:02 mtomas7

As a rough estimate, can we know the current model and llama.cpp settings? We are planning to integrate a more accurate estimation tool, rearrange the settings, and provide better guidance so it's easier to see which settings cause which side effects and what the benefits are. E.g. disabling the cache or lowering the KV cache quantization level reduces memory consumption but slows generation down.
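To put numbers on that trade-off, here is a sketch reusing kv_cache_bytes() from earlier in the thread (the q8_0 and q4_0 bytes-per-element figures are approximations; quantized cache blocks also store scales, adding a small overhead):

# Approximate KV cache size at a 65,536-token context for different cache types.
for name, bpe in (("f16", 2.0), ("q8_0", 1.0), ("q4_0", 0.5)):
    print(f"cache type {name:>5}: ~{kv_cache_bytes(65_536, bpe) / 2**30:.1f} GiB")
# f16: ~10.0 GiB, q8_0: ~5.0 GiB, q4_0: ~2.5 GiB

In llama.cpp itself this corresponds to the --cache-type-k / --cache-type-v options; halving the cache precision roughly halves this component of memory at some cost in generation quality.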

[screenshot attached]

louis-jan avatar Feb 27 '25 08:02 louis-jan

64K is a pretty high context length; I was using 8192. I will check how LM Studio and AnythingLLM deal with memory to have a comparison.

My settings (although I see this memory issue with any model):

[screenshots of settings attached]

# BEGIN GENERAL GGUF METADATA
id: Mistral-Small-24B-Instruct-2501 # Model ID unique between models (author / quantization)
model: Mistral-Small-24B-Instruct-2501-Q8_0 # Model ID which is used for request construct - should be unique between models (author / quantization)
name: Mistral-Small-24B-Instruct-2501-Q8_0 # metadata.general.name
version: 2
files:             # Can be relative OR absolute local file path
  - E:\AI\models\bartowski\Mistral-Small-24B-Instruct-2501-GGUF\Mistral-Small-24B-Instruct-2501-Q8_0.gguf
# END GENERAL GGUF METADATA

# BEGIN INFERENCE PARAMETERS
# BEGIN REQUIRED
stop:                # tokenizer.ggml.eos_token_id
  - </s>
# END REQUIRED

# BEGIN OPTIONAL
size: 25054779072
stream: true # Default true?
top_p: 0.95 # Ranges: 0 to 1
temperature: 0.15 # Ranges: 0 to 1
frequency_penalty: 0 # Ranges: 0 to 1
presence_penalty: 0 # Ranges: 0 to 1
max_tokens: 8192 # Should be default to context length
seed: -1
dynatemp_range: 0
dynatemp_exponent: 1
top_k: 40
min_p: 0.05
tfs_z: 1
typ_p: 1
repeat_last_n: 64
repeat_penalty: 1
mirostat: false
mirostat_tau: 5
mirostat_eta: 0.1
penalize_nl: false
ignore_eos: false
n_probs: 0
min_keep: 0
# END OPTIONAL
# END INFERENCE PARAMETERS

# BEGIN MODEL LOAD PARAMETERS
# BEGIN REQUIRED
engine: llama-cpp # engine to run model
prompt_template: "[INST] {system_message}\n[INST] {prompt} [/INST]"
# END REQUIRED

# BEGIN OPTIONAL
ctx_len: 8192 # llama.context_length | 0 or undefined = loaded from model
n_parallel: 1
ngl: 41 # Undefined = loaded from model
# END OPTIONAL
# END MODEL LOAD PARAMETERS
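Plugging this model.yml into the earlier estimate (weights size taken from the size: field, ctx_len: 8192, f16 cache) suggests the file as written should not balloon memory, so the effective context at load time was likely larger than 8192:

weights_gib = 25_054_779_072 / 2**30    # size: field above, ~23.33 GiB
kv_gib = kv_cache_bytes(8_192) / 2**30  # ~1.25 GiB at ctx_len 8192
print(f"expected ~{weights_gib + kv_gib:.1f} GiB plus compute buffers")
# expected ~24.6 GiB plus compute buffers, far below the ~38GB increase observed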

mtomas7 avatar Feb 27 '25 19:02 mtomas7

I remember that when using LM Studio, even if I tried to load a model that exceeded the RAM/VRAM, it would still load (very slowly, but it still worked). Is it possible to use the page file for such cases?
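On the page-file point: llama.cpp memory-maps GGUF files by default, so the OS can page weights in and out on demand; that is likely why LM Studio appeared to load over-sized models, just very slowly. A minimal Python illustration of the mechanism (model.gguf is a placeholder path):

import mmap

# A memory-mapped file reserves address space but commits physical RAM only
# for the pages actually touched, and the OS may evict them under pressure.
# This is the mechanism behind llama.cpp's default mmap loading, which lets
# a model larger than RAM run (slowly) from disk.
with open("model.gguf", "rb") as f:            # placeholder path
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        print(mm[:4])                          # b'GGUF' for a valid file

llama.cpp also exposes --no-mmap (read the whole file into RAM up front) and --mlock (pin pages so they are never swapped), so whether the page file can absorb an over-sized model depends on which of these the host application sets.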

mtomas7 avatar Mar 13 '25 21:03 mtomas7

Good idea, let me transfer this to cortex instead. cc @vansangpfiev @ramonpzg @selim1903

david-menloai avatar Apr 04 '25 16:04 david-menloai