Add keep alive to the embedding model config
Validations
- [X] I believe this is a way to improve. I'll try to join the Continue Discord for questions
- [X] I'm not able to find an open issue that requests the same enhancement
Problem
The other model configurations allow setting keepAlive (which is really useful with Ollama), so it would be nice to have that on the embedding model as well.
Solution
Add a keepAlive option under the embed model config.
I don't know if this is relevant for you, as I don't know your use case, but for anyone in my situation: I wanted to prevent the embedding model from unloading my big LLM every time (it takes ~95% of my GPU). So, here is a simple workaround:
- set the OLLAMA_KEEP_ALIVE env var to -1 (a global setting for all models)
- set num_gpu to 0 in the Ollama Modelfile for the embedding model
That way:
- the embedding model always stays loaded in RAM (which isn't really impactful, as embedding models are generally very light)
- the Big model stays loaded in VRAM :)
https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-keep-a-model-loaded-in-memory-or-make-it-unload-immediately
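Concretely, the workaround above can be sketched like this (the embedding model name `nomic-embed-text` and the target name `nomic-embed-cpu` are just examples; substitute your own):

```shell
# Global setting: keep every Ollama model loaded indefinitely
export OLLAMA_KEEP_ALIVE=-1

# Write a Modelfile that pins the embedding model to CPU (num_gpu 0),
# so it lives in RAM and never competes with the big LLM for VRAM
cat > Modelfile <<'EOF'
FROM nomic-embed-text
PARAMETER num_gpu 0
EOF

# Then register it with: ollama create nomic-embed-cpu -f Modelfile
```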
OK, deleting my comment as it's not relevant: I see Ollama doesn't actually load embedding models onto the GPU, even if you set num_gpu to 10 or so. So my comment was confusing!
@blakkd thank you anyway, I didn't know that. My use case is that I want to free the RAM back on my M3 Mac while the model isn't running. Even though the models aren't that big, it would still be nice to have the memory for other applications without having to close VS Code/whatever is using Continue.
Oh I see. But then why do you want to set keep_alive? It's meant to do exactly the opposite :thinking: However, maybe setting a duration shorter than the default (5 minutes from the last inference) could help you.
Yes, that's what I want: set it to 60s, which is a reasonable time between file saves while editing and enough to know that I'm no longer coding.
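For reference, per the Ollama FAQ linked earlier, keep_alive accepts a duration string like "60s" or "5m", a plain number of seconds, -1 to keep the model loaded indefinitely, or 0 to unload it immediately after the response. A per-request sketch (the model name is just an example):

```json
{ "model": "qwen3:14b", "prompt": "hello", "keep_alive": "60s" }
```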
(Just to note: it seems the fact that Ollama wasn't able to load embedding models on the GPU was a bug. I don't face it anymore on 0.3.8; I think I was on 0.3.6 before.)
This issue hasn't been updated in 90 days and will be closed after an additional 10 days without activity. If it's still important, please leave a comment and share any new information that would help us address the issue.
This still relevant
Hey @johnnyasantoss, could you show how you are using keep alive?
```yaml
models:
  - name: Qwen3 14b
    provider: ollama
    model: qwen3:14b
    apiBase: ....
    defaultCompletionOptions:
      keepAlive: 600 # (10 minutes)
```
This is what I am using, but running `ollama ps` I still see 30 minutes.
Thanks!
could you show how you are using keep alive?
Yes, it's the completionOptions setting in ~/.continue/config.json:
```json
"completionOptions": {
  "keepAlive": 120
}
```
Hey @johnnyasantoss, do you by any chance know what the filename would be on Windows?
Thanks!
> Hey @johnnyasantoss, do you by any chance know what the filename would be on Windows?
Idk, but it's probably in %APPDATA%
This issue hasn't been updated in 90 days and will be closed after an additional 10 days without activity. If it's still important, please leave a comment and share any new information that would help us address the issue.
This issue was closed because it wasn't updated for 10 days after being marked stale. If it's still important, please reopen + comment and we'll gladly take another look!