Patrick Devine

Showing 426 comments by Patrick Devine

@MHugonKaliop you can set `OLLAMA_KEEP_ALIVE=-1m` to prevent the model from ever being unloaded. It's probably stuck in the `Stopping...` state because it is trying to unload the...
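For anyone setting this up, keep-alive can be configured on the server via an environment variable; any negative duration keeps the model loaded indefinitely. A minimal sketch (how you set the variable will vary by install, e.g. systemd overrides on Linux):

```shell
# Keep loaded models in memory indefinitely (any negative duration works)
export OLLAMA_KEEP_ALIVE=-1m
ollama serve
```

The same behavior can also be requested per call with the `keep_alive` parameter on the API's generate/chat endpoints.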

I've found opening a new chat works as well (w/o having to close/reopen the app).

You can also use `ollama ps` to see if some of it is being loaded into system memory instead of onto the GPU. Unfortunately your GPU has just _barely_ enough...
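To check where a model's weights actually ended up, `ollama ps` reports the split in its PROCESSOR column (fully on GPU vs. partially offloaded to system memory):

```shell
# Lists currently loaded models; the PROCESSOR column shows the CPU/GPU
# split, and SIZE shows how much memory the loaded model occupies.
ollama ps
```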

I'm going to go ahead and close this. You shouldn't need to specify the `TEMPLATE` as it should get autodetected.
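For context, importing weights without a `TEMPLATE` directive looks like the sketch below; the chat template is picked up from the model's own metadata. The path here is a hypothetical example:

```
# Modelfile — TEMPLATE is omitted on purpose; it is autodetected
# from the model's metadata
FROM /path/to/model-directory
```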

I ended up rebuilding the q4_1 weights and still ran into issues. After talking with the Llama team, it sounds like the model is really sensitive to certain quantizations, although they didn't give...

We ended up removing the quantization. I think there was probably an issue also w/ the kv cache. There are some changes coming to improve kv cache performance and I'm...

Going to close this as a dupe.

The safetensors architectures that are currently supported are:

- Llama 2 and 3 (not the vision models yet unfortunately)
- Gemma 1 and 2
- Bert
- Mixtral
- Phi3
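Importing one of the supported architectures is done with `ollama create` pointed at a Modelfile whose `FROM` references the local safetensors directory. A sketch with hypothetical names:

```shell
# Build an Ollama model from a local safetensors checkpoint,
# then run it to verify the import worked
ollama create my-model -f ./Modelfile
ollama run my-model
```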