
ollama inference should verify models are downloaded before serving

Open dltn opened this issue 1 year ago • 1 comments

Right now, if you start a distribution using remote::ollama without the models downloaded, ollama will attempt to download the model upon the first inference request:

INFO:     Uvicorn running on http://[::]:5001 (Press CTRL+C to quit)
INFO:     ::1:60686 - "POST /inference/chat_completion HTTP/1.1" 200 OK
Pulling model: llama3.1:8b-instruct-fp16
10:16:19.251 [INFO] [chat_completion] HTTP Request: GET http://localhost:11434/api/ps "HTTP/1.1 200 OK"
Generator cancelled

However, these models are big – e.g. 16GB for the default llama3.1:8b-instruct-fp16 – so the HTTP request will time out before the download completes (and abort the download).

We should ensure that the models are downloaded and ready on server startup before serving requests.
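As a rough illustration of the proposed startup check (a minimal sketch, not llama-stack's actual code), the server could compare each configured model against the list Ollama reports and fail fast before accepting requests. The function name `ensure_model_available` and the model names are hypothetical; in practice the available list would come from Ollama's `GET /api/tags` endpoint rather than being hard-coded:

```python
def ensure_model_available(required: str, available: list[str]) -> None:
    """Raise ValueError at startup if `required` is not already pulled.

    `available` would normally be fetched from Ollama's /api/tags endpoint.
    """
    if required not in available:
        raise ValueError(
            f"Model '{required}' is not available in Ollama. "
            f"Available models: {', '.join(available) or '(none)'}"
        )

# Succeeds when the model is present:
ensure_model_available(
    "llama3.1:8b-instruct-fp16",
    ["llama3.1:8b-instruct-fp16", "llama3.2:latest"],
)

# Fails fast when it is not, instead of silently pulling 16GB mid-request:
try:
    ensure_model_available("llama3.1:8b-instruct-fp16", ["llama3.2:latest"])
except ValueError as e:
    print(e)
```

Failing at startup surfaces the missing-model problem immediately, rather than letting the first inference request trigger a multi-gigabyte pull that the client timeout will abort.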

dltn avatar Oct 04 '24 14:10 dltn

I believe https://github.com/meta-llama/llama-stack/pull/446 fixed this issue. llama-stack now raises ValueError: Model 'llama3.2:3b-instruct-fp16' is not available in Ollama. Available models: llama3.2:latest when the requested model is not available in Ollama.

Should we close this?

leseb avatar Feb 06 '25 10:02 leseb