ollama inference should verify models are downloaded before serving
Right now, if you start a distribution using remote::ollama without the model downloaded, Ollama will attempt to download it upon the first inference request:
```
INFO: Uvicorn running on http://[::]:5001 (Press CTRL+C to quit)
INFO: ::1:60686 - "POST /inference/chat_completion HTTP/1.1" 200 OK
Pulling model: llama3.1:8b-instruct-fp16
10:16:19.251 [INFO] [chat_completion] HTTP Request: GET http://localhost:11434/api/ps "HTTP/1.1 200 OK"
Generator cancelled
```
However, these models are big – e.g. 16GB for the default llama3.1:8b-instruct-fp16 – so the HTTP request will time out before the download completes (aborting the download).
We should ensure that the models are downloaded and ready on server startup before serving requests.
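A minimal sketch of such a startup check, assuming the standard Ollama REST API (`GET /api/tags` returns the locally pulled models) – the function names and error wording here are illustrative, not the actual llama-stack implementation:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint


def list_ollama_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Return names of models already pulled into Ollama (GET /api/tags)."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        data = json.load(resp)
    return [m["name"] for m in data.get("models", [])]


def verify_model_available(model: str, available: list[str]) -> None:
    """Fail fast at server startup instead of triggering a blocking
    multi-GB pull on the first inference request."""
    if model not in available:
        raise ValueError(
            f"Model '{model}' is not available in Ollama. "
            f"Available models: {', '.join(available) or '(none)'}"
        )
```

At startup the server would call `verify_model_available(configured_model, list_ollama_models())` and refuse to serve requests until the operator pulls the model explicitly (e.g. `ollama pull llama3.1:8b-instruct-fp16`).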
I believe https://github.com/meta-llama/llama-stack/pull/446 fixed this issue; llama-stack now raises `ValueError: Model 'llama3.2:3b-instruct-fp16' is not available in Ollama. Available models: llama3.2:latest` when the requested model is not available in Ollama.
Should we close this?