ollama support
What would you like to be added:
ollama provides an SDK for integrations, so we can integrate with it easily. One benefit I can think of is that ollama maintains a set of quantized models we can leverage.
Why is this needed:
Ecosystem integration.
Completion requirements:
This enhancement requires the following artifacts:
- [ ] Design doc
- [ ] API change
- [x] Docs update
The artifacts should be linked in subsequent comments.
/kind feature
~Because ollama doesn't provide an HTTP server, one way to integrate with it is to support URIs with an ollama protocol and run inference with llama.cpp~
RE: it supports a REST server, see https://github.com/ollama/ollama/blob/main/docs/api.md
/assign @qinguoyi
I will finish this work by 11.2.
Hey @qinguoyi, if you have any design details, it would be better to share them in this issue so we can discuss them and avoid unnecessary refactorings. Thanks!
Let's look at some background:
- How llmaz runs
  - First download the model file, then run inference based on it
- What ollama supports
  - Running models directly, loading them from ollama's own repo
  - Importing custom model files that are not in the ollama repo:
    - GGUF
    - safetensors (imported directly, or imported with a tuned adapter)
    - quantizing other types of model files

So, considering llmaz, our goal is to support ollama inference with imported custom model files, including GGUF and safetensors (imported directly).
Let's look at the difficulty of implementing this.
According to the official docs (https://github.com/ollama/ollama/blob/main/docs/import.md), if we import a custom model file, we need to execute some shell commands after starting the ollama server:
- write a file named Modelfile
- ollama create modelName -f Modelfile
- ollama run modelName
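As a minimal sketch of those three steps (the path /models/model.gguf and the name mymodel are placeholders of mine, not from the docs):

```shell
# Write a minimal Modelfile pointing at a local GGUF file.
cat <<'EOF' > Modelfile
FROM "/models/model.gguf"
EOF
cat Modelfile
# With the ollama server already running, the remaining two steps would be:
#   ollama create mymodel -f Modelfile
#   ollama run mymodel
```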
Let's look at the official ollama image's command.
According to the image inspect output, the image only starts the ollama server. So the difficulty is how to execute multiple commands and import custom model files for inference while starting the image.
Let's see how to do it.
If you want to execute multiple commands, use shell rather than another language like Python, because almost all images ship with a shell such as sh or bash.
We have two containers: an init container that downloads models and a main container that starts the inference service. So we have two options:
- Rebuild a new image based on the official image and inject the shell script into it. This is not flexible enough: if the official image is updated, we need to rebuild ours.
- Add logic to copy the script file into the models directory when the init container downloads the model, i.e. mount the script directory into the models directory. This can also be extended with more scripts in the future.
In summary, we choose the second approach. The script is as follows:
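A sketch of the copy step the init container would run (the directory layout below is an assumption based on the mount path used in the manifest; mktemp stands in for the shared /workspace/models volume):

```shell
# In the init container, after the model download completes, drop the
# startup script next to the models so the main container can find it.
MODELS_DIR=$(mktemp -d)   # stands in for the shared /workspace/models volume
mkdir -p "$MODELS_DIR/llmaz-scripts"
cat <<'EOF' > "$MODELS_DIR/llmaz-scripts/start_ollama.sh"
#!/bin/bash
echo "placeholder for the real startup script"
EOF
chmod +x "$MODELS_DIR/llmaz-scripts/start_ollama.sh"
ls "$MODELS_DIR/llmaz-scripts"
```

The main container then invokes the script from the shared volume, so no image rebuild is needed when the script changes.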
#!/bin/bash
# Start the ollama server in the background.
ollama serve &
# Give the server a moment to come up.
sleep 5

# Check input parameters.
if [ -z "$1" ]; then
    echo "please input a GGUF model file path, e.g.: ./start_ollama.sh /path/to/model.gguf mymodel"
    exit 1
fi
MODEL_PATH=$1

# Determine whether the input is a file path or a directory path.
if [ -f "$MODEL_PATH" ]; then
    echo "input file path: $MODEL_PATH"
    # A file must have the .gguf suffix.
    if [[ "$MODEL_PATH" == *.gguf ]]; then
        echo "file exists and has the .gguf suffix: $MODEL_PATH"
    else
        echo "file exists but does not have the .gguf suffix: $MODEL_PATH"
        exit 1
    fi
elif [ -d "$MODEL_PATH" ]; then
    echo "input dir path: $MODEL_PATH"
    # A directory must contain at least one .safetensors file.
    SAFETENSORS_FILES=$(find "$MODEL_PATH" -type f -name "*.safetensors")
    if [ -z "$SAFETENSORS_FILES" ]; then
        echo "dir exists but contains no .safetensors files"
        exit 1
    else
        echo "dir exists and contains .safetensors files:"
        echo "$SAFETENSORS_FILES"
    fi
else
    echo "input path is neither a file nor a directory: $MODEL_PATH"
    exit 1
fi

# Create the Modelfile.
MODEL_FILE="Modelfile"
cat <<EOF > $MODEL_FILE
FROM "$MODEL_PATH"
EOF
echo "create Modelfile success"
cat $MODEL_FILE

# Run ollama create with the model name from the second argument.
if [ -z "$2" ]; then
    echo "please input a model name"
    exit 1
fi
MODEL_NAME=$2
ollama create "$MODEL_NAME" -f Modelfile
if [ $? -ne 0 ]; then
    echo "ollama create failed"
    exit 1
fi

# Run the model.
ollama run "$MODEL_NAME"

# Keep the shell alive so the container does not exit.
while true; do
    sleep 3600
done
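The path-validation logic above can be exercised on its own. Here it is extracted into a small helper (the function name is mine, not part of the script) so the file/directory branches can be tested without starting ollama:

```shell
# validate_model_path mirrors the script's checks: a file must end in
# .gguf, a directory must contain at least one .safetensors file.
validate_model_path() {
    local p="$1"
    if [ -f "$p" ]; then
        [[ "$p" == *.gguf ]] && echo "gguf" && return 0
        echo "invalid"; return 1
    elif [ -d "$p" ]; then
        if [ -n "$(find "$p" -type f -name '*.safetensors')" ]; then
            echo "safetensors"; return 0
        fi
        echo "invalid"; return 1
    fi
    echo "invalid"; return 1
}

TMP=$(mktemp -d)
touch "$TMP/model.gguf" "$TMP/notes.txt"
mkdir -p "$TMP/weights" && touch "$TMP/weights/part-0.safetensors"
validate_model_path "$TMP/model.gguf"          # prints: gguf
validate_model_path "$TMP/weights"             # prints: safetensors
validate_model_path "$TMP/notes.txt" || true   # prints: invalid
```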
Let's see the result.
Here we take GGUF file mounting as an example. To start faster, we use the minimized image alpine/ollama:latest.
- playground.yaml
{{- if .Values.backendRuntime.install -}}
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: ollama
spec:
  commands:
    - sh
    - /workspace/models/llmaz-scripts/start_ollama.sh
  image: alpine/ollama
  version: latest
  # Do not edit the preset argument name unless you know what you're doing.
  # Free to add more arguments with your requirements.
  args:
    - name: default
      flags:
        - "{{`{{ .ModelPath }}`}}"
        - "{{`{{ .ModelName }}`}}"
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 4
      memory: 8Gi
{{- end }}
Let's port-forward port 11434 to 8080:
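The forwarding and a smoke-test request might look like this (the service name svc/ollama is an assumption; the request body follows ollama's documented /api/generate schema):

```shell
# Forward the in-cluster ollama port to localhost (run against your cluster):
#   kubectl port-forward svc/ollama 8080:11434
# Build the request body for ollama's generate endpoint.
cat <<'EOF' > request.json
{"model": "mymodel", "prompt": "Why is the sky blue?"}
EOF
cat request.json
# Send it (needs the port-forward above to be active):
#   curl http://localhost:8080/api/generate -d @request.json
```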
So, this is my idea for supporting ollama. I would like to hear more ideas on how to support it elegantly. PTAL @kerthcet
Thanks for the detailed information, it's really clear. Given that ollama is mostly designed for local deployment rather than the cloud, and that it's based on llama.cpp, which we already support, my suggestion is to start with the simplest approach and see whether it's popular with users, then step to the next level based on feedback, rather than making it perfect on day one. So maybe we can start by ignoring the Modelfile and running the ollama command directly? That way, we can leverage the models in the ollama library.
Again, from what I've learned so far, I haven't seen many users deploy ollama in the cloud. This is a suboptimal solution we're pursuing just because it's easy to integrate with inference backends, so I'd make it a TODO work item. wdyt?
Thanks for your kind reply.
I figured out what had confused me so much: I thought the Modelfile was mandatory.
In addition, I have no idea how to ignore the Modelfile.
For example, we could add an Ignore field to Playground, and when it is true, we run the Playground without binding a model?
But the playground, service, and backendruntime controllers have a lot of code binding model and model[0], so there would be a lot of work to ignore the model.
Do you have any suggestions for the implementation?
A simple implementation would look like:
- Use the ollama image as the base image
- The model would look like below, so we know we're importing models from ollama:

  source:
    uri: ollama://qwen2:0.5b

- The command would look like ollama run qwen2:0.5b, which is templated via the backendRuntime
- We can run inference against the model via a request like below; of course, we need to change the port:

  curl http://localhost:11434/api/generate -d '{ "model": "qwen2:0.5b", "prompt":"Why is the sky blue?" }'

- Once we know we're serving models from ollama, we no longer need an init container to download the model beforehand. As mentioned, this is the simplest implementation, with no cache for the moment; we can add it anytime users ask for it.
Any suggestions?
I fully agree; this seems like the least invasive solution. I'll work on getting it done as soon as possible.
I will finish this work by 11.2.
I am sorry for the late commit. I made a PR here: https://github.com/InftyAI/llmaz/pull/193, PTAL @kerthcet, thanks.
Could we close this issue? @kerthcet
Yes, we can. One tip: you can write the PR description like fix #xxx, and the issue will be closed as soon as the PR is merged. Better not to remove the fix.
/close