
ollama support

Open kerthcet opened this issue 1 year ago • 2 comments

What would you like to be added:

ollama provides an SDK for integrations, so we can easily integrate with it. One benefit I can think of is that ollama maintains a bunch of quantized models we can leverage.

Why is this needed:

Ecosystem integration.

Completion requirements:

This enhancement requires the following artifacts:

  • [ ] Design doc
  • [ ] API change
  • [x] Docs update

The artifacts should be linked in subsequent comments.

kerthcet avatar Aug 17 '24 00:08 kerthcet

/kind feature

kerthcet avatar Aug 17 '24 00:08 kerthcet

~Because ollama doesn't provide an HTTP server, one way to integrate with it is to support URIs with an ollama protocol and run inference with llama.cpp~

RE: it supports a REST server, see https://github.com/ollama/ollama/blob/main/docs/api.md

kerthcet avatar Aug 17 '24 00:08 kerthcet

/assign @qinguoyi

qinguoyi avatar Oct 27 '24 13:10 qinguoyi

I will finish this work by 11.2.

qinguoyi avatar Oct 27 '24 13:10 qinguoyi

Hey @qinguoyi, if you have any design details, it's better to share them in this issue so we can discuss them and avoid unnecessary refactorings. Thanks!

kerthcet avatar Oct 28 '24 02:10 kerthcet

Let's look at some background first.

  • How llmaz runs
    • It first downloads the model file, then runs inference based on that file.
  • What ollama supports
    • Running models directly, loading them from ollama's own model library.
    • Importing custom model files that are not in the ollama library:
      • GGUF
      • safetensors (imported directly, or imported with a tuned adapter)
      • quantizing other model file formats

So, considering llmaz, our goal is to support inference with custom model files imported into ollama, including GGUF and safetensors (imported directly).

Let's look at the difficulty of implementing this.

According to the official docs (https://github.com/ollama/ollama/blob/main/docs/import.md), importing a custom model file requires executing some shell commands after starting the ollama server:

  • write a file named Modelfile
  • ollama create modelName -f Modelfile
  • ollama run modelName
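The three steps above can be sketched as a small script. This is only a sketch: the model path and name are hypothetical examples, and the `ollama` commands are echoed rather than executed so it can run without a live server.

```shell
#!/bin/sh
# Sketch of the manual import flow from the ollama import docs.
# MODEL_PATH and MODEL_NAME below are hypothetical examples.
MODEL_PATH=/tmp/model.gguf
MODEL_NAME=mymodel

# 1. Write a Modelfile pointing at the local weights
cat > Modelfile <<EOF
FROM $MODEL_PATH
EOF

# 2. Register the model with the running ollama server
#    (echoed here; a real run needs `ollama serve` up first)
echo ollama create "$MODEL_NAME" -f Modelfile

# 3. Start the imported model
echo ollama run "$MODEL_NAME"
```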

Now let's look at the official ollama image's command. (screenshot of the image inspect output)

According to the image inspect output, the image's entrypoint only starts the ollama server. So the difficulty is how to execute multiple commands and import custom model files for inference while starting the image.

Let's see how to do it

If you want to execute multiple commands, prefer a shell script over other languages like Python, because almost all images ship a shell such as sh or bash.

We have two containers: an init container for downloading models and a main container for starting the inference service. Given that, we have two possible solutions.

  • The first is to rebuild a new image based on the official one and inject the shell script into it. This is not flexible enough: if the official image is updated, we need to rebuild ours.
  • The second is to add logic so the init container copies the script file into the models directory when it downloads the model; that is, the script directory is mounted under the models directory, which can also be extended with more scripts in the future.
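The second approach amounts to a small copy step in the init container. The sketch below uses an assumed mount path standing in for the shared models volume, and writes a placeholder script so it is self-contained; in reality the launcher script would ship inside the init-container image.

```shell
#!/bin/sh
# Hypothetical init-container step: after downloading the model,
# drop the launcher script into a subdirectory of the shared models
# volume so the main container can execute it later.
MODELS_DIR=/tmp/workspace/models      # shared volume mount (path assumed)
SCRIPTS_DIR="$MODELS_DIR/llmaz-scripts"

mkdir -p "$SCRIPTS_DIR"
# placeholder content; the real script is baked into the init image
printf '#!/bin/sh\necho placeholder\n' > "$SCRIPTS_DIR/start_ollama.sh"
chmod +x "$SCRIPTS_DIR/start_ollama.sh"
ls "$SCRIPTS_DIR"
```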


In summary, we choose the second method to implement it. The specific script commands are as follows:

#!/bin/bash
# start the ollama server in the background
ollama serve &

# give the server a moment to become ready
sleep 5

# check input params
if [ -z "$1" ]; then
    echo "please input a GGUF model file path, e.g.: ./start_ollama.sh /path/to/model.gguf mymodel"
    exit 1
fi
if [ -z "$2" ]; then
    echo "please input a model name"
    exit 1
fi

MODEL_PATH=$1
MODEL_NAME=$2

# determine whether the input is a file path or a directory path
if [ -f "$MODEL_PATH" ]; then
    echo "input file path: $MODEL_PATH"
    # a regular file must have the .gguf suffix
    if [[ "$MODEL_PATH" == *.gguf ]]; then
        echo "file exists and has the .gguf suffix: $MODEL_PATH"
    else
        echo "file exists but does not have the .gguf suffix: $MODEL_PATH"
        exit 1
    fi
elif [ -d "$MODEL_PATH" ]; then
    echo "input dir path: $MODEL_PATH"
    # a directory must contain at least one .safetensors file
    SAFETENSORS_FILES=$(find "$MODEL_PATH" -type f -name "*.safetensors")
    if [ -z "$SAFETENSORS_FILES" ]; then
        echo "dir exists but contains no .safetensors files"
        exit 1
    else
        echo "dir exists and contains .safetensors files:"
        echo "$SAFETENSORS_FILES"
    fi
else
    echo "input path is neither a file nor a directory: $MODEL_PATH"
    exit 1
fi

# create the Modelfile
MODEL_FILE="Modelfile"
cat <<EOF > $MODEL_FILE
FROM "$MODEL_PATH"
EOF

echo "Modelfile created successfully"
cat $MODEL_FILE

# register the model with ollama
ollama create "$MODEL_NAME" -f "$MODEL_FILE"
if [ $? -ne 0 ]; then
    echo "ollama create failed"
    exit 1
fi

# load the model
ollama run "$MODEL_NAME"

# keep the shell alive so the container process does not exit
while true; do
    sleep 3600
done
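The file-vs-directory classification at the heart of the script can be exercised in isolation. This standalone sketch uses temporary paths so it runs without a real model or an ollama server:

```shell
#!/bin/sh
# Standalone reproduction of the script's input classification.
classify() {
    if [ -f "$1" ]; then
        case "$1" in
            *.gguf) echo gguf-file ;;
            *)      echo unsupported-file ;;
        esac
    elif [ -d "$1" ]; then
        if find "$1" -type f -name '*.safetensors' | grep -q .; then
            echo safetensors-dir
        else
            echo unsupported-dir
        fi
    else
        echo missing
    fi
}

tmp=$(mktemp -d)
touch "$tmp/model.gguf"
mkdir -p "$tmp/st" && touch "$tmp/st/weights.safetensors"

classify "$tmp/model.gguf"   # gguf-file
classify "$tmp/st"           # safetensors-dir
classify "$tmp/nope"         # missing
```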

Let's see the result,

Here we take GGUF file mounting as an example. To start faster, we use the minimized image alpine/ollama:latest.

  • playground.yaml
{{- if .Values.backendRuntime.install -}}
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: ollama
spec:
  commands:
    - sh
    - /workspace/models/llmaz-scripts/start_ollama.sh
  image: alpine/ollama
  version: latest
  # Do not edit the preset argument name unless you know what you're doing.
  # Free to add more arguments with your requirements.
  args:
    - name: default
      flags:
        - "{{`{{ .ModelPath }}`}}"
        - "{{`{{ .ModelName }}`}}"
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 4
      memory: 8Gi
{{- end }}

Let's port-forward 11434 to 8080: (screenshots of the requests and responses)

So, this is my idea for supporting ollama. I'd like to hear any ideas for a more elegant approach. PTAL @kerthcet

qinguoyi avatar Oct 28 '24 05:10 qinguoyi

Thanks for the detailed information, it's really clear. Given that ollama is mostly designed for local deployment rather than the cloud, and that it's based on llama.cpp, which we already support, my suggestion is to start with the simplest approach and see whether it's popular with users, then step to the next level based on feedback, rather than try to make it perfect on day 1. So maybe we can start by ignoring the Modelfile and running the ollama command directly? That way, we can leverage the models in the ollama library.

Again, from what I've learned so far, I haven't seen many users deploy ollama in the cloud, so this is a suboptimal solution; I'm making it a TODO just because it's easy to integrate with inference backends. WDYT?

kerthcet avatar Oct 28 '24 08:10 kerthcet

Thanks for your kind reply.

I figured out what had confused me so much: I thought the Modelfile was mandatory.

In addition, I have no idea how to ignore the Modelfile.

For example, we could add an Ignore field to the playground; when ignore is true, we would only run the playground without binding a model?

But the playground, service, and backendruntime controllers have a lot of code binding model and model[0], so there would be a lot of work to ignore the model.

Do you have any suggestions for the implementation?

qinguoyi avatar Oct 28 '24 13:10 qinguoyi

A simple implementation would look like:

  • Use the ollama image as the base image.
  • The model spec would look like below, so we know we're importing models from ollama:
      source:
        uri: ollama://qwen2:0.5b
    
  • The command would look like ollama run qwen2:0.5b, which is templated via the backendRuntime.
  • We can run inference against the model via a request like the one below; of course, we need to change the port.
    curl http://localhost:11434/api/generate -d '{
    "model": "qwen2:0.5b",
    "prompt":"Why is the sky blue?"
    }'
    
  • Once we know we're running models from ollama, there's no longer a need to add an init container to download the model beforehand. As mentioned, this is the simplest implementation, with no cache for the moment; we can add one whenever users ask for it.
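The URI-based flow above could be sketched as follows. The `ollama://` prefix stripping is an assumption about how the backendRuntime template might derive the model name; the URI value is the example from the discussion.

```shell
#!/bin/sh
# Sketch: map an ollama:// model URI to an ollama CLI invocation.
URI="ollama://qwen2:0.5b"
MODEL="${URI#ollama://}"      # strip the scheme prefix
echo "$MODEL"                 # qwen2:0.5b

# the templated command would then be, e.g.:
# ollama serve &
# ollama run "$MODEL"
```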

Any suggestions?

kerthcet avatar Oct 29 '24 03:10 kerthcet

I fully agree, this seems like the least invasive solution. I'll work on getting it done as soon as possible.

qinguoyi avatar Oct 29 '24 08:10 qinguoyi

> I will finish this work by 11.2.

I'm sorry for the late commit. I made a PR here: https://github.com/InftyAI/llmaz/pull/193. PTAL @kerthcet, thanks.

qinguoyi avatar Nov 03 '24 09:11 qinguoyi

Could we close this issue? @kerthcet

qinguoyi avatar Nov 11 '24 12:11 qinguoyi

Yes, we can. One tip: you can set the PR description to something like fix #xxx, and the issue will be closed as soon as the PR is merged. Better not to remove the fix keyword.

/close

kerthcet avatar Nov 12 '24 02:11 kerthcet