
Supervisor

Propfend opened this issue 1 year ago · 2 comments


With Supervisor we can manage llama.cpp servers at the OS level with syscalls.

Supervisor will have two services. One of them serves an HTTP API to manage the llama.cpp server through an endpoint, v1/params; the other applies the changes the user POSTs: the ApplyingService receives valid arguments from the ManagingService over a channel. Note: it is the user's responsibility to pass valid llama.cpp arguments and to manage ports so that the servers do not behave unexpectedly.
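
A minimal sketch of how the two services might be wired together, assuming Tokio mpsc channels (the struct and field names here are illustrative, not necessarily the PR's):

use std::collections::HashMap;

use tokio::sync::mpsc;

// Arguments already validated by the ManagingService.
#[derive(Debug)]
struct LlamacppArgs {
    binary: String,
    args: HashMap<String, String>,
}

#[tokio::main]
async fn main() {
    // The ManagingService (HTTP side) holds the sender,
    // the ApplyingService holds the receiver.
    let (args_tx, mut args_rx) = mpsc::channel::<LlamacppArgs>(16);

    // ApplyingService: reacts to every set of validated arguments.
    let applying_service = tokio::spawn(async move {
        while let Some(new_args) = args_rx.recv().await {
            // The real service would restart llama.cpp here.
            println!("applying {new_args:?}");
        }
    });

    // Stand-in for a POST to /v1/params handled by the ManagingService.
    args_tx
        .send(LlamacppArgs {
            binary: "/path/to/your/llamacpp/binary".into(),
            args: HashMap::from([("--port".into(), "8089".into())]),
        })
        .await
        .unwrap();

    drop(args_tx); // closing the channel lets the ApplyingService finish
    applying_service.await.unwrap();
}

Keeping validation in the ManagingService means the ApplyingService only ever sees arguments that already passed the checks.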

Usage

cargo r supervise \
    --supervisor-addr "localhost:8084" \
    --monitoring-interval 3 \
    --args /path/to/your/llamacpp/binary \
    -m /path/to/your/gguf/model \
    --port 8089

Where:

  • supervisor-addr is the address of your ManagingService.
  • monitoring-interval is the time period (in seconds) that the supervisor will use to verify the liveness of the llama.cpp server instance.
  • args is everything you would usually pass to your llamacpp binary when you run it, with the binary path as the first argument.
  • port is the localhost port where llama.cpp will run and be supervised.

After that, both ports (the Supervisor's and llama.cpp's) will be available.


After running the above command, you can make a POST request with a JSON body:

curl --json '{
    "args": {
        "-m": "/path/to/your/new/model",
        "--port": "8081",
        "binary": "/path/to/your/new/llamacpp/binary"
    }
}' http://127.0.0.1:8084/v1/params
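
On the Supervisor side, a body like that might be deserialized into something along these lines (a hedged sketch assuming serde/serde_json; the struct name is made up for illustration):

use std::collections::HashMap;

use serde::Deserialize;

// Body of POST /v1/params: a flat map of llama.cpp arguments,
// including the special "binary" key.
#[derive(Debug, Deserialize)]
struct ParamsRequest {
    args: HashMap<String, String>,
}

fn main() {
    let body = r#"{"args": {"-m": "/path/to/model.gguf", "--port": "8081"}}"#;
    let request: ParamsRequest = serde_json::from_str(body).unwrap();
    println!("{:?}", request.args.get("--port"));
}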

A log message will be displayed saying that the changes were applied successfully.


llama.cpp will keep running with the new changes applied.


If any of your args are wrong, the changes will not be applied; the Supervisor and the llama.cpp instance will keep running as is.


If llama.cpp goes down, a new log message will be printed saying that the llama.cpp instance was revived.


Considerations:

Should the llama.cpp supervisor be optionally compiled?

Should the llama.cpp supervisor be optionally compiled, given that it is not required for Paddler's main purpose (load balancer and reverse proxy)? See #32.

The etcd configuration-driver type will be behind conditional compilation with the etcd feature flag, since it brings in additional dependencies.

How should Paddler download binaries?

Predictions and more details about Paddler downloading binaries: see #32.

Complete draft about supervisor feature

See #32 for the complete draft of the supervisor feature.

This is a draft

This is just a draft, and this PR is error prone. As you just saw, there are some open questions that need to be resolved before the feature is fully added. This is not meant to be definitive documentation, nor 100% correct.

Update

Updates and enhancements to Supervisor. Supervisor is currently functional, but we have some improvements; see comment.

Monitoring interval

Now Supervisor does not need to monitor on an interval; it knows right away when a llama.cpp instance goes down. See commit
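
A hedged sketch of the idea (not the actual commit's code): instead of polling on a timer, the supervisor can await the child process handle, which resolves the moment llama.cpp exits:

use tokio::process::Command;

#[tokio::main]
async fn main() {
    loop {
        // Paths and arguments are illustrative.
        let mut child = Command::new("/path/to/llama-server")
            .args(["-m", "/path/to/model.gguf", "--port", "8089"])
            .spawn()
            .expect("failed to start llama.cpp");

        // No monitoring interval: this future completes as soon as
        // the process exits, so the supervisor reacts immediately.
        let status = child.wait().await.expect("failed to wait on llama.cpp");
        eprintln!("llama.cpp exited with {status}, reviving it");
    }
}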

Default and persistent configuration

We currently have a problem with the Supervisor runtime itself: if you start Supervisor and want it to always have the same config, even after it goes down for some reason, you currently can't. So we need to persist that configuration, and it can't be inherent to the Supervisor's state, as that is reset every time. We have KV database options like etcd, or even file storage.

The default arguments will be just the ones required to run in the simplest way possible: --model, --binary and --port; if you want to add other options, you can use the /v1/params endpoint. File storage will be the default, behind the --file flag, which accepts a PathBuf. The server configuration storage is behind the --etcd flag, which can be enabled with the custom etcd feature and accepts the SocketAddr of an existing, running etcd server. Using both the --file and --etcd flags will cause a panic. See commit
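
A rough sketch of what that driver selection could look like (the enum and function are illustrative, not the commit's actual code):

use std::net::SocketAddr;
use std::path::PathBuf;

/// Where persisted Supervisor configuration lives.
enum ConfigDriver {
    /// Default: a config file on disk.
    File { path: PathBuf },
    /// Only compiled in with the `etcd` feature enabled,
    /// because it pulls in additional dependencies.
    #[cfg(feature = "etcd")]
    Etcd { addr: SocketAddr },
}

fn select_driver(file: Option<PathBuf>, etcd: Option<SocketAddr>) -> ConfigDriver {
    match (file, etcd) {
        // Using both flags at once is not allowed.
        (Some(_), Some(_)) => panic!("--file and --etcd are mutually exclusive"),
        (Some(path), _) => ConfigDriver::File { path },
        #[cfg(feature = "etcd")]
        (None, Some(addr)) => ConfigDriver::Etcd { addr },
        // Fall back to the default file-based storage. (With the `etcd`
        // feature disabled, an --etcd value also lands here; real code
        // would reject it instead.)
        _ => ConfigDriver::File {
            path: PathBuf::from("/usr/local/paddler/config.toml"),
        },
    }
}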

An example command could be the following:

cargo r supervise \
    --supervisor-addr "localhost:8083" \
    --binary llama-server \
    --model /usr/local/models/qwen2-500m.gguf \
    --port 8081 \
    --config-driver '{"type": "file", "path": "/usr/local/paddler/config.toml", "name": "gpt"}'

or with the etcd feature enabled:

cargo r --features etcd supervise \
    --supervisor-addr "localhost:8083" \
    --binary llama-server \
    --model /usr/local/models/qwen2-500m.gguf \
    --port 8081 \
    --config-driver '{"type": "etcd", "addr": "localhost:2379", "name": "deepseek"}'

The startup flow will look like this:

graph TD
    A{Client} -- Start llama.cpp --> B{Supervisor}
    B -- Has configuration? --> C[Configuration]
    C -- Yes --> E[Supervisor]
    C -- No --> D[Supervisor ]
    D -- Use default configuration --> F["Persisted Config"]
    E -- Use available configuration --> F

You can have more than one Supervisor configuration per file/server.


So even if your --port, --binary and --model are wrong, Supervisor will try to use the configuration inside /usr/local/paddler/config/configuration.toml first. If /usr/local/paddler/config/configuration.toml does not exist, the default arguments will be used instead; after that, of course, they are persisted in /usr/local/paddler/config/configuration.toml for next time.
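
In Rust-flavoured pseudocode, that startup decision might look like this (a sketch with a toy line-based format standing in for the real TOML handling):

use std::fs;
use std::path::Path;

#[derive(Clone, Debug)]
struct SupervisorConfig {
    binary: String,
    model: String,
    port: u16,
}

fn load_or_default(path: &Path, cli_defaults: SupervisorConfig) -> SupervisorConfig {
    // Prefer the persisted configuration if it exists,
    // otherwise fall back to the defaults from the CLI.
    let config = fs::read_to_string(path)
        .ok()
        .and_then(|text| parse_config(&text))
        .unwrap_or(cli_defaults);

    // Either way, persist the result for the next start.
    let serialized = format!("{}\n{}\n{}\n", config.binary, config.model, config.port);
    fs::write(path, serialized).expect("failed to persist configuration");
    config
}

// Toy format: one value per line. The real driver would parse TOML (or talk to etcd).
fn parse_config(text: &str) -> Option<SupervisorConfig> {
    let mut lines = text.lines();
    Some(SupervisorConfig {
        binary: lines.next()?.to_string(),
        model: lines.next()?.to_string(),
        port: lines.next()?.parse().ok()?,
    })
}

fn main() {
    let defaults = SupervisorConfig {
        binary: "llama-server".into(),
        model: "/usr/local/models/qwen2-500m.gguf".into(),
        port: 8081,
    };
    println!("{:?}", load_or_default(Path::new("/tmp/paddler-config"), defaults));
}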

Debouncing requests

As mentioned in the comment below:

For example, let's say we have a throttle of ~200ms, and restarting llama.cpp with new parameters takes 2 seconds. Let's imagine a scenario where a few requests come in at these times:

0ms POST /args {"model": "foo.gguf"}
20ms POST /args {"model": "baz.gguf", "cb": 5}
100ms POST /args {"model": "foo.gguf"}
200ms POST /args {"cb":6}

So now, if we had a throttle of ~200ms, all those requests would be combined into just one.

{"model":"foo.gguf","cb":6}

We won't need a new node, because ApplicationService doesn't serve anything; it communicates with ManagementService over channels, so the request batches from ManagementService should be batched there. Incoming requests from the HTTP endpoint are debounced with some throttle of x (see commit), and the arguments go as-is to ApplicationService. We will keep the same logic as before to "validate" arguments: we won't keep arguments that don't work, which is why we need to send another channel from ApplicationService to ConfigurationService; ConfigurationService will then persist these configs. So the entire flow will be:

sequenceDiagram
  participant C as Client
  participant M as Management Service
  participant A as Application Service
  participant L as "llama.cpp"
  participant P as Configuration Service
  participant O as "Persistent Config"

  C ->> M: Change model to "foo.gguf"
  M ->> A: Change model to "foo.gguf"
  A ->> L: Change model to "foo.gguf"
  A ->> P: Model changed to "foo.gguf"
  P ->> O: model -> "foo.gguf"
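
A hedged sketch of the debounce-and-merge step described above (types and the throttle value are illustrative):

use std::collections::HashMap;
use std::time::Duration;

use tokio::sync::mpsc;
use tokio::time::timeout;

type Args = HashMap<String, String>;

fn args(pairs: &[(&str, &str)]) -> Args {
    pairs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}

/// Collects args requests until the channel stays quiet for the whole
/// throttle window, merging later values over earlier ones.
async fn debounce_args(rx: &mut mpsc::Receiver<Args>, throttle: Duration) -> Option<Args> {
    let mut merged = rx.recv().await?; // wait for the first request
    while let Ok(Some(next)) = timeout(throttle, rx.recv()).await {
        merged.extend(next); // later keys overwrite earlier ones
    }
    Some(merged)
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<Args>(16);
    tokio::spawn(async move {
        // The four example requests from above, sent back to back.
        tx.send(args(&[("model", "foo.gguf")])).await.unwrap();
        tx.send(args(&[("model", "baz.gguf"), ("cb", "5")])).await.unwrap();
        tx.send(args(&[("model", "foo.gguf")])).await.unwrap();
        tx.send(args(&[("cb", "6")])).await.unwrap();
    });
    // Prints the merged request: {"model": "foo.gguf", "cb": "6"}.
    println!("{:?}", debounce_args(&mut rx, Duration::from_millis(200)).await);
}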

— Propfend, Jan 07 '25 15:01

Overall, that PR looks good so far, but we need to figure out the following:

  1. Stress tests (to ensure the supervisor can handle parameter requests coming faster than the supervisor can restart llama.cpp). We need to nail the supervisor down and make sure it handles all the edge cases, passes performance tests, etc., because it will be the basis for fleets later.
  2. Make sure llama.cpp is always started with the latest config (supervisor does not forget the config after it itself restarts - it shouldn't in theory, but that is still the case we need to handle).
  3. Maybe move default parameters somewhere else; implement the configuration store from the start? (Because the supervisor keeps parameters in memory, when it restarts it should start with the latest parameters that the user set in the API.) We need to handle the case when the paddler supervisor is restarted and ensure it remembers the parameters set by the end user.
  4. Skip the 3-sec monitoring interval; it should be able to determine if llama.cpp is up or down immediately.

— mcharytoniuk, Feb 08 '25 13:02

Let us call a request to the supervisor "args request" (the kind of request the user sends to update the parameters of a running llama.cpp instance). Then the flow would look somewhat like this:

sequenceDiagram
    participant C as Client
    participant S as Supervisor
    participant L as "llama.cpp"
    participant O as "Persistent Config"
    C->>S: Change model to "foo.gguf"
    S->>L: Restart, load a new model
    S->>O: Current model is "foo.gguf"
    S->>C: llama.cpp is restarted

Pingora handles the throttling of those "args requests" itself, but we should treat the entire process as a transaction, and be sure it is treated as such and never interrupted by anything from the outside. That means we need to queue the incoming "args requests", introduce some throttle/debounce mechanism, and probably merge/batch them to avoid applying them unnecessarily multiple times.

We need some coordination mechanisms in the system between Pingora and the ApplyingService.

For example, let's say we have a throttle of ~200ms, and restarting llama.cpp with new parameters takes 2 seconds. Let's imagine a scenario where a few requests come in at these times:

0ms POST /args {"model": "foo.gguf"}
20ms POST /args {"model": "baz.gguf", "cb": 5}
100ms POST /args {"model": "foo.gguf"}
200ms POST /args {"cb":6}

So now, if we had a throttle of ~200ms, all those requests would be combined into just one.

{"model":"foo.gguf","cb":6}

In another situation, let's say we get requests like this:

200ms POST /args {"model": "foo.gguf"}
401ms POST /args {"model": "bar.gguf"}
602ms POST /args {"model": "baz.gguf"}

In that case:

  1. The throttle did not kick in, but llama.cpp should still restart only twice overall, because it waited 200ms after "foo.gguf".
  2. Then it should NOT apply "bar.gguf", because the "foo.gguf" update is still underway.
  3. In the meantime "baz.gguf" came in, so it should wait and apply "baz.gguf" instead of "bar.gguf".
  4. 2 seconds later, "foo.gguf" is applied; only then should it start applying "baz.gguf" (a sketch of this coordination follows below).
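
A hedged sketch of that "latest request wins while a restart is underway" coordination (illustrative, not the PR's code):

use std::collections::HashMap;
use std::time::Duration;

use tokio::sync::mpsc;

type Args = HashMap<String, String>;

/// Applies args requests one at a time. Requests that arrive during a
/// restart pile up in the channel; only the newest of them is applied.
async fn apply_loop(mut rx: mpsc::Receiver<Args>) {
    while let Some(mut current) = rx.recv().await {
        loop {
            restart_llamacpp(&current).await;
            // Drain everything that arrived during the restart,
            // keeping only the most recent pending request.
            let mut pending = None;
            while let Ok(newer) = rx.try_recv() {
                pending = Some(newer);
            }
            match pending {
                Some(next) => current = next, // apply the latest, skip the rest
                None => break,                // nothing pending; wait for new requests
            }
        }
    }
}

/// Stand-in for the 2-second llama.cpp restart.
async fn restart_llamacpp(args: &Args) {
    println!("restarting llama.cpp with {args:?}");
    tokio::time::sleep(Duration::from_secs(2)).await;
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel::<Args>(16);
    tokio::spawn(async move {
        for model in ["foo.gguf", "bar.gguf", "baz.gguf"] {
            let body = HashMap::from([("model".to_string(), model.to_string())]);
            tx.send(body).await.unwrap();
            tokio::time::sleep(Duration::from_millis(200)).await;
        }
    });
    // Restarts with foo.gguf, then baz.gguf; bar.gguf is skipped.
    apply_loop(rx).await;
}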

— mcharytoniuk, Feb 11 '25 13:02