Customized flags for BackendRuntimes
What would you like to be added:
Right now, a BackendRuntime supports at most two inferenceModes: one is Default, the other is SpeculativeDecoding. Since flags in inference engines are really complex, what if people want to define their own customized flags for easy reuse and refer to that mode in the backendRuntimeConfig?
Some of our users have little knowledge of the inference engine, so they have no idea how to set the flags to make it perform better; this feature would help them.
Generally it looks like:

```yaml
backendRuntimeConfig:
  mode: CustomizedOne
  resources:
    limits:
      cpu: 8
      memory: "16Gi"
```
```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: vllm
spec:
  args:
    - mode: Default
      flags:
        - --model
        - "{{ .ModelPath }}"
        - --served-model-name
        - "{{ .ModelName }}"
        - --host
        - "0.0.0.0"
        - --port
        - "8080"
    - mode: CustomizedOne # newly added
```
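For illustration, such a customized mode could bundle a tuned set of flags alongside the templated defaults. The sketch below is a hypothetical example, not part of the proposal itself; the mode name and the tuning values are assumptions, though the flags themselves are real vLLM engine arguments:

```yaml
# Hypothetical sketch of a user-defined mode in spec.args.
# The tuning values (tensor-parallel-size, gpu-memory-utilization)
# are illustrative assumptions only.
- mode: CustomizedOne
  flags:
    - --model
    - "{{ .ModelPath }}"
    - --served-model-name
    - "{{ .ModelName }}"
    - --tensor-parallel-size
    - "2"
    - --gpu-memory-utilization
    - "0.95"
```

A Playground would then pick this mode via `backendRuntimeConfig.mode: CustomizedOne`, the same way it selects Default or SpeculativeDecoding today.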
Why is this needed:
It makes the flags easier to manage and lets us provide best practices to users.
Completion requirements:
This enhancement requires the following artifacts:
- [x] Design doc
- [x] API change
- [x] Docs update
The artifacts should be linked in subsequent comments.
/kind feature
Waiting for feedback.
/close as completed.