Customized flags for BackendRuntimes
What would you like to be added:
Right now, a BackendRuntime supports at most two inferenceModes: one is Default, the other is SpeculativeDecoding. Since flags in inference engines are really complex, what if people want to define their own customized flags for easy reuse and refer to that mode in the backendRuntimeConfig?
Some of our users have little knowledge of the inference engine, so they have no idea how to set the flags to make it perform better; this feature would help them.
Generally it looks like:

```yaml
backendRuntimeConfig:
  mode: CustomizedOne
  resources:
    limits:
      cpu: 8
      memory: "16Gi"
```
```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: vllm
spec:
  args:
    - mode: Default
      flags:
        - --model
        - "{{ .ModelPath }}"
        - --served-model-name
        - "{{ .ModelName }}"
        - --host
        - "0.0.0.0"
        - --port
        - "8080"
    - mode: CustomizedOne # newly added
```
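For illustration, such a customized mode could bundle a tuned set of flags alongside the templated defaults. The sketch below is a hypothetical example, not part of the proposal itself; the mode name and the tuning values are assumptions, though the flags themselves are real vLLM engine arguments:

```yaml
# Hypothetical sketch of a user-defined mode in spec.args.
# The tuning values (tensor-parallel-size, gpu-memory-utilization)
# are illustrative assumptions only.
- mode: CustomizedOne
  flags:
    - --model
    - "{{ .ModelPath }}"
    - --served-model-name
    - "{{ .ModelName }}"
    - --tensor-parallel-size
    - "2"
    - --gpu-memory-utilization
    - "0.95"
```

A Playground would then pick this mode via `backendRuntimeConfig.mode: CustomizedOne`, the same way it selects Default or SpeculativeDecoding today.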
Why is this needed:
It makes the flags easier to manage and lets us provide best practices to users.
Completion requirements:
This enhancement requires the following artifacts:
- [x] Design doc
- [x] API change
- [x] Docs update
The artifacts should be linked in subsequent comments.
/kind feature
Waiting for feedback.
/close as completed.