llmaz
☸️ Easy, advanced inference platform for large language models on Kubernetes. 🌟 Star to support our work!
**What would you like to be added**: Right now we can download model weights from the model hub directly, but each time we start/restart a pod, it will download the model...
#### What this PR does / why we need it https://github.com/InftyAI/llmaz/issues/163#issue-2526060924 #### Which issue(s) this PR fixes None #### Special notes for your reviewer This PR mainly includes the following things:...
**What would you like to be added**: Take [Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3/tree/main) for example: it contains not only the chunked model weights but also consolidated model weights. When downloading models from Hugging Face,...
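The filtering this issue asks for can be sketched in plain Python. This is a minimal illustration, not llmaz's actual implementation; the pattern list below (which files count as duplicates) is an assumption based on the Mistral repo layout.

```python
import fnmatch

# Assumed duplicates to skip: Mistral-style repos ship a single
# consolidated weight file (and sometimes .pth copies) alongside
# the sharded .safetensors set that inference servers actually load.
DUPLICATE_PATTERNS = ["consolidated*", "*.pth"]

def files_to_download(repo_files):
    """Return the repo's file list minus weights that duplicate the sharded set."""
    return [
        f for f in repo_files
        if not any(fnmatch.fnmatch(f, pattern) for pattern in DUPLICATE_PATTERNS)
    ]
```

For a listing like `["model-00001-of-00003.safetensors", "consolidated.safetensors", "config.json"]`, only the sharded weights and config survive the filter.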
As the `service.Spec` describes, we have `minReplicas` and `maxReplicas`; what we hope to do is adjust the replica count based on traffic, a.k.a. serverless. We can use Ray or KEDA/Knative...
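A minimal sketch of what such a spec could look like. Only `minReplicas`/`maxReplicas` come from the issue text; the kind, API group, and every other field name here are illustrative assumptions, not the actual llmaz API:

```yaml
# Hypothetical inference service with autoscaling bounds.
apiVersion: inference.llmaz.io/v1alpha1   # assumed group/version
kind: Service
metadata:
  name: example-llm
spec:
  minReplicas: 1    # floor when traffic is low (0 would mean scale-to-zero)
  maxReplicas: 10   # ceiling under peak traffic
```

An external autoscaler (e.g. KEDA scaling on request-queue depth, or Knative's concurrency-based autoscaler) would then adjust replicas within these bounds.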
Add support for inference services.
**What would you like to be added**: Once we launch a model, we can simply co-launch a web UI for interaction. We can support LLMOps tools like Dify, but maybe we...
We have several integration tests for webhooks; however, they're very simple ones. We need more, e.g. covering the update cases.
**What would you like to be added**: Right now, model management is a tricky problem in the cluster: models are big, so we need to cache them on the node just...
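One common way to get node-level caching, sketched with plain Kubernetes primitives (the paths, names, and image here are illustrative, not llmaz's design): mount a `hostPath` volume so that a pod restarted on the same node finds the weights already on disk and skips the download.

```yaml
# Hypothetical sketch: persist downloaded weights on the node itself.
apiVersion: v1
kind: Pod
metadata:
  name: llm-server
spec:
  containers:
  - name: server
    image: example/inference-server:latest   # illustrative image
    volumeMounts:
    - name: model-cache
      mountPath: /workspace/models           # server reads weights from here
  volumes:
  - name: model-cache
    hostPath:
      path: /var/lib/llmaz/models            # survives pod restarts on this node
      type: DirectoryOrCreate
```

The trade-off is that the cache is per-node, so a pod rescheduled elsewhere still re-downloads; a shared read-many volume or a distributed cache avoids that at the cost of more infrastructure.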
I came across the roadmap and am particularly interested in the Gateway API section. Will llmaz support advanced traffic-management features, such as shadow and canary deployments between different model...
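For context, canary-style splitting between two model deployments is already expressible with the upstream Gateway API via weighted `backendRefs`; the service names and port below are illustrative.

```yaml
# Standard Gateway API HTTPRoute: send 90% of traffic to the current
# model's Service and 10% to the canary. Backend names are hypothetical.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-canary
spec:
  parentRefs:
  - name: example-gateway
  rules:
  - backendRefs:
    - name: model-v1
      port: 8080
      weight: 90
    - name: model-v2   # canary
      port: 8080
      weight: 10
```

Shadowing (mirroring requests without serving their responses) is a separate Gateway API feature with narrower implementation support than weighted routing.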
This is the current unit test coverage:
```
✓ api/core/v1alpha1 (64ms)
✓ pkg (5ms)
✓ api/inference/v1alpha1 (43ms)
✓ pkg/controller (33ms)
✓ pkg/cert (78ms)
✓ pkg/controller/inference (28ms)
✓ pkg/webhook (23ms)
✓ ...
```
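For reference, per-package numbers like the ones above can be reproduced with the standard Go toolchain (run from the repo root; whether llmaz's Makefile wraps these commands is an assumption):

```shell
# Run all unit tests and record a coverage profile.
go test ./... -coverprofile=coverage.out

# Summarize coverage per function; the last line is the repo-wide total.
go tool cover -func=coverage.out
```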