
Support autoscaling

Open kerthcet opened this issue 2 years ago • 9 comments

As the service.Spec describes, we have minReplicas and maxReplicas; what we hope to do is adjust the replica count based on traffic, a.k.a. serverless. We could use Ray or KEDA/Knative as alternatives, but here we hope to have a simple implementation so there's no need to depend on other libraries.

As a first step, let's integrate with HPA for autoscaling capabilities.

kerthcet avatar Nov 23 '23 06:11 kerthcet

/milestone v0.0.1

kerthcet avatar Jul 10 '24 02:07 kerthcet

/kind feature

kerthcet avatar Jul 10 '24 02:07 kerthcet

/milestone clear

kerthcet avatar Jul 10 '24 02:07 kerthcet

/priority important-longterm

kerthcet avatar Jul 15 '24 05:07 kerthcet

/milestone v0.2.0

kerthcet avatar Aug 05 '24 03:08 kerthcet

/assign

If the service controller needs to be integrated with HPA, I'm willing to give it a try. Is it related to service.Spec.WorkloadTemplate.Replicas?

googs1025 avatar Sep 24 '24 05:09 googs1025


import autoscalingv2 "k8s.io/api/autoscaling/v2"

type ElasticConfig struct {
	// MinReplicas indicates the minimum number of inference workloads based on the traffic.
	// Defaults to 1, which means the instances can be scaled down to 1 at most.
	// If MinReplicas is set to 0, a serverless component must be installed first.
	// +kubebuilder:default=1
	// +optional
	MinReplicas *int32 `json:"minReplicas,omitempty"`
	// MaxReplicas indicates the maximum number of inference workloads based on the traffic.
	// Nil means there's no limit on the instance number.
	// +optional
	MaxReplicas *int32 `json:"maxReplicas,omitempty"`
	// Metrics contains the specifications used to calculate the desired
	// replica count (the maximum replica count across all metrics will be
	// used). The desired replica count is calculated by multiplying the
	// ratio between the target value and the current value by the current
	// number of pods. Ergo, the metrics used must decrease as the pod count
	// increases, and vice versa. See the individual metric source types for
	// more information about how each type of metric must respond.
	// If not set, the HPA will not be created.
	// +optional
	Metrics []autoscalingv2.MetricSpec `json:"metrics,omitempty"`
}

@kerthcet Should we integrate HPA metrics so that we can also set the required metrics in ElasticConfig?

googs1025 avatar Sep 26 '24 04:09 googs1025
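For context on the Metrics comment above: the scaling rule it describes is the standard HPA formula, desired = ceil(currentReplicas × currentMetricValue / targetMetricValue), clamped to the min/max bounds. A minimal sketch in plain Go (the function name is hypothetical, not llmaz code):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the core HPA formula: scale the current replica
// count by the ratio of the observed metric value to its target, then
// clamp the result into [min, max].
func desiredReplicas(current int32, currentValue, targetValue float64, min, max int32) int32 {
	desired := int32(math.Ceil(float64(current) * currentValue / targetValue))
	if desired < min {
		desired = min
	}
	if desired > max {
		desired = max
	}
	return desired
}

func main() {
	// 3 pods observing 90% of a 60% utilization target -> ceil(3 * 90/60) = 5 pods.
	fmt.Println(desiredReplicas(3, 90, 60, 1, 10))
}
```

This also shows why the comment says metrics must decrease as the pod count increases: otherwise the ratio never converges toward 1 and the autoscaler keeps scaling.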

I will revisit this later, but in my imagination I just don't want to copy the fields from HPA into ElasticConfig. I hope it can work with various systems, like HPA and KEDA, so the fields should be sufficiently abstract.

kerthcet avatar Sep 26 '24 11:09 kerthcet

Indeed. That is, we only need to abstract the fields, and the controller provides a provider-like interface (e.g. HPAProvider) internally, with these features implemented behind it. Right?

googs1025 avatar Sep 26 '24 11:09 googs1025
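The provider seam discussed above could look roughly like this. This is only a sketch of the design idea; ScaleTarget, Provider, EnsureScaler, and hpaProvider are hypothetical names, not existing llmaz API:

```go
package main

import "fmt"

// ScaleTarget carries only the abstract fields from ElasticConfig,
// decoupled from any concrete autoscaler's API types.
type ScaleTarget struct {
	Name        string
	MinReplicas int32
	MaxReplicas int32
}

// Provider is the seam behind which each autoscaling backend
// (HPA, KEDA, Knative, ...) hides its own objects.
type Provider interface {
	// EnsureScaler creates or updates the backend-specific scaler
	// object for the given target.
	EnsureScaler(t ScaleTarget) error
}

// hpaProvider would translate a ScaleTarget into an
// autoscalingv2.HorizontalPodAutoscaler in a real controller;
// here it only records the call to keep the sketch self-contained.
type hpaProvider struct{ ensured []string }

func (p *hpaProvider) EnsureScaler(t ScaleTarget) error {
	p.ensured = append(p.ensured, t.Name)
	return nil
}

func main() {
	var p Provider = &hpaProvider{}
	if err := p.EnsureScaler(ScaleTarget{Name: "llama-2", MinReplicas: 1, MaxReplicas: 8}); err != nil {
		panic(err)
	}
	fmt.Println("ok")
}
```

With this split, swapping HPA for KEDA later would mean adding another Provider implementation without touching the ElasticConfig API.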

Some related metrics:

  • vllm: https://github.com/vllm-project/vllm/issues/5041
  • TGI: https://github.com/huggingface/text-generation-inference/issues/1977

kerthcet avatar Oct 30 '24 02:10 kerthcet

@googs1025 would you like to implement the HPA integration as our first step? I think we have to align with lws right now, which only supports HPA.

But let's not use the autoscalingv2 library directly; let's build our own structure instead. And we can break this into 2 PRs: one for the API and another for the implementation. Tell us if you're interested, thanks anyway.

kerthcet avatar Dec 23 '24 09:12 kerthcet
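Building "our own structure" instead of embedding autoscalingv2.MetricSpec might look something like the sketch below. MetricTarget and its fields are purely hypothetical illustrations of an abstract metric type, not llmaz API; the metric name uses vLLM's real `vllm:num_requests_waiting` gauge as an example:

```go
package main

import "fmt"

// MetricTarget is a sketch of an llmaz-owned metric type that mirrors
// only the fields the controller needs, so it can be translated to
// HPA, KEDA, or another backend by a provider.
type MetricTarget struct {
	// Name of the metric, e.g. a vLLM/TGI queue-length metric.
	Name string `json:"name"`
	// TargetValue the autoscaler should drive the metric toward.
	TargetValue string `json:"targetValue"`
}

// ElasticConfig here is the abstract variant: note it no longer
// imports any autoscaling library.
type ElasticConfig struct {
	MinReplicas *int32         `json:"minReplicas,omitempty"`
	MaxReplicas *int32         `json:"maxReplicas,omitempty"`
	Metrics     []MetricTarget `json:"metrics,omitempty"`
}

func main() {
	min, max := int32(1), int32(10)
	cfg := ElasticConfig{
		MinReplicas: &min,
		MaxReplicas: &max,
		Metrics: []MetricTarget{
			{Name: "vllm:num_requests_waiting", TargetValue: "5"},
		},
	}
	fmt.Println(len(cfg.Metrics))
}
```

The trade-off is the one the thread already raises: fewer fields than autoscalingv2.MetricSpec means less HPA expressiveness, but a surface that other backends can also satisfy.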

I'll take a look at it over this weekend and put down some thoughts.

googs1025 avatar Dec 24 '24 03:12 googs1025

Thanks, it's a really important feature for us.

kerthcet avatar Dec 24 '24 07:12 kerthcet

/assign

Taking it over, targeting milestone v0.1.0.

/milestone v0.1.0

kerthcet avatar Jan 21 '25 07:01 kerthcet