Feature request: Auto scale support
The total available resources in a cluster may change over time, so it would be nice to support auto-scaling for users who want their training jobs to use as many resources as possible.
Common cases would be:

- Create an et-job with `--np`, `--min-np`, and `--max-np`; `--np` tends to be small since the launcher won't start unless at least `np` workers are ready.
- Auto scale-out whenever extra workers can be created, until `--max-np` is reached.
- [optional] Auto scale-in if some workers are preempted or fail unexpectedly.
If this is not the default behavior, it could be controlled by something like `scalePolicy: auto`.
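As a rough sketch of what the API could look like, here are hypothetical Go types for the TrainingJob spec; the field and constant names (`ElasticSpec`, `ScalePolicy`, `MinReplicas`, `MaxReplicas`) are illustrative only, not the actual et-operator API:

```go
// Hypothetical additions to the TrainingJob spec for auto-scaling.
// Names are illustrative, not the actual et-operator API.
package v1alpha1

// ScalePolicy controls whether the operator adjusts the worker count on its own.
type ScalePolicy string

const (
	// ScalePolicyFixed keeps the worker count at the initial Replicas (--np).
	ScalePolicyFixed ScalePolicy = "fixed"
	// ScalePolicyAuto lets the operator scale workers between MinReplicas
	// and MaxReplicas as cluster capacity changes.
	ScalePolicyAuto ScalePolicy = "auto"
)

// ElasticSpec mirrors the --np / --min-np / --max-np launcher flags.
type ElasticSpec struct {
	// Replicas is the initial worker count (--np); the launcher only
	// starts once at least this many workers are ready.
	Replicas int32 `json:"replicas"`
	// MinReplicas (--min-np) is the lower bound when scaling in.
	MinReplicas int32 `json:"minReplicas"`
	// MaxReplicas (--max-np) is the upper bound when scaling out.
	MaxReplicas int32 `json:"maxReplicas"`
	// ScalePolicy defaults to "fixed"; "auto" enables auto scale-out/in.
	ScalePolicy ScalePolicy `json:"scalePolicy,omitempty"`
}
```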
We can support scaling in the TrainingJob when the job's workers are preempted or fail.
But for auto scale-up, the main problem is how et-operator knows when to scale up and which job can be scaled up.
In my mind, there are 2 approaches:

- `et-operator` watches the cluster resources, simulates scheduling, and then decides whether or not to scale up a job, like `cluster-autoscaler` does.
- `et-operator` just creates the target worker pods and lets the Kubernetes scheduler deal with the pending pods. `et-operator` needs to start the launcher pod once `--min-np` pods are running.
For now, I prefer the second way. Maybe in the future we can add a third-party component whose job is to monitor spot instance prices / cluster resources / TrainingJob GPU usage and trigger TrainingJob scale-up or scale-in.
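A minimal sketch of the decision logic for that second approach, assuming the operator is handed the current worker pods plus the `--min-np`/`--max-np` values; the `decide` function and `scaleDecision` type are made up for illustration, and the actual pod/launcher creation is left to the operator's client code:

```go
// Sketch of approach 2: always request workers up to --max-np (surplus pods
// simply stay Pending with the kube-scheduler) and allow the launcher pod
// once at least --min-np workers are Running.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// scaleDecision is what the reconcile loop would act on.
type scaleDecision struct {
	createWorkers int  // how many extra worker pods to request
	startLauncher bool // whether the launcher pod may be created now
}

// decide counts Running workers and compares the current pod count
// against the --max-np upper bound.
func decide(workers []corev1.Pod, minNP, maxNP int) scaleDecision {
	running := 0
	for _, p := range workers {
		if p.Status.Phase == corev1.PodRunning {
			running++
		}
	}
	create := maxNP - len(workers)
	if create < 0 {
		create = 0
	}
	return scaleDecision{createWorkers: create, startLauncher: running >= minNP}
}

func main() {
	pods := []corev1.Pod{
		{Status: corev1.PodStatus{Phase: corev1.PodRunning}},
		{Status: corev1.PodStatus{Phase: corev1.PodPending}},
	}
	fmt.Printf("%+v\n", decide(pods, 1, 4)) // {createWorkers:2 startLauncher:true}
}
```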
Yeah, agreed. That should be enough for now, and if we want to give users more control over the auto-scale part, we can add a scale interval so that jobs are not restarted too frequently.
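One possible shape for that scale interval, as a sketch only (the `allowRescale` helper and where the last scale time is recorded are assumptions, not an existing et-operator API): the operator remembers when it last rescaled the job and skips any rescale attempted before the interval has elapsed.

```go
// Sketch of a scale-interval guard: skip a rescale if the previous one
// happened less than scaleInterval ago, so jobs are not restarted too often.
package main

import (
	"fmt"
	"time"
)

// allowRescale reports whether enough time has passed since the last
// scale operation to allow another one.
func allowRescale(lastScaleTime time.Time, scaleInterval time.Duration, now time.Time) bool {
	return lastScaleTime.IsZero() || now.Sub(lastScaleTime) >= scaleInterval
}

func main() {
	last := time.Now().Add(-2 * time.Minute)
	fmt.Println(allowRescale(last, 5*time.Minute, time.Now())) // false: too soon
	fmt.Println(allowRescale(last, 1*time.Minute, time.Now())) // true
}
```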