Feature request: Auto scale support
The total available resources in a cluster may change over time, so it would be nice to support auto-scaling for users who want their training jobs to use as many resources as possible.
Common cases would be:

- Create an et-job with `--np`, `--min-np`, and `--max-np`; `--np` tends to be small since the launcher won't start unless at least `np` workers are ready.
- Auto scale-out whenever extra workers can be created, until `--max-np` is reached.
- [optional] Auto scale-in if some workers are preempted or fail unexpectedly.
If this is not the default behavior, it could be controlled by something like `scalePolicy: auto`.
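As a rough sketch of what the API could look like, here are hypothetical Go types for the TrainingJob spec; the field and constant names (`ElasticSpec`, `ScalePolicy`, `MinReplicas`, `MaxReplicas`) are illustrative only, not the actual et-operator API:

```go
// Hypothetical additions to the TrainingJob spec for auto-scaling.
// Names are illustrative, not the actual et-operator API.
package v1alpha1

// ScalePolicy controls whether the operator adjusts the worker count on its own.
type ScalePolicy string

const (
	// ScalePolicyFixed keeps the worker count at the initial Replicas (--np).
	ScalePolicyFixed ScalePolicy = "fixed"
	// ScalePolicyAuto lets the operator scale workers between MinReplicas
	// and MaxReplicas as cluster capacity changes.
	ScalePolicyAuto ScalePolicy = "auto"
)

// ElasticSpec mirrors the --np / --min-np / --max-np launcher flags.
type ElasticSpec struct {
	// Replicas is the initial worker count (--np); the launcher only
	// starts once at least this many workers are ready.
	Replicas int32 `json:"replicas"`
	// MinReplicas (--min-np) is the lower bound when scaling in.
	MinReplicas int32 `json:"minReplicas"`
	// MaxReplicas (--max-np) is the upper bound when scaling out.
	MaxReplicas int32 `json:"maxReplicas"`
	// ScalePolicy defaults to "fixed"; "auto" enables auto scale-out/in.
	ScalePolicy ScalePolicy `json:"scalePolicy,omitempty"`
}
```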
We can support scaling in the TrainingJob when the job's workers are preempted or fail.
But for auto scale-up, the main problem is how et-operator knows when to scale up and which job can be scaled up.
In my mind, there are 2 approaches:

- `et-operator` watches the cluster resources, simulates scheduling, and then decides whether or not to scale up a job, like `cluster-autoscaler` does.
- `et-operator` just creates the target worker pods and lets the Kubernetes scheduler deal with the pending pods. `et-operator` needs to start the launcher pod once `--min-np` pods are running.
For now, I prefer the second way. Maybe in the future we can add a third-party component whose job is to monitor spot instance prices / cluster resources / TrainingJob GPU usage and trigger TrainingJob scale-up or scale-in.
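A minimal sketch of the decision logic for that second approach, assuming the operator is handed the current worker pods plus the `--min-np`/`--max-np` values; the `decide` function and `scaleDecision` type are made up for illustration, and the actual pod/launcher creation is left to the operator's client code:

```go
// Sketch of approach 2: always request workers up to --max-np (surplus pods
// simply stay Pending with the kube-scheduler) and allow the launcher pod
// once at least --min-np workers are Running.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// scaleDecision is what the reconcile loop would act on.
type scaleDecision struct {
	createWorkers int  // how many extra worker pods to request
	startLauncher bool // whether the launcher pod may be created now
}

// decide counts Running workers and compares the current pod count
// against the --max-np upper bound.
func decide(workers []corev1.Pod, minNP, maxNP int) scaleDecision {
	running := 0
	for _, p := range workers {
		if p.Status.Phase == corev1.PodRunning {
			running++
		}
	}
	create := maxNP - len(workers)
	if create < 0 {
		create = 0
	}
	return scaleDecision{createWorkers: create, startLauncher: running >= minNP}
}

func main() {
	pods := []corev1.Pod{
		{Status: corev1.PodStatus{Phase: corev1.PodRunning}},
		{Status: corev1.PodStatus{Phase: corev1.PodPending}},
	}
	fmt.Printf("%+v\n", decide(pods, 1, 4)) // {createWorkers:2 startLauncher:true}
}
```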
Yeah, agreed. That should be enough for now, and if we want to give users more control over the auto-scale part, we can add a scale interval so that jobs are not restarted too frequently.
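One possible shape for that scale interval, as a sketch only (the `allowRescale` helper and where the last scale time is recorded are assumptions, not an existing et-operator API): the operator remembers when it last rescaled the job and skips any rescale attempted before the interval has elapsed.

```go
// Sketch of a scale-interval guard: skip a rescale if the previous one
// happened less than scaleInterval ago, so jobs are not restarted too often.
package main

import (
	"fmt"
	"time"
)

// allowRescale reports whether enough time has passed since the last
// scale operation to allow another one.
func allowRescale(lastScaleTime time.Time, scaleInterval time.Duration, now time.Time) bool {
	return lastScaleTime.IsZero() || now.Sub(lastScaleTime) >= scaleInterval
}

func main() {
	last := time.Now().Add(-2 * time.Minute)
	fmt.Println(allowRescale(last, 5*time.Minute, time.Now())) // false: too soon
	fmt.Println(allowRescale(last, 1*time.Minute, time.Now())) // true
}
```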