Any plan to support gang scheduling?

Open xiaogaozi opened this issue 6 years ago • 3 comments

Gang scheduling is a very important feature for machine learning jobs, especially on a shared cluster. Currently, ElasticDL just creates workers and parameter servers immediately. Is there any plan to support gang scheduling, or to integrate with another K8s scheduler that already supports this feature (e.g. Volcano)?

xiaogaozi avatar Jan 10 '20 08:01 xiaogaozi

@xiaogaozi Thank you for the reminder. I will take a look at gang scheduling and give you feedback soon.

QiJune avatar Jan 13 '20 00:01 QiJune

We are using priority-based preemption for all pods. PS pods have higher priorities and workers have lower priorities. As long as there is at least one worker pod, the training can continue.
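
To make the idea concrete, here is a rough toy model of priority-based preemption in Python. It is only an illustration, not ElasticDL or Kubernetes scheduler code: the `Pod` class, the `schedule` and `can_train` helpers, the priority values, and the CPU numbers are all made up for the example, and it assumes a single shared CPU pool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    role: str      # "ps" or "worker" (hypothetical labels for this sketch)
    priority: int  # higher value wins when resources are short
    cpu: int       # CPU units requested


def schedule(running, incoming, capacity):
    """Try to place `incoming` on a cluster with `capacity` CPU units.

    Evicts strictly lower-priority pods, lowest priority first, until the
    incoming pod fits. Returns (new_running, evicted). If the pod cannot
    fit even after evicting every candidate, nothing changes.
    """
    free = capacity - sum(p.cpu for p in running)
    if incoming.cpu <= free:
        return running + [incoming], []

    victims = sorted(
        (p for p in running if p.priority < incoming.priority),
        key=lambda p: p.priority,
    )
    evicted = []
    for victim in victims:
        if incoming.cpu <= free:
            break
        evicted.append(victim)
        free += victim.cpu

    if incoming.cpu > free:
        return list(running), []  # still does not fit: evict nothing

    kept = [p for p in running if p not in evicted]
    return kept + [incoming], evicted


def can_train(running):
    """Training continues as long as at least one worker pod is running."""
    return any(p.role == "worker" for p in running)


workers = [Pod(f"worker-{i}", "worker", priority=1, cpu=2) for i in range(3)]
ps = Pod("ps-0", "ps", priority=10, cpu=4)

# 6 of 8 CPU units are used by workers, so the PS pod preempts one worker.
running, evicted = schedule(workers, ps, capacity=8)
print([p.name for p in evicted], can_train(running))
```

In this toy run the PS pod pushes out one worker and the remaining two workers keep the job alive; a later worker of equal priority would simply fail to schedule rather than evict anyone.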

skydoorkai avatar Jan 13 '20 06:01 skydoorkai

How do you ensure there are enough resources to run all PS pods? Could ElasticDL scale PS pods dynamically?

xiaogaozi avatar Jan 13 '20 06:01 xiaogaozi