Any plan to support gang scheduling?
Gang scheduling is a very important feature for machine learning jobs, especially on a shared cluster. Currently, ElasticDL just creates workers and parameter servers immediately. Is there any plan to support gang scheduling, or to integrate with another K8s scheduler that already supports this feature (e.g. Volcano)?
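To make the request concrete, here is a minimal sketch of the all-or-nothing admission rule that gang scheduling implies. It is a toy model, not Volcano's or ElasticDL's code: `PodRequest`, `gang_admit`, and the single CPU dimension are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PodRequest:
    """Hypothetical pod description: a name and a CPU request in cores."""
    name: str
    cpu: float


def gang_admit(pods: List[PodRequest], free_cpu: float) -> List[str]:
    """Admit every pod of the job, or none of them.

    Returns the names of the admitted pods. The list is empty when the
    job as a whole does not fit into the free capacity.
    """
    total = sum(p.cpu for p in pods)
    if total > free_cpu:
        # Partial placement is refused: no pod starts and holds resources
        # while waiting for the rest of the job.
        return []
    return [p.name for p in pods]


job = [PodRequest("ps-0", 2), PodRequest("worker-0", 4), PodRequest("worker-1", 4)]
print(gang_admit(job, free_cpu=8))   # job needs 10 cores, so nothing is admitted
print(gang_admit(job, free_cpu=10))  # the whole job fits and is admitted together
```

Without such a rule, two jobs on a shared cluster can each get part of their pods scheduled and then block each other indefinitely.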
@xiaogaozi Thank you for raising this. I will look into gang scheduling and get back to you soon.
We use priority-based preemption for all pods: PS pods have a higher priority and worker pods have a lower priority. As long as at least one worker pod is running, training can continue.
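As a rough illustration of the idea, the sketch below models priority-based preemption on a single node with one CPU dimension. It is not the Kubernetes scheduler or ElasticDL's implementation; `Pod`, `schedule_with_preemption`, and the priority values are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Pod:
    """Hypothetical pod: name, integer priority (higher wins), CPU request."""
    name: str
    priority: int
    cpu: float


def schedule_with_preemption(
    running: List[Pod], incoming: Pod, capacity: float
) -> Tuple[List[Pod], List[Pod], bool]:
    """Try to place `incoming`, evicting lower-priority pods if needed.

    Returns (pods running afterwards, evicted pods, whether placement succeeded).
    """
    free = capacity - sum(p.cpu for p in running)
    # Only pods with strictly lower priority may be evicted, lowest first.
    victims = sorted(
        (p for p in running if p.priority < incoming.priority),
        key=lambda p: p.priority,
    )
    evicted: List[Pod] = []
    for victim in victims:
        if free >= incoming.cpu:
            break
        evicted.append(victim)
        free += victim.cpu
    if free < incoming.cpu:
        # Even evicting every lower-priority pod is not enough: change nothing.
        return running, [], False
    evicted_names = {p.name for p in evicted}
    kept = [p for p in running if p.name not in evicted_names]
    return kept + [incoming], evicted, True


# Two workers fill an 8-core node; a PS pod with higher priority arrives.
workers = [Pod("worker-0", 1, 4), Pod("worker-1", 1, 4)]
ps = Pod("ps-0", 10, 2)
after, evicted, ok = schedule_with_preemption(workers, ps, capacity=8)
print([p.name for p in after], [p.name for p in evicted], ok)
```

In this toy run only one worker is evicted to make room for the PS pod, so one worker stays up and training can continue at reduced throughput.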
How do you ensure there are enough resources to run all PS pods? Could ElasticDL scale PS pods dynamically?