kuberay
[Bug] Enable control-plane scaling
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
Others
What happened + What you expected to happen
Currently, when the head node pod is bounced, the entire state of the Ray cluster is lost: running jobs end and their logs are lost.
This could be fixed by scaling the head node horizontally:
- Running multiple head node pods would add redundancy and improve the reliability of the Ray cluster.
- The Ray cluster could recover if a head node pod is bounced.
I believe this would be a huge enhancement to KubeRay.
Reproduction script
Bounce the control plane (the head node pod) and the entire cluster "restarts" with a new state.
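A minimal sketch of how the bounce could be scripted, for anyone who wants to reproduce this. It assumes `kubectl` is configured for the cluster and that the head pod carries the `ray.io/cluster` and `ray.io/node-type=head` labels; the function names, the cluster name, and the namespace are placeholders of mine, not part of KubeRay.

```python
import subprocess


def head_pod_delete_cmd(cluster_name, namespace="default"):
    """Build the kubectl argv that deletes the head pod of one RayCluster."""
    # Assumption: head pods are labelled ray.io/cluster=<name> and
    # ray.io/node-type=head. Adjust the selector if your labels differ.
    selector = f"ray.io/cluster={cluster_name},ray.io/node-type=head"
    return ["kubectl", "delete", "pod", "-n", namespace, "-l", selector]


def bounce_head_pod(cluster_name, namespace="default"):
    """Delete the head pod so it gets recreated (needs live cluster access)."""
    subprocess.run(head_pod_delete_cmd(cluster_name, namespace), check=True)


# Example (placeholder cluster name, not run automatically):
# bounce_head_pod("raycluster-sample")
```

After the head pod comes back, compare the job list and logs with what was there before the bounce.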
Anything else
I am unsure whether Ray's current architecture can handle dynamic scaling of the control plane, but I believe statically scaling to 1-3 head nodes would be an option.
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!