armada
Improve node discovery for distributed ML Armada jobs
Node discovery currently works in one of two ways: nodes either coordinate through a distributed filesystem, or wait until the jobs are running and then use the K8s API to check which node is running a particular head node script.
A supported way of launching a job set and labelling its worker and head nodes, so that they can be easily discovered from within the jobs themselves, would be very useful for supporting distributed ML on Armada.
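
To make the request concrete, here is a rough sketch of what label-based discovery could look like from inside a job. It is only an illustration under assumptions: the label keys `example.com/job-set` and `example.com/role` and the helpers `label_selector` and `find_head_ip` are hypothetical and are not something Armada provides today. The pod dictionaries follow the shape of pod objects returned by the K8s API as JSON (`metadata.labels`, `status.phase`, `status.podIP`).

```python
from typing import Any, Dict, List, Optional

# Hypothetical label keys, assumed for illustration only.
# Armada does not define these labels today.
JOB_SET_LABEL = "example.com/job-set"
ROLE_LABEL = "example.com/role"


def label_selector(job_set: str, role: str) -> str:
    """Build a K8s label selector string for one role within a job set."""
    return f"{JOB_SET_LABEL}={job_set},{ROLE_LABEL}={role}"


def find_head_ip(pods: List[Dict[str, Any]], job_set: str) -> Optional[str]:
    """Return the IP of the running head pod for job_set, or None if not found.

    `pods` is a list of pod objects in the JSON shape returned by the K8s API.
    """
    for pod in pods:
        labels = pod.get("metadata", {}).get("labels") or {}
        status = pod.get("status", {})
        if (
            labels.get(JOB_SET_LABEL) == job_set
            and labels.get(ROLE_LABEL) == "head"
            and status.get("phase") == "Running"
            and status.get("podIP")
        ):
            return status["podIP"]
    return None


# Example input: one head pod and one worker pod from the same job set.
example_pods = [
    {
        "metadata": {
            "name": "train-1-worker-0",
            "labels": {JOB_SET_LABEL: "train-1", ROLE_LABEL: "worker"},
        },
        "status": {"phase": "Running", "podIP": "10.0.0.6"},
    },
    {
        "metadata": {
            "name": "train-1-head",
            "labels": {JOB_SET_LABEL: "train-1", ROLE_LABEL: "head"},
        },
        "status": {"phase": "Running", "podIP": "10.0.0.5"},
    },
]

head_ip = find_head_ip(example_pods, "train-1")
```

With labels like these, a worker would only need to list pods matching `label_selector(job_set, "head")` rather than inspecting which node is running a particular script.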