
Improve node discovery for distributed ML Armada jobs

Open dannyfriar opened this issue 3 years ago • 0 comments

Node discovery is currently implemented in one of two ways: either nodes coordinate via a distributed filesystem, or they wait until the jobs are running and then use the K8s API to check which node is running a particular head node script.
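For illustration, the filesystem-based workaround might look roughly like the sketch below. This is only a minimal sketch, not the code actually in use: the function names, the `<job_set_id>.head.json` file layout, and the default port are all hypothetical, and it assumes every node mounts the same shared directory and runs as the same user.

```python
import json
import os
import socket
import tempfile
import time
from pathlib import Path


def publish_head_address(rendezvous_dir, job_set_id, host=None, port=29500):
    """Head node: write its address to the shared filesystem (hypothetical layout)."""
    host = host or socket.gethostname()
    target = Path(rendezvous_dir) / f"{job_set_id}.head.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename, so that
    # workers never observe a partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=target.parent)
    with os.fdopen(fd, "w") as f:
        json.dump({"host": host, "port": port}, f)
    os.replace(tmp_path, target)
    return target


def wait_for_head_address(rendezvous_dir, job_set_id, timeout_s=300.0, poll_s=1.0):
    """Worker node: poll the shared directory until the head's address file appears."""
    target = Path(rendezvous_dir) / f"{job_set_id}.head.json"
    deadline = time.monotonic() + timeout_s
    while True:
        if target.exists():
            info = json.loads(target.read_text())
            return info["host"], info["port"]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no head address at {target} after {timeout_s}s")
        time.sleep(poll_s)
```

The head job would call `publish_head_address` once it is up, and each worker would block in `wait_for_head_address` before initialising its distributed backend. The polling loop and the dependency on a shared mount are the parts a supported discovery mechanism would remove.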

A supported way of launching a job set and labelling worker and head nodes, so that they can be easily discovered from within the jobs themselves, would be very useful for supporting distributed ML on Armada.
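As a rough sketch of what label-based discovery could look like from inside a job: the label keys `example/job-set` and `example/role` below are made up for illustration and are not existing Armada labels, and the pod records are plain dicts standing in for whatever the K8s API returns.

```python
def find_head_address(pods, job_set_id):
    """Return the IP of the running head pod for a job set, or None.

    `pods` is a list of dicts with "labels", "phase" and "pod_ip" keys.
    The label keys used here are hypothetical placeholders.
    """
    for pod in pods:
        labels = pod.get("labels", {})
        if (
            labels.get("example/job-set") == job_set_id
            and labels.get("example/role") == "head"
            and pod.get("phase") == "Running"
        ):
            return pod.get("pod_ip")
    return None


# Example input: one head and one worker in the same job set.
pods = [
    {"labels": {"example/job-set": "train-1", "example/role": "worker"},
     "phase": "Running", "pod_ip": "10.0.0.7"},
    {"labels": {"example/job-set": "train-1", "example/role": "head"},
     "phase": "Running", "pod_ip": "10.0.0.5"},
]
head_ip = find_head_address(pods, "train-1")
```

With labels like these set at submission time, the same filtering could presumably be pushed into the K8s API call itself via a label selector, rather than matching on which script a pod is running.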

┆Issue is synchronized with this Jira Task by Unito

dannyfriar, Jul 13 '22 08:07