
feat(framework): Support JAX

Open gaocegege opened this issue 3 years ago • 30 comments

JAX has become extremely popular recently. Users may expect to run distributed JAX training jobs on Kubernetes with the help of the training-operator.

JAX uses a “multi-controller” programming model where each JAX Python process runs independently, sometimes referred to as a Single Program, Multiple Data (SPMD) model. I think it is not hard to support from the operator's perspective.

Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

gaocegege avatar Jun 22 '22 07:06 gaocegege

Are there any solid examples showing how a multi-host JAX job runs? Especially the host registration part.

zw0610 avatar Jun 23 '22 01:06 zw0610

Well, to launch distributed training with JAX, the jax.distributed.initialize API should be used. One way to implement this: the training operator provides the relevant environment variables to each container, and the user script consumes them as below (src).

import os

import jax

# The operator injects these variables into each container.
coordinator_address = os.environ.get('JAX_COORDINATOR_ADDRESS', None)  # coordinator host:port
num_processes = int(os.environ.get('JAX_NUM_PROCESSES', 1))  # world size
process_id = int(os.environ.get('JAX_PROCESS_ID', 0))  # rank

jax.distributed.initialize(coordinator_address=coordinator_address,
                           num_processes=num_processes,
                           process_id=process_id)
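For illustration, the operator side could compute these variables per replica roughly as follows. This is a hypothetical sketch: the function name and the headless-Service DNS pattern are assumptions, not actual training-operator code.

```python
# Hypothetical helper showing how a controller might derive the JAX env
# vars for each replica of a JaxJob. Replica 0 acts as the coordinator,
# reachable via a headless Service that matches the job's pods.

def jax_env_for_replica(job_name: str, namespace: str,
                        num_replicas: int, replica_index: int,
                        port: int = 6666) -> dict:
    """Return the env vars for one replica of a hypothetical JaxJob."""
    coordinator_host = f"{job_name}-0.{job_name}.{namespace}.svc"
    return {
        "JAX_COORDINATOR_ADDRESS": f"{coordinator_host}:{port}",
        "JAX_NUM_PROCESSES": str(num_replicas),
        "JAX_PROCESS_ID": str(replica_index),
    }
```

Each replica's pod spec would then carry these values, and the user script above picks them up unchanged.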

Anyway, I'm not aware of any mature practice of this in production.

kuizhiqing avatar Jun 23 '22 05:06 kuizhiqing

/help

andreyvelich avatar Jul 20 '23 16:07 andreyvelich

@andreyvelich: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Jul 20 '23 16:07 google-oss-prow[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 18 '23 20:10 github-actions[bot]

/lifecycle frozen

tenzen-y avatar Oct 19 '23 16:10 tenzen-y

/assign @Davidnet

For anyone interested in Jax support for Training Operator, please join our AutoML and Training WG Call on November 29th 5:00pm UTC: https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit#heading=h.yvypq06ot57p

We are going to discuss how we can move forward with the Jax support.

andreyvelich avatar Nov 28 '23 17:11 andreyvelich

cc: @mimowo

Michal may be interested in this JAX integration.

tenzen-y avatar Nov 29 '23 21:11 tenzen-y

We'd be interested in supporting JAX as well, and would be happy to contribute developer hours (with mentoring from a qualified Kubeflow maintainer). @sxwl-donggang

yzhao-2023 avatar Dec 22 '23 11:12 yzhao-2023

Thanks for the interest. Happy to help @yzhao-2023

johnugeorge avatar Dec 22 '23 17:12 johnugeorge

That would be great if you could help us with the Jax implementation in Training Operator.

If you are available @yzhao-2023, please attend one of our upcoming AutoML and Training WG calls. We can guide you through Training Operator implementation and how we can add Jax support.

andreyvelich avatar Dec 22 '23 18:12 andreyvelich

That would be great if you could help us with the Jax implementation in Training Operator.

If you are available @yzhao-2023, please attend one of our upcoming AutoML and Training WG calls. We can guide you through Training Operator implementation and how we can add Jax support.

If you prefer 11:00 am UTC, I'd like to be there too.

kuizhiqing avatar Dec 23 '23 15:12 kuizhiqing

Links from the 2023-11-29 Meeting notes:

  • Developer Guide for Training Operator: https://github.com/kubeflow/training-operator/blob/master/docs/development/developer_guide.md
  • Distributed Jax: https://jax.readthedocs.io/en/latest/multi_process.html

In the meeting they mentioned that the documentation for the Training Operator was a bit outdated. Has it been updated? EDIT: I see that the .md file was updated 2 weeks ago so I suppose it is up to date.

jdcfd avatar Feb 11 '24 03:02 jdcfd

Also, if you end up doing a Training Operator Deep dive session, it would be good if you share it here so anyone wanting to contribute can join or watch a recording later.

jdcfd avatar Feb 11 '24 04:02 jdcfd

Hi @andreyvelich, I'm interested in this issue for the upcoming GSoC term. Is there a roadmap doc available, and can you provide some more context or resources I can look at to better understand it?

octonawish-akcodes avatar Feb 21 '24 03:02 octonawish-akcodes

Hi @jdcfd @octonawish-akcodes, thank you for your interest in working on JAX support in the Training Operator!

If you are available, please attend one of the upcoming AutoML and Training WG calls: https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit#heading=h.yvypq06ot57p We will discuss the details of how we can add support for JaxJobs.

andreyvelich avatar Feb 21 '24 12:02 andreyvelich

Sorry I missed it, Wednesdays are very tough for me. I did watch last recording and I will watch today's meeting later this week. Judging by the meeting notes, it seems like the JaxJobs topic wasn't touched this time.

jdcfd avatar Feb 22 '24 06:02 jdcfd

Hi @jdcfd, we briefly discussed JAX support in the recent call: https://youtu.be/rXBCliRugNk We are going to speak more about JAX in the next Training WG community meetings.

/area gsoc

andreyvelich avatar Feb 22 '24 14:02 andreyvelich

I am interested in collaborating on a design proposal for integrating Jax into Training Operator.

sandipanpanda avatar Mar 10 '24 19:03 sandipanpanda

Why not just use the Job or JobSet API, what is missing?

ahg-g avatar Mar 28 '24 21:03 ahg-g

Why not just use the Job or JobSet API, what is missing?

Do you mean to ask why we don't recommend using Job or JobSet instead of the Training Operator?

tenzen-y avatar Mar 28 '24 22:03 tenzen-y

Yes, for the JAX case, I think an Indexed Job will just work, and Job ships with any k8s cluster, so you don't need to install any extra operators. For more advanced setups, like multi-slice TPUs, JobSet works well, and it is easy to transition from Job to JobSet.
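A minimal sketch of that idea, assuming the training script reads the JOB_COMPLETION_INDEX variable (which Kubernetes injects into Indexed Job pods) as its process id, and a headless Service named jax-train gives pod 0 a stable DNS name. All names here are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: jax-train               # illustrative name
spec:
  completionMode: Indexed       # each pod gets a stable completion index
  completions: 4                # num_processes
  parallelism: 4
  template:
    spec:
      subdomain: jax-train      # must match a headless Service selecting these pods
      restartPolicy: Never
      containers:
      - name: jax
        image: my-jax-image     # illustrative image
        env:
        # In Indexed mode pod 0's hostname is jax-train-0, reachable
        # through the headless Service as jax-train-0.jax-train.
        - name: JAX_COORDINATOR_ADDRESS
          value: "jax-train-0.jax-train:6666"
        - name: JAX_NUM_PROCESSES
          value: "4"
        # The completion index is exposed as a pod annotation; surface
        # it under the name the user script expects.
        - name: JAX_PROCESS_ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
```

Combined with the snippet earlier in the thread, each pod would then call jax.distributed.initialize with its own index.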

ahg-g avatar Mar 28 '24 22:03 ahg-g

@ahg-g As we discussed recently, we should understand whether JobSet can cover all use-cases for JAX and other ML frameworks. I remember @tenzen-y was previously working on adding SuccessPolicy support to the Job API, so we can re-use Job in the Training Operator.

Also, we should understand the following:

  • Does JAX support any specific distributed training capabilities that require orchestrating additional Kubernetes resources (like the MPI-Operator does)?
  • Do we need resource statuses that are exclusive to JaxJob and not present for other Jobs (e.g. PyTorchJob)?

To be clear, I am not against using JobSet as the final entity for distributed ML training on Kubernetes and deprecating framework-specific CRs, but we need to discuss the pros and cons.

Moreover, when @Jeffwan and @zw0610 designed the unified Training Operator, they proposed the idea of a common CR where a Frontend Operator manages framework-specific resources and a Role Operator manages common resources. At that time (2021) we didn't have JobSet yet. In that case, the flow looks like this:

JaxJob -> JobSet -> Job -> Pod
PyTorchJob -> JobSet -> Job -> Pod

Let's collaborate together in the upcoming WG Batch and Kubeflow WG Training community calls to discuss our next steps.

cc @bigsur0

andreyvelich avatar Mar 28 '24 22:03 andreyvelich

Yes, for the JAX case, I think an Indexed Job will just work, and Job ships with any k8s cluster, so you don't need to install any extra operators. For more advanced setups, like multi-slice TPUs, JobSet works well, and it is easy to transition from Job to JobSet.

@ahg-g I think that a Kubeflow JaxJob has some advantages: 1. It can use the same CRD shape as other frameworks like PyTorchJob and TFJob; 2. No need to set up env vars and Services; 3. It is possible to use the higher-level Python SDK.

Indeed, some developers prefer to use plain Job and JobSet for extensibility, but I believe other developers prefer a more abstract API.

So, I believe that both approaches are valuable.

tenzen-y avatar Mar 28 '24 23:03 tenzen-y

@ahg-g As we discussed recently, we should understand whether JobSet can cover all use-cases for JAX and other ML frameworks. I remember @tenzen-y was previously working on adding SuccessPolicy support to the Job API, so we can re-use Job in the Training Operator.

Also, we should understand the following:

  • Does JAX support any specific distributed training capabilities that require orchestrating additional Kubernetes resources (like the MPI-Operator does)?
  • Do we need resource statuses that are exclusive to JaxJob and not present for other Jobs (e.g. PyTorchJob)?

To be clear, I am not against using JobSet as the final entity for distributed ML training on Kubernetes and deprecating framework-specific CRs, but we need to discuss the pros and cons.

Moreover, when @Jeffwan and @kuizhiqing designed the unified Training Operator, they proposed the idea of a common CR where a Frontend Operator manages framework-specific resources and a Role Operator manages common resources. At that time (2021) we didn't have JobSet yet. In that case, the flow looks like this:

JaxJob -> JobSet -> Job -> Pod
PyTorchJob -> JobSet -> Job -> Pod

Let's collaborate together in the upcoming WG Batch and Kubeflow WG Training community calls to discuss our next steps.

cc @bigsur0

I totally agree with @andreyvelich.

tenzen-y avatar Mar 28 '24 23:03 tenzen-y

Thanks @tenzen-y and @andreyvelich.

My worry is that adding another API on top means another operator, and thus more sources of error and additional operational overhead.

The points related to automating the configuration (env vars, ConfigMaps, etc.) are valid, and we are thinking about solutions for this in JobSet. One idea is JobSet "extensions": imagine that the JobSet API includes an opaque class parameter of type Object that represents the specific training job you want to run, and we introduce hooks in the JobSet operator and webhook to act on it.

kind: JobSet
spec:
  class:
    kind: MPI
    ...

The MPI extension within JobSet would know how to parse this class and populate the JobSet with all things MPI. This is just a rough idea; the devil is in the details, as usual :)
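The hook idea could be sketched, purely illustratively and in Python rather than the real JobSet codebase, as a registry of workload-specific functions keyed by the class kind. Every name below is hypothetical.

```python
# Illustrative sketch of the "class extension" idea: the JobSet webhook
# and reconciler look up a hook by spec.class.kind and let it populate
# the rest of the JobSet spec. None of this is real JobSet code.

HOOKS = {}

def register(kind):
    """Decorator that registers a hook for a given class kind."""
    def wrap(fn):
        HOOKS[kind] = fn
        return fn
    return wrap

@register("MPI")
def populate_mpi(spec: dict) -> dict:
    # A real MPI extension would add launcher/worker replicated jobs,
    # SSH keys, hostfiles, etc. Here we only mark the spec.
    spec.setdefault("replicatedJobs", []).append({"name": "launcher"})
    return spec

def apply_class_hook(jobset: dict) -> dict:
    """Invoke the workload-specific hook, if one is registered."""
    spec = jobset.get("spec", {})
    kind = spec.get("class", {}).get("kind")
    hook = HOOKS.get(kind)
    return hook(spec) if hook else spec
```

The open questions in the following comments (who orchestrates extra resources, what happens at runtime) are exactly about where such hooks would live and who ships them.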

ahg-g avatar Mar 28 '24 23:03 ahg-g

that represents the specific training job you want to run, and we introduce hooks in JobSet operator and webhook to actuate on it.

@ahg-g In that case, will the mutating webhook be responsible for orchestrating additional Kubernetes resources for the Job (e.g. ConfigMap, RBAC)? How are we going to handle orchestration that needs to happen during the Job runtime? For example, fetching the appropriate status, or SSHing into the pod in the case of MPIJob?

andreyvelich avatar Apr 02 '24 14:04 andreyvelich

I meant hooks in the general sense, as in places where we invoke the workload-specific function; that would be one in the webhook and one in the reconciler.

ahg-g avatar Apr 03 '24 06:04 ahg-g

I meant hooks in the general sense, as in places where we invoke the workload-specific function; that would be one in the webhook and one in the reconciler.

In that case, users would have to take the JobSet controller and rebuild the image for the reconciler to support such execution, right? Or will we contribute such extensions upstream?

andreyvelich avatar Apr 03 '24 15:04 andreyvelich

/assign @sandipanpanda

andreyvelich avatar May 24 '24 19:05 andreyvelich