
[Feature] TTL Delete RayJob CRD After Job Termination

peterghaddad opened this issue 2 years ago · 4 comments

Search before asking

  • [X] I had searched in the issues and found no similar feature requirement.

Description

Currently, KubeRay can auto-terminate the cluster after job completion, but there is no mechanism to auto-delete the RayJob instance (the owner) itself.

This is beneficial for automatic Kubernetes cluster cleanup, and behaves similarly to a Kubernetes Job's TTL.
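For reference, the Kubernetes Job TTL behavior this request mirrors is controlled by `ttlSecondsAfterFinished`; a minimal sketch (the job name and container are illustrative):

```yaml
# Plain Kubernetes Job: the TTL-after-finished controller deletes the
# Job object (and its Pods) 7 days after the Job completes or fails.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job          # illustrative name
spec:
  ttlSecondsAfterFinished: 604800   # 7 days
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          command: ["echo", "done"]
```

The feature request is for the RayJob CRD to gain an analogous knob.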

Use case

Delete the RayJob alongside the cluster, controlled via an additional flag.

Related issues

No response

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

peterghaddad avatar Feb 27 '24 02:02 peterghaddad

@kevin85421 In terms of the solution direction, what if we can specify the submitter Job template, rather than the Pod template? Then you can set the BackoffLimit, ttlSecondsAfterFinished and whatever else someone might need in the future. Given that the pod template is already being patched, why couldn't we do the same for the job template instead?
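To illustrate the proposal above: since KubeRay already lets users patch the submitter Pod, the suggestion is to expose the enclosing Job template instead, so fields like `backoffLimit` and `ttlSecondsAfterFinished` become available for free. A hedged sketch of what that could look like (the `submitterJobTemplate` field is hypothetical, not an existing API):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: example-rayjob       # illustrative name
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  shutdownAfterJobFinishes: true
  # Hypothetical field sketched from this comment: a full Job template
  # for the submitter, rather than only a Pod template.
  submitterJobTemplate:
    spec:
      backoffLimit: 1
      ttlSecondsAfterFinished: 604800   # clean up the submitter after 7 days
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: ray-job-submitter
              image: rayproject/ray:2.9.0   # illustrative image
  # rayClusterSpec elided
```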

mickvangelderen avatar Jun 12 '24 08:06 mickvangelderen

@anyscalesam @jjyao how was this completed? I am failing to see how it was fixed.

mickvangelderen avatar Jun 25 '24 07:06 mickvangelderen

@MortalHappiness will take this issue.

kevin85421 avatar Jul 02 '24 16:07 kevin85421

@kevin85421 Thanks.

MortalHappiness avatar Jul 02 '24 16:07 MortalHappiness

@MortalHappiness @kevin85421 it does not seem like the implementation in #2225 allows automatic deletion of the submitter after, let's say 1 week, like the TTL field. This issue does mention TTL. Am I missing something?

I'd like for my peers who create the jobs to be able to view them for some time, but to automatically clean up the job after a week. Is that possible with the current implementation?

mickvangelderen avatar Jul 15 '24 18:07 mickvangelderen

@mickvangelderen I believe the TTLSecondsAfterFinished field applies as a deletion TTL if RayJob deletion is enabled
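Assuming the fields behave as described in this comment, the combination would look roughly like the following. Field availability depends on the KubeRay version, so treat this as a sketch to verify against the docs:

```yaml
# Sketch: tear down the RayCluster as soon as the job ends, and
# (with RayJob deletion enabled) delete the RayJob resource itself
# once the TTL elapses.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: example-rayjob       # illustrative name
spec:
  entrypoint: python script.py      # illustrative entrypoint
  shutdownAfterJobFinishes: true    # free cluster resources immediately
  ttlSecondsAfterFinished: 604800   # deletion TTL of 7 days
```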

andrewsykim avatar Jul 15 '24 18:07 andrewsykim

Thanks for the reply. If that is the case, then is it still possible to delete the cluster head and workers (to free up resources) immediately after the job finishes, and then have the submitter be deleted after a week?

mickvangelderen avatar Jul 15 '24 18:07 mickvangelderen

There was a discussion in this thread about whether we want to control this behavior separately: https://github.com/ray-project/kuberay/pull/2097#discussion_r1576422173

As of now, I think you either delete the whole RayJob or only the cluster. I think we're looking for user feedback on whether we should allow controlling the behavior separately (deleting the cluster and deleting the whole RayJob). Is this something you would find useful?

andrewsykim avatar Jul 15 '24 18:07 andrewsykim

I do think it would be useful to control the deletion of the cluster and the submitter separately. However, I might be missing other solutions that would work for me and my team, and so I will give some more detail about how we are using ray.

We have a tool that allows spawning work on a cluster. That work can be performed by a native Kubernetes Pod or, if the user so desires, by Ray workers through a RayJob to leverage the distributed computing facilities Ray offers. We do not use a persistent Ray cluster because the persistent cluster would sometimes end up in an unreliable state, possibly due to our unreliable hardware. RayJobs have been working great.

We want our users to be able to view the logs of their work for about a week. Any data that must be persistent is stored in an external system. After one week, we want to clean up all jobs to keep things tidy and free up space. For Kubernetes Jobs, we use the ttlSecondsAfterFinished field. For RayJobs, we are using ShutdownAfterJobFinishes to stop the cluster and the workers, which frees up any resources (CPU, GPU) that they claimed. We like that the submitter is not automatically deleted, because our users need it to view the logs. However, we would like to clean up the submitter after a week. If we were able to specify ttlSecondsAfterFinished on the submitter Job, we would have an easy solution. Instead, we need to set up a cron job to work around not being able to specify a certain field in the submitter spec.

Similarly, we would also like to set the backoffLimit to 1 for the submitter, instead of the default of 3 that Ray sets. Most often the issue is that the entrypoint our users specified is somehow incorrect, which causes the submitter to restart 3 times, which is noisy and useless.

Hopefully this clarifies why we are interested in this functionality. If you see a better solution direction to accomplish our objectives, please let me know.

mickvangelderen avatar Jul 15 '24 19:07 mickvangelderen

Do you only care about the submitter being deleted or do you also care about the RayJob resource itself being cleaned up?

Similarly, we would also like to set the backoffLimit to 1 for the submitter pod, instead of the default 3 that ray sets. Most often the issue is that the entrypoint that our users have specified is somehow incorrect, and it causes the submitter to restart 3 times which is noisy and useless.

This will be possible in KubeRay v1.2: https://github.com/ray-project/kuberay/pull/2091
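If I read the v1.2 API correctly, the submitter's backoff limit is exposed via a `submitterConfig` block on the RayJob spec; a hedged sketch (verify the field name against the KubeRay CRD for your release):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: example-rayjob       # illustrative name
spec:
  entrypoint: python script.py      # illustrative entrypoint
  # Assumed KubeRay v1.2+ field for overriding the submitter Job's
  # backoffLimit; check your installed CRD before relying on it.
  submitterConfig:
    backoffLimit: 1
```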

Hopefully this clarifies why we are interested in this functionality. If you see a better solution direction to accomplish our objectives, please let me know.

I think fundamentally what you need is better tooling to persist the Ray job logs. Once you have this you don't need to care about how long the cluster stays around when the job is deleted. Although I can see value in being able to read the logs directly with kubectl.

andrewsykim avatar Jul 15 '24 20:07 andrewsykim

Do you only care about the submitter being deleted or do you also care about the RayJob resource itself being cleaned up?

I might be confused with the terminology. Looking at the RayJob quickstart I want the RayCluster to be deleted immediately after the job finishes and I would like the logs to be available for a week. I thought the logs were tied to something referred to as "the submitter". I'm not sure if this submitter is a Pod, a Job, a Ray Job (notice the space) or something else.

mickvangelderen avatar Jul 15 '24 22:07 mickvangelderen

@kevin85421 I remember that you want to make the logging use structured logging so that logs can be read from external tools. Is that feature related to the log persistence mentioned in this thread?

MortalHappiness avatar Jul 16 '24 06:07 MortalHappiness