[BUG] Ray plugin TTL fails to delete cluster resources if the ray job never reaches Success/Failure state
Describe the bug
When a Ray job is created via the Flyte Ray plugin, the TTL does not delete the cluster resources if the job never completes with either a Success or Failure status.
For example, the job below is stuck at Initializing and never completes:
$ kubectl describe rayjob -n amobeen-development f049a65407ecf41bf91b-n0-3 | grep "Job Deployment Status" -C 2
Status:
...
Job Deployment Status: Initializing
As a result, the corresponding cluster is still present long after the TTL has expired:
$ kubectl get raycluster -n amobeen-development
NAME                                         DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
f810eeadc622248bb8eb-n0-3-raycluster-chrqp   2                 2                   failed   12h
Expected behavior
The cluster resources should be deleted once the TTL is reached, regardless of the job state.
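To make the difference between the reported and the expected behavior concrete, here is a minimal Python sketch. It is not the plugin's or KubeRay's actual code: the function names, the choice to measure the TTL from a start timestamp, and the example values are all assumptions made for illustration. Only the state names (Success, Failure, Initializing) come from this report.

```python
from datetime import datetime, timedelta, timezone

# States named in this report as the ones that currently lead to cleanup.
FINISHED_STATES = {"Success", "Failure"}


def ttl_elapsed(start: datetime, ttl_seconds: int, now: datetime) -> bool:
    """Return True once ttl_seconds have passed since start (assumed reference point)."""
    return now - start >= timedelta(seconds=ttl_seconds)


def cleanup_reported(state: str, start: datetime, ttl_seconds: int, now: datetime) -> bool:
    """Reported behavior (hypothetical model): cleanup only fires for finished jobs."""
    return state in FINISHED_STATES and ttl_elapsed(start, ttl_seconds, now)


def cleanup_expected(state: str, start: datetime, ttl_seconds: int, now: datetime) -> bool:
    """Expected behavior: the job state is ignored once the TTL has elapsed."""
    return ttl_elapsed(start, ttl_seconds, now)


if __name__ == "__main__":
    # Hypothetical values: a job stuck at Initializing, checked 12 hours after start
    # with a one-hour TTL.
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    now = start + timedelta(hours=12)
    print("reported:", cleanup_reported("Initializing", start, 3600, now))
    print("expected:", cleanup_expected("Initializing", start, 3600, now))
```

Under this model, a job stuck at Initializing is never cleaned up by the first predicate, which matches the symptom above, while the second predicate deletes it as soon as the TTL has elapsed.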
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
- [X] Yes
Have you read the Code of Conduct?
- [X] Yes
Thank you for opening your first issue here! 🛠
Check if we're missing a state change.
@guozhen-la, could you share an example to reproduce? I can deep dive into it.
Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable. Thank you for your contribution and understanding! 🙏