[BUG] Ray plugin TTL fails to delete cluster resources if the ray job never reaches Success/Failure state

Open guozhen-la opened this issue 2 years ago • 4 comments

Describe the bug

When a Ray job is created via the Flyte Ray plugin, the TTL does not delete the cluster resources as expected if the job never completes with either a Success or Failure status.

For example, the job below is stuck at Initializing and never completes:

$ kubectl describe rayjob -n amobeen-development f049a65407ecf41bf91b-n0-3 | grep "Job Deployment Status"  -C 2
Status:
  ...
  Job Deployment Status:  Initializing

As a result, the corresponding cluster is never removed, even long after the TTL expires:

$ kubectl get raycluster -n amobeen-development
NAME                                         DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
f810eeadc622248bb8eb-n0-3-raycluster-chrqp   2                 2                   failed   12h

Expected behavior

The cluster resources should be deleted upon reaching the TTL, regardless of the job state.
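
To make the difference between the observed and the expected behavior concrete, here is a minimal sketch in Python. It is not the Flyte plugin's or KubeRay's actual code: the function names, the `TERMINAL_STATES` set, and the idea that cleanup is gated on a terminal state are all assumptions used for illustration only, and the timestamps are made up.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical terminal states, named after the statuses mentioned in this issue.
TERMINAL_STATES = {"Success", "Failure"}


def ttl_elapsed(created_at, ttl_seconds, now):
    """Return True once at least ttl_seconds have passed since created_at."""
    return now - created_at >= timedelta(seconds=ttl_seconds)


def delete_when_finished(status, created_at, ttl_seconds, now):
    """Assumed model of the observed behavior: cleanup requires a terminal state."""
    return status in TERMINAL_STATES and ttl_elapsed(created_at, ttl_seconds, now)


def delete_regardless_of_state(status, created_at, ttl_seconds, now):
    """Expected behavior from this report: cleanup depends only on the TTL."""
    return ttl_elapsed(created_at, ttl_seconds, now)


# Illustrative values: a cluster that is 12 hours old with a 1 hour TTL.
created = datetime(2023, 7, 26, 7, 0, tzinfo=timezone.utc)
now = created + timedelta(hours=12)

print(delete_when_finished("Initializing", created, 3600, now))        # False
print(delete_regardless_of_state("Initializing", created, 3600, now))  # True
```

In this sketch a job stuck at Initializing is never cleaned up by the first check, which matches the symptom described above, while the second check removes it once the TTL has passed.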

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

guozhen-la avatar Jul 26 '23 19:07 guozhen-la

Thank you for opening your first issue here! 🛠

welcome[bot] avatar Jul 26 '23 19:07 welcome[bot]

Check if we're missing a state change.

eapolinario avatar Jul 28 '23 17:07 eapolinario

@guozhen-la, could you share an example to reproduce? I can deep dive into it.

pingsutw avatar Aug 15 '23 22:08 pingsutw

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable. Thank you for your contribution and understanding! 🙏

github-actions[bot] avatar May 12 '24 00:05 github-actions[bot]