fix(operator): fix TrainJob suspend/resume webhook error (#3008)
What this PR does / why we need it:
- Decouple JobSet suspend toggling from the server-side apply (SSA) payload so that the controller no longer trips the JobSet webhook's "spec.replicatedJobs is immutable" error when suspending or resuming existing workloads.
- Add a clarifying comment that suspend for existing JobSets is handled via SyncSuspend, preventing future regressions.
Which issue(s) this PR fixes:
- Fixes #3008
Checklist:
- [ ] Docs included if any changes are user-facing
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: (no approvers yet)
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.
/ok-to-test
Pull Request Test Coverage Report for Build 20355704238
Details
- 24 of 48 (50.0%) changed or added relevant lines in 5 files are covered.
- 1 unchanged line in 1 file lost coverage.
- Overall coverage increased (+0.03%) to 51.469%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/runtime/core/clustertrainingruntime.go | 0 | 3 | 0.0% |
| pkg/runtime/core/trainingruntime.go | 0 | 3 | 0.0% |
| pkg/controller/trainjob_controller.go | 0 | 8 | 0.0% |
| pkg/runtime/framework/plugins/jobset/jobset.go | 15 | 25 | 60.0% |
| Total: | 24 | 48 | 50.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controller/trainjob_controller.go | 1 | 0.0% |
| Total: | 1 | |

| Totals | |
|---|---|
| Change from base Build 20289967122: | 0.03% |
| Covered Lines: | 1261 |
| Relevant Lines: | 2450 |
💛 - Coveralls
/assign @terrytangyuan
/close It looks to be working fine now: https://github.com/kubeflow/trainer/issues/3008#issuecomment-3805732673
@andreyvelich: Closed this PR.