training-operator feat: Support debugging webhooks locally

What this PR does / why we need it: Add a method for quick local development and debugging, enabling developers to rapidly develop and debug the operator on their local machines, such as a Mac.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged): Fixes #2226

Checklist:

[x] Docs included if any changes are user facing

Aug 17 '24 13:08 shalousun

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Aug 17 '24 13:08 google-oss-prow[bot]

Even if the certificates are configured locally, the kube-apiserver may still not be able to access the local webhook server (without using kind).

Should we add an env ENABLE_WEBHOOKS=false to disable webhooks for easier local debugging?

@shalousun @tenzen-y @andreyvelich What do you think?

Aug 22 '24 08:08 Syulin7

Even if the certificates are configured locally, the kube-apiserver may still not be able to access the local webhook server (without using kind).

Should we add an env ENABLE_WEBHOOKS=false to disable webhooks for easier local debugging?

@shalousun @tenzen-y @andreyvelich What do you think?

I agree with this approach. In my local environment, I actually disabled webhooks by adding the ENABLE_WEBHOOKS flag. This is a modification I made myself, which you can see in this commit: https://github.com/shalousun/training-operator/commit/45d9ebb8b42a429fa6de3ff5272de02d3dba415b.

Aug 22 '24 11:08 shalousun

Even if the certificates are configured locally, the kube-apiserver may still not be able to access the local webhook server (without using kind).

Should we add an env ENABLE_WEBHOOKS=false to disable webhooks for easier local debugging?

@shalousun @tenzen-y @andreyvelich What do you think?

Instead of the special environment variable, you can use the "Ignore" failurePolicy like here: https://github.com/kubeflow/training-operator/blob/3f9b0a41663456998e352212dfbfd1386b48d7ab/manifests/base/webhook/manifests.yaml#L14

Aug 22 '24 15:08 tenzen-y

Even if the certificates are configured locally, the kube-apiserver may still not be able to access the local webhook server (without using kind). Should we add an env ENABLE_WEBHOOKS=false to disable webhooks for easier local debugging? @shalousun @tenzen-y @andreyvelich What do you think?

Instead of the special environment variable, you can use the "Ignore" failurePolicy like here:

https://github.com/kubeflow/training-operator/blob/3f9b0a41663456998e352212dfbfd1386b48d7ab/manifests/base/webhook/manifests.yaml#L14

First, I agree with using the failurePolicy setting, which is an elegant way and does not cause destructive changes to webhooks already existing in the cluster (for example, when starting an operator on a local PC to connect to a remote test cluster for debugging). However, for the scenario of rapid local debugging, we also need to skip the webhook startup checks, otherwise, there will be certificate verification issues. Therefore, providing the ENABLE_WEBHOOKS configuration should be necessary. Secondly, for modifying the failurePolicy, I believe it should be handled through a kubebuilder patch that is automatically applied, and this patch should be kept separately in the overlays/local directory.

Aug 22 '24 17:08 shalousun

@shalousun Do you have a startup check if you set failurePolicy: Ignore ?

Aug 28 '24 16:08 andreyvelich

Pull Request Test Coverage Report for Build 10432779786

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
6 unchanged lines in 1 file lost coverage.
Overall coverage decreased (-0.03%) to 33.435%

Files with Coverage Reduction	New Missed Lines	%
pkg/controller.v1/mpi/mpijob_controller.go	6	80.67%
<!--	Total:	6

Totals
Change from base Build 10408646954:	-0.03%
Covered Lines:	3945
Relevant Lines:	11799

💛 - Coveralls

Aug 28 '24 16:08 coveralls

@shalousun Do you have a startup check if you set failurePolicy: Ignore ?

ENABLE_WEBHOOKS=false

I have tested by setting failurePolicy: Ignore which allows the webhook to not enforce failure checks, enabling normal testing of the operator in local development tools. However, the webhook configuration is auto-generated, and the generated policy is embedded in the Go code used to write the webhook, such as

// +kubebuilder:webhook:path=/validate-kubeflow-org-v1-pytorchjob,mutating=false,failurePolicy=fail,sideEffects=None,groups=kubeflow.org,resources=pytorchjobs,verbs=create;update,versions=v1,name=validator.pytorchjob.training-operator.kubeflow.org,admissionReviewVersions=v1

var _ webhook.CustomValidator = &Webhook{}

Therefore, providing a specific template for local debugging under the overlays directory is a better approach as it does not require any manual modifications. Finally, adding ENABLE_WEBHOOKS=false is also necessary as it allows local development and debugging without having to worry about the complex certificate generation process.

Aug 31 '24 16:08 shalousun

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Nov 29 '24 20:11 github-actions[bot]

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Dec 19 '24 20:12 github-actions[bot]