OCPEDGE-2084: Add PacemakerStatus CRD for two-node fencing
Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Not gated because it's only used by CEO when two-node has transitioned.
Works in conjunction with https://github.com/openshift/cluster-etcd-operator/pull/1487
Depends on https://github.com/kubernetes-sigs/controller-tools/pull/1324, which in turn needs a bump for go 1.25 In the meantime, I've vendored: https://github.com/openshift/kubernetes-sigs-controller-tools/pull/33
@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.
In response to this:
Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Gated by DualReplica feature and managed by two-node-fencing component.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Hello @jaypoulz! Some important instructions when contributing to openshift/api: API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.
@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.
In response to this:
Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Gated by DualReplica feature and managed by two-node-fencing component.
Works in conjunction with https://github.com/openshift/cluster-etcd-operator/pull/1487
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@jaypoulz thank you for the PR, do you mind making the CI happy?
Hi @saschagrunert :) Working on it! :D New to this repo so working through beginner challenges :smile_cat:
A few open questions I have:
- This is a config object of a sort. It's created by cluster-etcd-operator only when you have a two-node cluster and only for the purposes of gathering information about the health of pacemaker (our ha tool) from the nodes. I put it in etcd/tnf (two node fencing) because it seemed sensible. But I'm not sure if it needs to be in
config.
That said, it doesn't work like a normal config - there's no spec and it shouldn't be created during bootstrap. The CRD just needs to be present when the CEO runs an cronjob to post an update to it.
-
bash hack/update-protobuf.shfailed for me because it wanted the path to be installed in my go path. That said, cursor happily runs it and copies over the files without issue. I'm just skeptical of the zz_generated files, but I assume those are verified by CI? -
For the non-boolean enum fields. Should I be creating static string definitions that can be exported to CEO? How do I generate those?
Yeah, I'll ignore the CI failures for now, running ./hack/update-codegen.sh locally also gives me a diff in openapi/generated_openapi/zz_generated.openapi.go. :upside_down_face:
A few open questions I have:
- This is a config object of a sort. It's created by cluster-etcd-operator only when you have a two-node cluster and only for the purposes of gathering information about the health of pacemaker (our ha tool) from the nodes. I put it in etcd/tnf (two node fencing) because it seemed sensible. But I'm not sure if it needs to be in
config.
I'm new to API review, but my gut feeling tells me that a dedicated etcd API group sounds fine for that purpose.
That said, it doesn't work like a normal config - there's no spec and it shouldn't be created during bootstrap. The CRD just needs to be present when the CEO runs an cronjob to post an update to it.
bash hack/update-protobuf.shfailed for me because it wanted the path to be installed in my go path. That said, cursor happily runs it and copies over the files without issue. I'm just skeptical of the zz_generated files, but I assume those are verified by CI?
You can also try to run it in a container by make verify-with-container.
- For the non-boolean enum fields. Should I be creating static string definitions that can be exported to CEO? How do I generate those?
Do you mind elaborating on that? Do you mean generating the code for the unions?
API docs ref: https://github.com/openshift/enhancements/blob/master/dev-guide/api-conventions.md#writing-a-union-in-go
@jaypoulz is there an OpenShift enhancement available for this change?
@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.
In response to this:
Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Not gated because it's only used
Works in conjunction with https://github.com/openshift/cluster-etcd-operator/pull/1487
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.
In response to this:
Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Not gated because it's only used by CEO when two-node has transitioned.
Works in conjunction with https://github.com/openshift/cluster-etcd-operator/pull/1487
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
The high level feature is explained here: https://github.com/openshift/enhancements/blob/91301b6458c6a4730708c9b04a24457df3a8b1e2/enhancements/two-node-fencing/tnf.md#L4
The functionality I'm working on is explained in a little bit of detail here: https://issues.redhat.com/browse/OCPEDGE-2097
From a high level - two node fencing relies on a program (pacemaker) to understand the status of etcd in the cluster because it can lose quorum at any time if one of the two nodes fail. This API is used to post status updates about pacemaker's health in a way that will only register while you have quorum - so it's almost 100% for informational purposes. The only time we'll reference it outside of just alerting the end user of CEO going degraded when something is unhealth is during node-replacement events, where CEO needs to know the nodes and IPs in the pacemaker cluster in order to safely allow or deny a node replacement scenario.
What I was asking about was: https://github.com/openshift/enhancements/blob/master/dev-guide/api-conventions.md#do-not-use-boolean-fields
Do we usually provide constants for the non-boolean fields for easy reference?
- For the non-boolean enum fields. Should I be creating static string definitions that can be exported to CEO? How do I generate those?
Do you mind elaborating on that? Do you mean generating the code for the unions?
API docs ref: https://github.com/openshift/enhancements/blob/master/dev-guide/api-conventions.md#writing-a-union-in-go
I saw the kinds of constants I was thinking about in the control plan topology type, so I decided to proceed in that direction. Should be more obvious what I meant now. :)
@saschagrunert CI is happy :face_holding_back_tears:
Or at least it was when I wrote that comment - I decided to update the READMEs. :crossed_fingers: I didn't break anything
/test okd-scos-e2e-aws-ovn
@saschagrunert I think I hit all of your comments. I've also asked pacemaker expert CLumens from the RHEL team to make sure I wasn't misrepresenting anything in the new spec.
/retest
/retest
Since @saschagrunert has said this is good from his side, I'll now take over the API review. Since it's shift week, I'm not expecting to pick this up until Monday
Sounds good to me! :)
/retest-required
/retest-required
/retest-required
📝 Walkthrough
Walkthrough
Adds API group etcd.openshift.io v1alpha1 and registers it with the scheme. Introduces a cluster-scoped PacemakerCluster CRD (singleton required to be named "cluster") with a status subresource and a detailed status model (cluster, node, resource, fencing-agent types and enums) plus extensive OpenAPI validations. Adds CRD manifests (multiple feature-gated variants), generated deepcopy, OpenAPI/Swagger doc and schema wiring, feature-gated CRD listings, tests and test fixtures, a Makefile test target, README documentation, payload-manifest inclusion, and wiring in global install to call etcd.Install.
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 7.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title 'Add PacemakerStatus CRD for two-node fencing' directly references the main change: introduction of a PacemakerCluster CRD for managing Pacemaker cluster health. |
| Description check | ✅ Passed | The PR description comprehensively explains the purpose, context, and dependencies of the PacemakerStatus CRD addition for dual-replica etcd deployments. |
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing touches
- [ ] 📝 Generate docstrings
[!WARNING] There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.5.0)
Error: build linters: unable to load custom analyzer "kubeapilinter": tools/_output/bin/kube-api-linter.so, plugin: not implemented The command is terminated due to an error: build linters: unable to load custom analyzer "kubeapilinter": tools/_output/bin/kube-api-linter.so, plugin: not implemented
Comment @coderabbitai help to get the list of available commands and usage tips.
/retest-required
/retest-required
The verification of this patch depends on https://github.com/kubernetes-sigs/controller-tools/pull/1324/files The validation order for the conditions are not guaranteed to be in order. This patches fixes that.
@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
In response to this:
Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Not gated because it's only used by CEO when two-node has transitioned.
Works in conjunction with https://github.com/openshift/cluster-etcd-operator/pull/1487
Depends on https://github.com/kubernetes-sigs/controller-tools/pull/1324
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
In response to this:
Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Not gated because it's only used by CEO when two-node has transitioned.
Works in conjunction with https://github.com/openshift/cluster-etcd-operator/pull/1487
Depends on https://github.com/kubernetes-sigs/controller-tools/pull/1324, which in turn needs a bump for go 1.25 In the meantime, I've vendored: https://github.com/jaypoulz/controller-tools/tree/strict-validation-ordering-v0.18
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
In response to this:
Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Not gated because it's only used by CEO when two-node has transitioned.
Works in conjunction with https://github.com/openshift/cluster-etcd-operator/pull/1487
Depends on https://github.com/kubernetes-sigs/controller-tools/pull/1324, which in turn needs a bump for go 1.25 In the meantime, I've vendored: https://github.com/openshift/kubernetes-sigs-controller-tools/pull/33
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.