api OCPEDGE-2084: Add PacemakerStatus CRD for two-node fencing

Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Not gated because it's only used by CEO when two-node has transitioned.

Works in conjunction with https://github.com/openshift/cluster-etcd-operator/pull/1487

Depends on https://github.com/kubernetes-sigs/controller-tools/pull/1324, which in turn needs a bump for go 1.25 In the meantime, I've vendored: https://github.com/openshift/kubernetes-sigs-controller-tools/pull/33

Oct 21 '25 18:10 jaypoulz

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Oct 21 '25 18:10 openshift-ci-robot

Hello @jaypoulz! Some important instructions when contributing to openshift/api: API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

Oct 21 '25 18:10 openshift-ci[bot]

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Oct 21 '25 18:10 openshift-ci-robot

@jaypoulz thank you for the PR, do you mind making the CI happy?

Oct 22 '25 14:10 saschagrunert

Hi @saschagrunert :) Working on it! :D New to this repo so working through beginner challenges :smile_cat:

Oct 22 '25 14:10 jaypoulz

A few open questions I have:

This is a config object of a sort. It's created by cluster-etcd-operator only when you have a two-node cluster and only for the purposes of gathering information about the health of pacemaker (our ha tool) from the nodes. I put it in etcd/tnf (two node fencing) because it seemed sensible. But I'm not sure if it needs to be in config.

That said, it doesn't work like a normal config - there's no spec and it shouldn't be created during bootstrap. The CRD just needs to be present when the CEO runs an cronjob to post an update to it.

bash hack/update-protobuf.sh failed for me because it wanted the path to be installed in my go path. That said, cursor happily runs it and copies over the files without issue. I'm just skeptical of the zz_generated files, but I assume those are verified by CI?
For the non-boolean enum fields. Should I be creating static string definitions that can be exported to CEO? How do I generate those?

Oct 22 '25 14:10 jaypoulz

Yeah, I'll ignore the CI failures for now, running ./hack/update-codegen.sh locally also gives me a diff in openapi/generated_openapi/zz_generated.openapi.go. :upside_down_face:

I'm new to API review, but my gut feeling tells me that a dedicated etcd API group sounds fine for that purpose.

You can also try to run it in a container by make verify-with-container.

Do you mind elaborating on that? Do you mean generating the code for the unions?

API docs ref: https://github.com/openshift/enhancements/blob/master/dev-guide/api-conventions.md#writing-a-union-in-go

@jaypoulz is there an OpenShift enhancement available for this change?

Oct 23 '25 12:10 saschagrunert

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Oct 23 '25 12:10 openshift-ci-robot

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Oct 23 '25 12:10 openshift-ci-robot

The high level feature is explained here: https://github.com/openshift/enhancements/blob/91301b6458c6a4730708c9b04a24457df3a8b1e2/enhancements/two-node-fencing/tnf.md#L4

The functionality I'm working on is explained in a little bit of detail here: https://issues.redhat.com/browse/OCPEDGE-2097

From a high level - two node fencing relies on a program (pacemaker) to understand the status of etcd in the cluster because it can lose quorum at any time if one of the two nodes fail. This API is used to post status updates about pacemaker's health in a way that will only register while you have quorum - so it's almost 100% for informational purposes. The only time we'll reference it outside of just alerting the end user of CEO going degraded when something is unhealth is during node-replacement events, where CEO needs to know the nodes and IPs in the pacemaker cluster in order to safely allow or deny a node replacement scenario.

Oct 23 '25 12:10 jaypoulz

What I was asking about was: https://github.com/openshift/enhancements/blob/master/dev-guide/api-conventions.md#do-not-use-boolean-fields

Do we usually provide constants for the non-boolean fields for easy reference?

Oct 23 '25 17:10 jaypoulz

I saw the kinds of constants I was thinking about in the control plan topology type, so I decided to proceed in that direction. Should be more obvious what I meant now. :)

Oct 23 '25 22:10 jaypoulz

@saschagrunert CI is happy :face_holding_back_tears:

Oct 23 '25 22:10 jaypoulz

Or at least it was when I wrote that comment - I decided to update the READMEs. :crossed_fingers: I didn't break anything

Oct 23 '25 23:10 jaypoulz

/test okd-scos-e2e-aws-ovn

Oct 24 '25 08:10 saschagrunert

@saschagrunert I think I hit all of your comments. I've also asked pacemaker expert CLumens from the RHEL team to make sure I wasn't misrepresenting anything in the new spec.

Oct 28 '25 00:10 jaypoulz

/retest

Oct 28 '25 08:10 saschagrunert

/retest

Oct 29 '25 07:10 saschagrunert

Since @saschagrunert has said this is good from his side, I'll now take over the API review. Since it's shift week, I'm not expecting to pick this up until Monday

Oct 29 '25 10:10 JoelSpeed

Sounds good to me! :)

Oct 29 '25 13:10 jaypoulz

/retest-required

Nov 04 '25 19:11 jaypoulz

/retest-required

Nov 04 '25 19:11 jaypoulz

/retest-required

Nov 05 '25 22:11 jaypoulz

📝 Walkthrough

Walkthrough

Adds API group etcd.openshift.io v1alpha1 and registers it with the scheme. Introduces a cluster-scoped PacemakerCluster CRD (singleton required to be named "cluster") with a status subresource and a detailed status model (cluster, node, resource, fencing-agent types and enums) plus extensive OpenAPI validations. Adds CRD manifests (multiple feature-gated variants), generated deepcopy, OpenAPI/Swagger doc and schema wiring, feature-gated CRD listings, tests and test fixtures, a Makefile test target, README documentation, payload-manifest inclusion, and wiring in global install to call etcd.Install.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 7.14% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title 'Add PacemakerStatus CRD for two-node fencing' directly references the main change: introduction of a PacemakerCluster CRD for managing Pacemaker cluster health.
Description check	✅ Passed	The PR description comprehensively explains the purpose, context, and dependencies of the PacemakerStatus CRD addition for dual-replica etcd deployments.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

[ ] 📝 Generate docstrings

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Dec 02 '25 21:12 coderabbitai[bot]

/retest-required

Dec 17 '25 18:12 jaypoulz

/retest-required

Jan 08 '26 18:01 jaypoulz

The verification of this patch depends on https://github.com/kubernetes-sigs/controller-tools/pull/1324/files The validation order for the conditions are not guaranteed to be in order. This patches fixes that.

Jan 15 '26 19:01 jaypoulz

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Jan 15 '26 19:01 openshift-ci-robot

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Jan 15 '26 20:01 openshift-ci-robot

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Jan 15 '26 20:01 openshift-ci-robot