api icon indicating copy to clipboard operation
api copied to clipboard

OCPEDGE-2084: Add PacemakerStatus CRD for two-node fencing

Open jaypoulz opened this issue 3 months ago • 33 comments

Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Not gated because it's only used by CEO when two-node has transitioned.

Works in conjunction with https://github.com/openshift/cluster-etcd-operator/pull/1487

Depends on https://github.com/kubernetes-sigs/controller-tools/pull/1324, which in turn needs a bump for go 1.25 In the meantime, I've vendored: https://github.com/openshift/kubernetes-sigs-controller-tools/pull/33

jaypoulz avatar Oct 21 '25 18:10 jaypoulz

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Gated by DualReplica feature and managed by two-node-fencing component.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Oct 21 '25 18:10 openshift-ci-robot

Hello @jaypoulz! Some important instructions when contributing to openshift/api: API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

openshift-ci[bot] avatar Oct 21 '25 18:10 openshift-ci[bot]

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Gated by DualReplica feature and managed by two-node-fencing component.

Works in conjunction with https://github.com/openshift/cluster-etcd-operator/pull/1487

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Oct 21 '25 18:10 openshift-ci-robot

@jaypoulz thank you for the PR, do you mind making the CI happy?

saschagrunert avatar Oct 22 '25 14:10 saschagrunert

Hi @saschagrunert :) Working on it! :D New to this repo so working through beginner challenges :smile_cat:

jaypoulz avatar Oct 22 '25 14:10 jaypoulz

A few open questions I have:

  1. This is a config object of a sort. It's created by cluster-etcd-operator only when you have a two-node cluster and only for the purposes of gathering information about the health of pacemaker (our ha tool) from the nodes. I put it in etcd/tnf (two node fencing) because it seemed sensible. But I'm not sure if it needs to be in config.

That said, it doesn't work like a normal config - there's no spec and it shouldn't be created during bootstrap. The CRD just needs to be present when the CEO runs an cronjob to post an update to it.

  1. bash hack/update-protobuf.sh failed for me because it wanted the path to be installed in my go path. That said, cursor happily runs it and copies over the files without issue. I'm just skeptical of the zz_generated files, but I assume those are verified by CI?

  2. For the non-boolean enum fields. Should I be creating static string definitions that can be exported to CEO? How do I generate those?

jaypoulz avatar Oct 22 '25 14:10 jaypoulz

Yeah, I'll ignore the CI failures for now, running ./hack/update-codegen.sh locally also gives me a diff in openapi/generated_openapi/zz_generated.openapi.go. :upside_down_face:

A few open questions I have:

  1. This is a config object of a sort. It's created by cluster-etcd-operator only when you have a two-node cluster and only for the purposes of gathering information about the health of pacemaker (our ha tool) from the nodes. I put it in etcd/tnf (two node fencing) because it seemed sensible. But I'm not sure if it needs to be in config.

I'm new to API review, but my gut feeling tells me that a dedicated etcd API group sounds fine for that purpose.

That said, it doesn't work like a normal config - there's no spec and it shouldn't be created during bootstrap. The CRD just needs to be present when the CEO runs an cronjob to post an update to it.

  1. bash hack/update-protobuf.sh failed for me because it wanted the path to be installed in my go path. That said, cursor happily runs it and copies over the files without issue. I'm just skeptical of the zz_generated files, but I assume those are verified by CI?

You can also try to run it in a container by make verify-with-container.

  1. For the non-boolean enum fields. Should I be creating static string definitions that can be exported to CEO? How do I generate those?

Do you mind elaborating on that? Do you mean generating the code for the unions?

API docs ref: https://github.com/openshift/enhancements/blob/master/dev-guide/api-conventions.md#writing-a-union-in-go


@jaypoulz is there an OpenShift enhancement available for this change?

saschagrunert avatar Oct 23 '25 12:10 saschagrunert

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Not gated because it's only used

Works in conjunction with https://github.com/openshift/cluster-etcd-operator/pull/1487

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Oct 23 '25 12:10 openshift-ci-robot

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Not gated because it's only used by CEO when two-node has transitioned.

Works in conjunction with https://github.com/openshift/cluster-etcd-operator/pull/1487

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Oct 23 '25 12:10 openshift-ci-robot

The high level feature is explained here: https://github.com/openshift/enhancements/blob/91301b6458c6a4730708c9b04a24457df3a8b1e2/enhancements/two-node-fencing/tnf.md#L4

The functionality I'm working on is explained in a little bit of detail here: https://issues.redhat.com/browse/OCPEDGE-2097

From a high level - two node fencing relies on a program (pacemaker) to understand the status of etcd in the cluster because it can lose quorum at any time if one of the two nodes fail. This API is used to post status updates about pacemaker's health in a way that will only register while you have quorum - so it's almost 100% for informational purposes. The only time we'll reference it outside of just alerting the end user of CEO going degraded when something is unhealth is during node-replacement events, where CEO needs to know the nodes and IPs in the pacemaker cluster in order to safely allow or deny a node replacement scenario.

jaypoulz avatar Oct 23 '25 12:10 jaypoulz

What I was asking about was: https://github.com/openshift/enhancements/blob/master/dev-guide/api-conventions.md#do-not-use-boolean-fields

Do we usually provide constants for the non-boolean fields for easy reference?

jaypoulz avatar Oct 23 '25 17:10 jaypoulz

  1. For the non-boolean enum fields. Should I be creating static string definitions that can be exported to CEO? How do I generate those?

Do you mind elaborating on that? Do you mean generating the code for the unions?

API docs ref: https://github.com/openshift/enhancements/blob/master/dev-guide/api-conventions.md#writing-a-union-in-go

I saw the kinds of constants I was thinking about in the control plan topology type, so I decided to proceed in that direction. Should be more obvious what I meant now. :)

jaypoulz avatar Oct 23 '25 22:10 jaypoulz

@saschagrunert CI is happy :face_holding_back_tears:

jaypoulz avatar Oct 23 '25 22:10 jaypoulz

Or at least it was when I wrote that comment - I decided to update the READMEs. :crossed_fingers: I didn't break anything

jaypoulz avatar Oct 23 '25 23:10 jaypoulz

/test okd-scos-e2e-aws-ovn

saschagrunert avatar Oct 24 '25 08:10 saschagrunert

@saschagrunert I think I hit all of your comments. I've also asked pacemaker expert CLumens from the RHEL team to make sure I wasn't misrepresenting anything in the new spec.

jaypoulz avatar Oct 28 '25 00:10 jaypoulz

/retest

saschagrunert avatar Oct 28 '25 08:10 saschagrunert

/retest

saschagrunert avatar Oct 29 '25 07:10 saschagrunert

Since @saschagrunert has said this is good from his side, I'll now take over the API review. Since it's shift week, I'm not expecting to pick this up until Monday

JoelSpeed avatar Oct 29 '25 10:10 JoelSpeed

Sounds good to me! :)

jaypoulz avatar Oct 29 '25 13:10 jaypoulz

/retest-required

jaypoulz avatar Nov 04 '25 19:11 jaypoulz

/retest-required

jaypoulz avatar Nov 04 '25 19:11 jaypoulz

/retest-required

jaypoulz avatar Nov 05 '25 22:11 jaypoulz

📝 Walkthrough

Walkthrough

Adds API group etcd.openshift.io v1alpha1 and registers it with the scheme. Introduces a cluster-scoped PacemakerCluster CRD (singleton required to be named "cluster") with a status subresource and a detailed status model (cluster, node, resource, fencing-agent types and enums) plus extensive OpenAPI validations. Adds CRD manifests (multiple feature-gated variants), generated deepcopy, OpenAPI/Swagger doc and schema wiring, feature-gated CRD listings, tests and test fixtures, a Makefile test target, README documentation, payload-manifest inclusion, and wiring in global install to call etcd.Install.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'Add PacemakerStatus CRD for two-node fencing' directly references the main change: introduction of a PacemakerCluster CRD for managing Pacemaker cluster health.
Description check ✅ Passed The PR description comprehensively explains the purpose, context, and dependencies of the PacemakerStatus CRD addition for dual-replica etcd deployments.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • [ ] 📝 Generate docstrings

[!WARNING] There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: build linters: unable to load custom analyzer "kubeapilinter": tools/_output/bin/kube-api-linter.so, plugin: not implemented The command is terminated due to an error: build linters: unable to load custom analyzer "kubeapilinter": tools/_output/bin/kube-api-linter.so, plugin: not implemented


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] avatar Dec 02 '25 21:12 coderabbitai[bot]

/retest-required

jaypoulz avatar Dec 17 '25 18:12 jaypoulz

/retest-required

jaypoulz avatar Jan 08 '26 18:01 jaypoulz

The verification of this patch depends on https://github.com/kubernetes-sigs/controller-tools/pull/1324/files The validation order for the conditions are not guaranteed to be in order. This patches fixes that.

jaypoulz avatar Jan 15 '26 19:01 jaypoulz

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

In response to this:

Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Not gated because it's only used by CEO when two-node has transitioned.

Works in conjunction with https://github.com/openshift/cluster-etcd-operator/pull/1487

Depends on https://github.com/kubernetes-sigs/controller-tools/pull/1324

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Jan 15 '26 19:01 openshift-ci-robot

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

In response to this:

Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Not gated because it's only used by CEO when two-node has transitioned.

Works in conjunction with https://github.com/openshift/cluster-etcd-operator/pull/1487

Depends on https://github.com/kubernetes-sigs/controller-tools/pull/1324, which in turn needs a bump for go 1.25 In the meantime, I've vendored: https://github.com/jaypoulz/controller-tools/tree/strict-validation-ordering-v0.18

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Jan 15 '26 20:01 openshift-ci-robot

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

In response to this:

Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Not gated because it's only used by CEO when two-node has transitioned.

Works in conjunction with https://github.com/openshift/cluster-etcd-operator/pull/1487

Depends on https://github.com/kubernetes-sigs/controller-tools/pull/1324, which in turn needs a bump for go 1.25 In the meantime, I've vendored: https://github.com/openshift/kubernetes-sigs-controller-tools/pull/33

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Jan 15 '26 20:01 openshift-ci-robot