managed-cluster-config Added package-operator metrics

What type of PR is this?

(bug/feature/cleanup/documentation)

What this PR does / why we need it?

Add package-operator metrics to the list of scraped metrics so that alerting can be set up.

Which Jira/Github issue(s) this PR fixes?

https://issues.redhat.com/browse/PKO-10

Special notes for your reviewer:

Pre-checks (if applicable):

[ ] Tested latest changes against a cluster
[ ] Included documentation changes with PR
[ ] If this is a new object that is not intended for the FedRAMP environment (if unsure, please reach out to team FedRAMP), please exclude it with:
```
matchExpressions:
- key: api.openshift.com/fedramp
  operator: NotIn
  values: ["true"]
```

Aug 27 '24 11:08 robshelly

/retest

Oct 02 '24 08:10 Nanyte25

/retest

Nov 18 '24 10:11 kostola

@robshelly I think you need to re-run make and check in the resulting changes.

From the pipeline logs:

10:52:01 Running 'make' caused changes.  Run 'make' and commit changes to the PR to try again. If you're removing ACM policies, you need to remove the generated file from deploy/acm-policies/50-GENERATED-* before running 'make'.

Nov 18 '24 12:11 erdii

/assign @zmird-r

Mar 04 '25 10:03 robshelly

@robshelly can you give us a rough overview of the cardinality / count of time series per cluster that this would ingest additionally into the osd tenant? Where would the alerts land?

From the rhobs team that manages the tenant:

in general a few 100s of timeseries won’t cause issues. But if it’s a change in the 1000s range, just a small heads up in #rhobs-support works

Mar 14 '25 14:03 typeid

@typeid This is the estimate I provided to the RHOBS team. Per cluster:

- fewer than about 20 vectors from package-operator
- fewer than about 15 vectors per package deployed
- on OSD clusters there are currently 2 packages
- on hypershift clusters there are currently 7 packages

The alerts are for the LPSRE team to monitor package operator via Pagerduty.
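
The per-cluster estimate can be turned into a quick upper bound. This is a minimal sketch, assuming the quoted figures are ceilings (at most 20 series from package-operator itself and at most 15 per deployed package); the function name is illustrative and not part of any existing tooling.

```python
def max_series_per_cluster(packages: int, base: int = 20, per_package: int = 15) -> int:
    """Upper bound on the time series one cluster would add.

    base and per_package are the ceilings quoted in this thread (assumed, not measured).
    """
    if packages < 0:
        raise ValueError("packages must be non-negative")
    return base + per_package * packages


# Package counts as stated in this thread: 2 on OSD clusters, 7 on hypershift clusters.
osd_bound = max_series_per_cluster(2)
hypershift_bound = max_series_per_cluster(7)
print(f"OSD: <= {osd_bound} series, hypershift: <= {hypershift_bound} series")
```

Because both inputs are ceilings, the real count per cluster may be lower than these bounds.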

Mar 19 '25 12:03 robshelly

So roughly 50 time series per classic cluster and ~100 time series per HCP in the current state? That would then be roughly 100k time series for the whole fleet of classic clusters? Do we even already have alerting for the osd-tenant?

@saswatamcode correct me if I'm wrong, but that sounds like too much.

@robshelly @erdii feel free to put a sync in my calendar and we can work out the alerting for PKO, there's a good chance we don't need to go through the osd rhobs tenant for this.

Mar 19 '25 13:03 typeid

Had a chat with @saswatamcode. RHOBS could scale to handle the extra series - we would have to tell them in advance when the metrics land, not a problem in general.

However, the current utilization for osd-observatorium-prod is at 600k series (with 3x replication), so adding 100k series would mean an extra 300k series once replicated 3x. That is a 50% increase for a single operator. IMHO we should rethink and update the cardinality of what we want to ship.
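
The replication arithmetic can be checked with a short sketch. It uses only the figures quoted in this thread (600k series currently stored, including 3x replication, and roughly 100k new series before replication); the function name is hypothetical.

```python
def replicated_increase(current_stored: int, new_series: int, replication: int = 3):
    """Return (extra stored series, fractional increase) after replication.

    current_stored is assumed to already include replication, as stated in the thread.
    """
    added = new_series * replication
    return added, added / current_stored


added, fraction = replicated_increase(600_000, 100_000)
print(f"{added:,} extra stored series, a {fraction:.0%} increase")
```

Lowering either the per-package series count or the number of clusters scraped shrinks `new_series` and, with it, the increase.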

Mar 27 '25 09:03 typeid

@robshelly: all tests passed!

Apr 02 '25 11:04 openshift-ci[bot]

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: robshelly, typeid

Needs approval from an approver in each of these files:

~~OWNERS~~ [typeid]

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

Apr 02 '25 11:04 openshift-ci[bot]