cluster-logging-operator LOG-5426: Fix stale metrics in telemetry

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma /assign @alanconway

Known issues / ToDo

[x] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.
[x] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
- The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)
[x] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.
- I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model
[x] It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.
[x] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:

Links

JIRA: LOG-5426

Apr 22 '24 18:04 xperimental

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma /assign @alanconway

Known issues / ToDo

[ ] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.

[ ] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.

[ ] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.

[ ] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:

Links

JIRA: LOG-5426

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Apr 22 '24 18:04 openshift-ci-robot

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma /assign @alanconway

Known issues / ToDo

[ ] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.

[ ] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.

[ ] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.

[ ] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:

Links

JIRA: LOG-5426

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Apr 22 '24 18:04 openshift-ci-robot

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma /assign @alanconway

Known issues / ToDo

[ ] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.

[x] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.

[ ] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.

[ ] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:

Links

JIRA: LOG-5426

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Apr 23 '24 18:04 openshift-ci-robot

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma /assign @alanconway

Known issues / ToDo

[ ] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.

[x] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.

[ ] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.

I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model

[ ] It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.

[ ] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:

Links

JIRA: LOG-5426

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Apr 23 '24 18:04 openshift-ci-robot

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma /assign @alanconway

Known issues / ToDo

[ ] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.

[ ] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.

The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)

[ ] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.

I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model

[ ] It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.

[ ] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:

Links

JIRA: LOG-5426

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Apr 23 '24 18:04 openshift-ci-robot

Took another stab at some of the questions today and updated the PR code and description to reflect that.

Apr 23 '24 18:04 xperimental

/retest-required

Apr 23 '24 21:04 xperimental

/retest-required

Apr 24 '24 13:04 xperimental

Had a short chat about this topic with @cahartma . There are still some open points which I need to address, so this PR will change again on Monday.

Apr 26 '24 19:04 xperimental

/hold

Apr 26 '24 23:04 jcantrill

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma /assign @alanconway

Known issues / ToDo

[x] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.

[ ] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.

The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)

[ ] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.

I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model

[ ] It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.

[ ] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:

Links

JIRA: LOG-5426

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Apr 29 '24 16:04 openshift-ci-robot

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma /assign @alanconway

Known issues / ToDo

[x] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.

[x] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.

The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)

[x] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.

I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model

[ ] It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.

[ ] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:

Links

JIRA: LOG-5426

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Apr 29 '24 16:04 openshift-ci-robot

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma /assign @alanconway

Known issues / ToDo

[x] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.

[x] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.

The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)

[x] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.

I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model

[x] It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.

[ ] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:

Links

JIRA: LOG-5426

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Apr 29 '24 17:04 openshift-ci-robot

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma /assign @alanconway

Known issues / ToDo

[x] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.

[x] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.

The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)

[x] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.

I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model

[x] It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.

[x] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:

Links

JIRA: LOG-5426

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Apr 29 '24 18:04 openshift-ci-robot

Ready for review again, now already squashed as discussed :slightly_smiling_face:

Apr 29 '24 18:04 xperimental

It looks to me as if builds fail because of the recent change to Go 1.21 . There's already a PR to fix that issue, so I'll probably need a rebase after that one merges. :crossed_fingers:

Apr 29 '24 18:04 xperimental

/approve

Apr 30 '24 16:04 jcantrill

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alanconway, jcantrill, xperimental

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [alanconway,jcantrill,xperimental]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

May 09 '24 14:05 openshift-ci[bot]

/unhold

May 14 '24 15:05 xperimental

@xperimental: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

May 14 '24 17:05 openshift-ci[bot]