LOG-5426: Fix stale metrics in telemetry
Description
This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.
No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.
/cc @cahartma /assign @alanconway
Known issues / ToDo
- [x] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the
log_forwarder_output_infometric. - [x] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
- The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)
- [x] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the
Listcall is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.- I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model
- [x] It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.
- [x] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:
Links
- JIRA: LOG-5426
@xperimental: This pull request references LOG-5426 which is a valid jira issue.
In response to this:
Description
This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like
GaugeVec), which keep the previous state of the metrics.No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.
/cc @cahartma /assign @alanconway
Known issues / ToDo
- [ ] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the
log_forwarder_output_infometric.- [ ] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
- [ ] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the
Listcall is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.- [ ] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:
Links
- JIRA: LOG-5426
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@xperimental: This pull request references LOG-5426 which is a valid jira issue.
In response to this:
Description
This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like
GaugeVec), which keep the previous state of the metrics.No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.
/cc @cahartma /assign @alanconway
Known issues / ToDo
- [ ] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the
log_forwarder_output_infometric.- [ ] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
- [ ] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the
Listcall is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.- [ ] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:
Links
- JIRA: LOG-5426
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@xperimental: This pull request references LOG-5426 which is a valid jira issue.
In response to this:
Description
This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like
GaugeVec), which keep the previous state of the metrics.No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.
/cc @cahartma /assign @alanconway
Known issues / ToDo
- [ ] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the
log_forwarder_output_infometric.- [x] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
- [ ] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the
Listcall is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.- [ ] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:
Links
- JIRA: LOG-5426
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@xperimental: This pull request references LOG-5426 which is a valid jira issue.
In response to this:
Description
This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like
GaugeVec), which keep the previous state of the metrics.No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.
/cc @cahartma /assign @alanconway
Known issues / ToDo
- [ ] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the
log_forwarder_output_infometric.- [x] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
- [ ] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the
Listcall is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.- I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model
- [ ] It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.
- [ ] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:
Links
- JIRA: LOG-5426
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@xperimental: This pull request references LOG-5426 which is a valid jira issue.
In response to this:
Description
This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like
GaugeVec), which keep the previous state of the metrics.No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.
/cc @cahartma /assign @alanconway
Known issues / ToDo
- [ ] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the
log_forwarder_output_infometric.- [ ] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
- The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)
- [ ] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the
Listcall is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.- I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model
- [ ] It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.
- [ ] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:
Links
- JIRA: LOG-5426
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Took another stab at some of the questions today and updated the PR code and description to reflect that.
/retest-required
/retest-required
Had a short chat about this topic with @cahartma . There are still some open points which I need to address, so this PR will change again on Monday.
/hold
@xperimental: This pull request references LOG-5426 which is a valid jira issue.
In response to this:
Description
This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like
GaugeVec), which keep the previous state of the metrics.No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.
/cc @cahartma /assign @alanconway
Known issues / ToDo
- [x] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the
log_forwarder_output_infometric.- [ ] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
- The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)
- [ ] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the
Listcall is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.- I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model
- [ ] It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.
- [ ] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:
Links
- JIRA: LOG-5426
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@xperimental: This pull request references LOG-5426 which is a valid jira issue.
In response to this:
Description
This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like
GaugeVec), which keep the previous state of the metrics.No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.
/cc @cahartma /assign @alanconway
Known issues / ToDo
- [x] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the
log_forwarder_output_infometric.- [x] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
- The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)
- [x] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the
Listcall is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.- I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model
- [ ] It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.
- [ ] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:
Links
- JIRA: LOG-5426
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@xperimental: This pull request references LOG-5426 which is a valid jira issue.
In response to this:
Description
This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like
GaugeVec), which keep the previous state of the metrics.No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.
/cc @cahartma /assign @alanconway
Known issues / ToDo
- [x] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the
log_forwarder_output_infometric.- [x] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
- The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)
- [x] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the
Listcall is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.- I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model
- [x] It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.
- [ ] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:
Links
- JIRA: LOG-5426
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@xperimental: This pull request references LOG-5426 which is a valid jira issue.
In response to this:
Description
This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like
GaugeVec), which keep the previous state of the metrics.No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.
/cc @cahartma /assign @alanconway
Known issues / ToDo
- [x] One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the
log_forwarder_output_infometric.- [x] What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
- The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)
- [x] My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the
Listcall is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.- I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model
- [x] It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.
- [x] Once everything is cleared up, this needs to be squashed :slightly_smiling_face:
Links
- JIRA: LOG-5426
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Ready for review again, now already squashed as discussed :slightly_smiling_face:
It looks to me as if builds fail because of the recent change to Go 1.21 . There's already a PR to fix that issue, so I'll probably need a rebase after that one merges. :crossed_fingers:
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alanconway, jcantrill, xperimental
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [alanconway,jcantrill,xperimental]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/unhold
@xperimental: all tests passed!
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.