[Question] Is it possible to provide a time_since_last_update metric in xDS subscription statistics?

Open cancecen opened this issue 3 years ago • 3 comments

I am aware of the update_time metric. My team wants to use it to observe whether instances in our fleet are receiving xDS updates in a timely fashion, which is especially helpful in gray failure scenarios. What we really want to alarm on is time_since_last_update.

However, update_time being an epoch timestamp complicates monitoring and alerting for us. I suspect it is a complication in other alerting systems as well. For instance:

  1. We are able to monitor computed metrics in our system, but the system does not support a System.now() metric, so we cannot do a System.now() - update_time computation.
  2. After a node goes down, our metrics system caches the gauges it last reported, and they continue to appear on the timeline for a while. So it is difficult to distinguish nodes going down from nodes whose xDS updates have gone stale for other reasons. If we had a time_since_last_update metric instead, this wouldn't be a problem.
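
To make the two points above concrete, here is a minimal Python sketch of the computation we would like to express on the monitoring side. It assumes update_time is reported in milliseconds since the epoch; the function names, the `last_report_ms` input (when the metrics system last received a sample from the node), and the thresholds are hypothetical and only for illustration, not part of Envoy or of any particular metrics system.

```python
import time


def time_since_last_update_ms(update_time_ms, now_ms=None):
    """Elapsed milliseconds since the last xDS update.

    Assumes update_time_ms is an epoch timestamp in milliseconds.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Clamp at zero so minor clock skew never yields a negative duration.
    return max(0, now_ms - update_time_ms)


def classify_node(update_time_ms, last_report_ms, now_ms,
                  stale_after_ms, report_timeout_ms):
    """Separate 'node stopped reporting' from 'xDS updates went stale'.

    last_report_ms is a hypothetical input: the time the metrics system
    last received any sample from the node.
    """
    if now_ms - last_report_ms > report_timeout_ms:
        # The gauge value is only a cached copy; the node itself is silent.
        return "node_down"
    if time_since_last_update_ms(update_time_ms, now_ms) > stale_after_ms:
        return "xds_stale"
    return "ok"
```

With a native time_since_last_update metric, the subtraction would already have been done by the node, and a cached gauge from a dead node would stop growing, which is what makes the two cases distinguishable without the extra `last_report_ms` input.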

My main question is - does this metric look reasonable to you, or is there a reason this was not provided in the first place?

cancecen avatar Aug 11 '22 20:08 cancecen

@kyessenov since you labeled this issue, does this mean it is doable and there is no strong reason against it (in terms of the current design/flow)? Knowing that would help, and perhaps we can contribute.

cancecen avatar Aug 22 '22 21:08 cancecen

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Sep 22 '22 00:09 github-actions[bot]

Envoy metrics have evolved over time and don't necessarily follow modern best practices. I think it's reasonable to propose usability improvements in the area of xDS status monitoring.

kyessenov avatar Sep 22 '22 03:09 kyessenov

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 22 '22 08:10 github-actions[bot]

I am working on this.

cancecen avatar Oct 22 '22 19:10 cancecen

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 22 '22 00:11 github-actions[bot]

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

github-actions[bot] avatar Nov 29 '22 00:11 github-actions[bot]