envoy icon indicating copy to clipboard operation
envoy copied to clipboard

stats: Eliminate duplicate tag-name check to allow multiple regexes to populate the same tag name.

Open jmarantz opened this issue 3 years ago • 14 comments

Commit Message: Currently, there can be multiple built-in regexes targeting the same tag name, and in fact there's at least one case where this occurs:

https://github.com/envoyproxy/envoy/blob/466e78586afaeb8ecb5f92351cd5ffdee3871f49/source/common/config/well_known_names.cc#L133 https://github.com/envoyproxy/envoy/blob/466e78586afaeb8ecb5f92351cd5ffdee3871f49/source/common/config/well_known_names.cc#L136

This change prevents a second tag value for a given name being from being extracted, to meet Prometheus' requirements.

Having two alternate ways of generating the same tag value allows them to be expressed using two distinct regexes, which are easier to understand, and possible for the infrastructure to optimize with the prefix-map. This situation also occurs with Istio/Wasm, which for reasons that elude me, generate stats with two very different syntaxes both meaning HTTP Response Code, and adds those extractors using configuration.

An alternate approach is to add complexity to the regex processing to allow matches in an ORed regex, which is a bit confusing, and results in regexes that cannot be optimized well by our current system. There is no one prefix that can be used to reduce the set of regexes that need to be evaluated against every stat, and the long regexes with captures are hard for humans to read. See https://github.com/envoyproxy/envoy/pull/22791

The disadvantage of allowing multiple regexes to generate the same tag, is that it may create more scenarios where a stats sink like Prometheus may be given multiple tags with the same name, and it would be good to get some notion that this is OK. Currently such cases would be rejected during process startup (for CLI-based tags) or during config processing.

I opened this up for review to initiate this discussion, but want to make sure various stakeholders have a chance to weigh in. Though no protobufs were changed structurally in this PR, it's kind of an API change (with .proto comments) and should probably be approved as one.

Additional Description: Risk Level: medium Testing: //test/... Docs Changes: changed comments in proto file that previously indicated dups were not allowed Release Notes: Platform Specific Features: Fixes: https://github.com/envoyproxy/envoy/issues/22591

jmarantz avatar Sep 07 '22 17:09 jmarantz

As a reminder, PRs marked as draft will not be automatically assigned reviewers, or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/23024 was opened by jmarantz.

see: more, trace.

/retest

jmarantz avatar Sep 08 '22 00:09 jmarantz

Retrying Azure Pipelines: Retried failed jobs in: envoy-presubmit

:cat:

Caused by: a https://github.com/envoyproxy/envoy/pull/23024#issuecomment-1240063877 was created by @jmarantz.

see: more, trace.

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/). envoyproxy/api-shepherds assignee is @markdroth CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/23024 was synchronize by jmarantz.

see: more, trace.

@zirain FYI

jmarantz avatar Sep 08 '22 20:09 jmarantz

Code LGTM. I think we need to differentiate the output with duplicate tags from having duplicate tag extractors. Collisions only happen if some metric happens to be extracted twice with the same tag (for the Istio case above, that should never happen). So it's OK to permit extractors, but we may need to harden the exporter if it the other side fails non-gracefully (e.g. prom crashes when seeing duplicate labels).

kyessenov avatar Sep 09 '22 17:09 kyessenov

I think if we need to defend Prom et al from duplicate tags we should keep a set during tag extraction. It is a bit performance-sensitive so I'd prefer not to if it's not needed.

jmarantz avatar Sep 09 '22 17:09 jmarantz

I've thought of a way to add duplicate removals that shouldn't be too expensive in terms of CPU. It would be great to hear from Prometheus experts (@ggreenway) about whether this is the right thing, or whether redundant or conflicting values would be OK to leave in.

jmarantz avatar Sep 11 '22 14:09 jmarantz

lgtm, tanks @jmarantz

zirain avatar Sep 12 '22 12:09 zirain

The labels in a prometheus metric MUST be unique, according to the spec.

ggreenway avatar Sep 12 '22 15:09 ggreenway

Thanks -- in that case I think the existing state risks miss-populating Prom in one case with two ways to populate HTTP_CONN_MANAGER_PREFIX and also needs to be fixed. I'll revert to draft, and make that part of this PR before re-opening.

What do you think the behavior should be? First value wins? Last value? I'm thinking last-value where we evaluate config-based tag specifiers and command-line based tag-specifiers after the built-in ones so users can override built-in behavior.

jmarantz avatar Sep 12 '22 16:09 jmarantz

I'd say "unspecified value wins" and increment a metric or log an error; this is a configuration error, even though it can't be detected at config load time. Defining an order limits future implementation choices.

ggreenway avatar Sep 12 '22 16:09 ggreenway

Sounds good; thanks Greg.

jmarantz avatar Sep 12 '22 16:09 jmarantz

@ggreenway can you take a look at the approach? It needs a doc update and another test or two, but I wanted to see if you think this is the right path before doing all that. I think the CI failure is spurious.

In this impl, the first time we do a successful tag-extraction of a stat, for a particular tag name, that will prevent the other tag extractors for that name from even running. That means that if a stat-name matches both extractors, the value will be left in the stat name.

I expect this to generally never happen, so I think it doesn't matter that much, but I made the test sort of agnostic to the order of tag extractions, and that means several possible tag-values and tag-extracted-stat-names.

jmarantz avatar Sep 16 '22 04:09 jmarantz

/retest

jmarantz avatar Sep 24 '22 09:09 jmarantz

Retrying Azure Pipelines: Retried failed jobs in: envoy-presubmit

:cat:

Caused by: a https://github.com/envoyproxy/envoy/pull/23024#issuecomment-1256916417 was created by @jmarantz.

see: more, trace.

/retest

jmarantz avatar Sep 24 '22 16:09 jmarantz

Retrying Azure Pipelines: Retried failed jobs in: envoy-presubmit

:cat:

Caused by: a https://github.com/envoyproxy/envoy/pull/23024#issuecomment-1257011151 was created by @jmarantz.

see: more, trace.

ping @markdroth can you take a look at this?

zirain avatar Sep 27 '22 02:09 zirain

@adisuissa can you do an API stamp? The proto change is just clarifying ambiguous behavior with an edited comment.

jmarantz avatar Sep 28 '22 16:09 jmarantz