stats: Eliminate duplicate tag-name check to allow multiple regexes to populate the same tag name.
Commit Message: Currently, there can be multiple built-in regexes targeting the same tag name, and in fact there's at least one case where this occurs:
https://github.com/envoyproxy/envoy/blob/466e78586afaeb8ecb5f92351cd5ffdee3871f49/source/common/config/well_known_names.cc#L133 https://github.com/envoyproxy/envoy/blob/466e78586afaeb8ecb5f92351cd5ffdee3871f49/source/common/config/well_known_names.cc#L136
This change prevents a second tag value for a given name being from being extracted, to meet Prometheus' requirements.
Having two alternate ways of generating the same tag value allows them to be expressed using two distinct regexes, which are easier to understand, and possible for the infrastructure to optimize with the prefix-map. This situation also occurs with Istio/Wasm, which for reasons that elude me, generate stats with two very different syntaxes both meaning HTTP Response Code, and adds those extractors using configuration.
An alternate approach is to add complexity to the regex processing to allow matches in an ORed regex, which is a bit confusing, and results in regexes that cannot be optimized well by our current system. There is no one prefix that can be used to reduce the set of regexes that need to be evaluated against every stat, and the long regexes with captures are hard for humans to read. See https://github.com/envoyproxy/envoy/pull/22791
The disadvantage of allowing multiple regexes to generate the same tag, is that it may create more scenarios where a stats sink like Prometheus may be given multiple tags with the same name, and it would be good to get some notion that this is OK. Currently such cases would be rejected during process startup (for CLI-based tags) or during config processing.
I opened this up for review to initiate this discussion, but want to make sure various stakeholders have a chance to weigh in. Though no protobufs were changed structurally in this PR, it's kind of an API change (with .proto comments) and should probably be approved as one.
Additional Description: Risk Level: medium Testing: //test/... Docs Changes: changed comments in proto file that previously indicated dups were not allowed Release Notes: Platform Specific Features: Fixes: https://github.com/envoyproxy/envoy/issues/22591
As a reminder, PRs marked as draft will not be automatically assigned reviewers, or be handled by maintainer-oncall triage.
Please mark your PR as ready when you want it to be reviewed!
/retest
Retrying Azure Pipelines: Retried failed jobs in: envoy-presubmit
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).
@zirain FYI
Code LGTM. I think we need to differentiate the output with duplicate tags from having duplicate tag extractors. Collisions only happen if some metric happens to be extracted twice with the same tag (for the Istio case above, that should never happen). So it's OK to permit extractors, but we may need to harden the exporter if it the other side fails non-gracefully (e.g. prom crashes when seeing duplicate labels).
I think if we need to defend Prom et al from duplicate tags we should keep a set during tag extraction. It is a bit performance-sensitive so I'd prefer not to if it's not needed.
I've thought of a way to add duplicate removals that shouldn't be too expensive in terms of CPU. It would be great to hear from Prometheus experts (@ggreenway) about whether this is the right thing, or whether redundant or conflicting values would be OK to leave in.
lgtm, tanks @jmarantz
The labels in a prometheus metric MUST be unique, according to the spec.
Thanks -- in that case I think the existing state risks miss-populating Prom in one case with two ways to populate HTTP_CONN_MANAGER_PREFIX and also needs to be fixed. I'll revert to draft, and make that part of this PR before re-opening.
What do you think the behavior should be? First value wins? Last value? I'm thinking last-value where we evaluate config-based tag specifiers and command-line based tag-specifiers after the built-in ones so users can override built-in behavior.
I'd say "unspecified value wins" and increment a metric or log an error; this is a configuration error, even though it can't be detected at config load time. Defining an order limits future implementation choices.
Sounds good; thanks Greg.
@ggreenway can you take a look at the approach? It needs a doc update and another test or two, but I wanted to see if you think this is the right path before doing all that. I think the CI failure is spurious.
In this impl, the first time we do a successful tag-extraction of a stat, for a particular tag name, that will prevent the other tag extractors for that name from even running. That means that if a stat-name matches both extractors, the value will be left in the stat name.
I expect this to generally never happen, so I think it doesn't matter that much, but I made the test sort of agnostic to the order of tag extractions, and that means several possible tag-values and tag-extracted-stat-names.
/retest
Retrying Azure Pipelines: Retried failed jobs in: envoy-presubmit
/retest
Retrying Azure Pipelines: Retried failed jobs in: envoy-presubmit
ping @markdroth can you take a look at this?
@adisuissa can you do an API stamp? The proto change is just clarifying ambiguous behavior with an edited comment.