client_java icon indicating copy to clipboard operation
client_java copied to clipboard

UTF-8 support in metric and label names

Open fedetorres93 opened this issue 1 year ago • 7 comments

Adds UTF-8 support for metric and label names.

These changes are based on the work done on the Prometheus common libraries here and here

  • The prometheus-metrics-exposition-formats module will use the new quoting syntax {"foo"} iff the metric does not conform to the legacy name format (foo{})
  • The prometheus-metrics-model module has a new flag (NameValidationScheme) that determines if validation is done using the legacy or the UTF-8 scheme. This flag can be set via a property in the properties file.
  • Scrapers can announce via content negotiation that they support UTF-8 names by adding escaping=allow-utf-8 in the Accept header. In cases where UTF-8 is not available, metric providers can be configured to escape names in a few different ways: values (U__ UTF value escaping for perfect round-tripping), underscores (all invalid chars become _), dots (dots become _dot_, _ becomes __, all other values become ___). Escaping can either be a global default (PrometheusNaming.nameEscapingScheme) or can also be specified in Accept header with the escaping= term, which can be allow-utf-8 (for UTF-8-compatible), underscores, dots, or values. This should still be a noop for existing configurations because scrapers will not be passing the escaping key in the Accept header. Existing functionality is maintained.
  • The prometheus-metrics-exporter-pushgateway module will escape UTF-8 grouping keys in the URL path used when pushing metrics (see https://github.com/prometheus/pushgateway/pull/689)

Work towards https://github.com/prometheus/prometheus/issues/13095

fedetorres93 avatar Jan 09 '25 20:01 fedetorres93

Given the ongoing discussion about unit suffixes for OM 2.0 (https://github.com/prometheus/OpenMetrics/issues/286), I think we can take this UTF-8 work as a basis and then add the necessary changes to comply with the final consensus on suffixes.

fedetorres93 avatar Jan 09 '25 20:01 fedetorres93

Given the ongoing discussion about unit suffixes for OM 2.0 (prometheus/OpenMetrics#286), I think we can take this UTF-8 work as a basis and then add the necessary changes to comply with the final consensus on suffixes.

@fstab are you ok with that?

zeitlinger avatar Jan 22 '25 11:01 zeitlinger

Update: The client_java maintainers just decided that we'll wait 6 more weeks, until 1 April 2025. If we have OpenMetrics 2.0 by then we will implement that. If OpenMetrics 2.0 is still under discussion in 6 weeks we will merge this PR.

fstab avatar Feb 14 '25 15:02 fstab

@fstab Good to know, thanks for the update!

fedetorres93 avatar Feb 14 '25 15:02 fedetorres93

Hello @fstab, I just wanted to follow-up on your last comment. Seems like OM 2.0 is still under discussion, so do you think now is a good time to reconsider merging this PR?

fedetorres93 avatar Apr 10 '25 20:04 fedetorres93

We have our client_java community call tomorrow, and can discuss this there. If you have time, feel free to join. See the public Prometheus calendar linked here: https://prometheus.io/community/

fstab avatar Apr 10 '25 20:04 fstab

@fedetorres93 thanks for the PR!

let me start with some high level questions before an in-depth review:

  • I think we need a setting for https://prometheus.io/docs/guides/utf8/#otlp-metrics in ExporterOpenTelemetryProperties
  • the statics in PrometheusNaming should be final - e.g. the scrape handler should pass the escaping as an argument

zeitlinger avatar Apr 15 '25 11:04 zeitlinger

@zeitlinger Thanks for the feedback. I made the statics in PrometheusNaming final as you suggested.

About adding a setting in ExporterOpenTelemetryProperties, IIUC that module is translating from Prometheus to OTel format, so it should continue working as it is now. Prometheus' UTF-8 configs affect translations from OTel to Prometheus.

fedetorres93 avatar Jul 25 '25 20:07 fedetorres93

PrometheusNaming.nameEscapingScheme

I can't find that

zeitlinger avatar Aug 04 '25 16:08 zeitlinger

for formatting, checkstyle issues see CONTRIBUTING.md

zeitlinger avatar Aug 04 '25 17:08 zeitlinger

@zeitlinger Thank you very much for the feedback.

I've addressed your comments and updated the original PR description to reflect the current variable names, sorry for any confusion caused during your review.

fedetorres93 avatar Aug 07 '25 19:08 fedetorres93

From review with @fstab

Conceptually

  1. we should consider getting rid of the validation scheme: the java client is mostly used indirectly, e.g. from spring, JMX exporter, OTel SDK - where it's strange to have to pass a sys property to unlock UTF-8 chars in metric names and labels
  2. sanitizeMetricName should not replace UTF-8 chars anymore
  3. allow-utf-8 escaping in text formats must still escape some characters like whitespace, newlines, { - this is currently not done

Minor issues

  1. legacy is an ambiguous name - better to have nonUtf8 or similar
  2. move escaping related methods (escapeMetricSnapshot) to separate utils class

Already fixed

  1. Promethus name check now uses underscores escaping scheme

zeitlinger avatar Aug 11 '25 10:08 zeitlinger

  • legacy is an ambiguous name - better to have nonUtf8 or similar

"legacy" is the term used in upstream prometheus libraries, for better or worse. It's short for "legacy valid prometheus character set". "nonutf8" doesn't quite capture the idea either, because "." is not really a "UTF-8" character.

ywwg avatar Aug 11 '25 14:08 ywwg

allow-utf-8 escaping in text formats must still escape some characters like whitespace, newlines, { - this is currently not done

@zeitlinger AFAIU only backslashes (\) newlines (\n) and double quotes (") should be escaped, other UTF-8 characters are valid with the new quoting syntax.

fedetorres93 avatar Aug 11 '25 21:08 fedetorres93

allow-utf-8 escaping in text formats must still escape some characters like whitespace, newlines, { - this is currently not done

@zeitlinger AFAIU only backslashes (\) newlines (\n) and double quotes (") should be escaped, other UTF-8 characters are valid with the new quoting syntax.

yes, this is also the outcome of the meeting yesterday :smile:

zeitlinger avatar Aug 12 '25 07:08 zeitlinger