client_python Discrepancy between metrics names in prometheus and metadata

Hello,

When creating a Counter metric with client_python 0.4.0 - 0.16.0:

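    from prometheus_client import CollectorRegistry, Counter

    MY_REGISTRY = CollectorRegistry()  # assumed; the post does not show how MY_REGISTRY is created
    my_metric_count = Counter(name='my_metric_count',
                              documentation='My helpful description',
                              registry=MY_REGISTRY)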

The metric will show in Prometheus as my_metric_count_total. That is fine in itself; however, the same metric shows up under http://prometheus:9090/api/v1/metadata with the original name:

    "my_metric_count": [
      {
        "type": "counter",
        "help": "My helpful description",
        "unit": ""
      }
    ],

If one creates the metric with '_total' suffix:

    my_metric_count = Counter(name='my_metric_count_total',
                              documentation='My helpful description',
                              registry=MY_REGISTRY)

It will show in Prometheus under that exact name, my_metric_count_total, but the name in /api/v1/metadata will now be truncated:

    "my_metric_count": [
      {
        "type": "counter",
        "help": "My helpful description",
        "unit": ""
      }
    ],

This breaks certain Telegraf input plugins that rely on the Prometheus metadata API and remote write to fetch the metric type. Versions of this library starting with 0.4.0 exhibit this behavior; that is when metric names began to be munged with '_total' for compatibility with OpenMetrics.

Thank you.

Feb 22 '23 23:02 sbagneris

Hmm, interesting issue. Just to make sure I am understanding correctly, I think your first example metadata payload is missing _total in the map key?

I am not sure on a solution right now. I do wonder if it is possible that telegraf is doing something odd with _count by assuming that it is a histogram. Would something like just me_metric work properly with and without the suffix?

Feb 28 '23 00:02 csmarchbanks

> Hmm, interesting issue. Just to make sure I am understanding correctly, I think your first example metadata payload is missing _total in the map key?

Depends on your point of view, I suppose. :-) The first example metadata is either missing the _total suffix or simply carries the original name passed at creation in the Python code, name='my_metric_count'. Either way, the metric name when browsing the Prometheus UI is my_metric_count_total, while the metadata API call exposes it as my_metric_count. This specifically breaks the Telegraf input plugin I am using, because it tries to match metrics gathered from Prometheus via remote write against metric types resolved through the metadata API call. Since the metric names don't match, the input plugin is unable to resolve the types of the gathered metrics and drops them.

> I am not sure on a solution right now. I do wonder if it is possible that telegraf is doing something odd with _count by assuming that it is a histogram. Would something like just me_metric work properly with and without the suffix?

The metric name is unimportant. me_metric will show as me_metric_total in the Prometheus UI but as me_metric in the /metadata API call output. Telegraf is downstream of this and therefore does not influence anything. Whether or not Telegraf is present and gathering metrics from Prometheus does not change the outcome.

Mar 02 '23 01:03 sbagneris

Ok, I think I misunderstood your initial post a bit. This behavior is by design and is part of OpenMetrics. In OpenMetrics, the metadata name is the name of the MetricFamily, and the names you query are for Metrics belonging to the MetricFamily. See process_cpu_seconds in https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#overall-structure. This means that the name in the metadata will not include any suffixes, and is especially apparent for Histograms where you would not have different metadata for each of _bucket, _count, and _sum.

I believe that Telegraf (and any other consumer of the metadata API) needs to handle cases where the suffixes for Counters are not included in the metadata. Such consumers would already need to handle that case for Histograms and Summaries.

Mar 03 '23 21:03 csmarchbanks
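
For readers landing on this thread: the consumer-side handling described in the last reply could look roughly like the sketch below. It is only an illustration, not code from Telegraf, Prometheus, or client_python. The names SUFFIXES_BY_TYPE and resolve_metadata are hypothetical, and the suffix table is an assumption based on the sample-name suffixes that OpenMetrics defines for counters, histograms, and summaries. The metadata argument is assumed to have the shape of the payloads quoted in the first post.

```python
# Sketch only: map a series name back to its MetricFamily entry in the
# output of /api/v1/metadata. All names here are hypothetical.

# Sample-name suffixes a family of each type can produce (assumed from
# the OpenMetrics text format).
SUFFIXES_BY_TYPE = {
    "counter": ("_total", "_created"),
    "histogram": ("_bucket", "_count", "_sum", "_created"),
    "summary": ("_count", "_sum", "_created"),
}


def resolve_metadata(series_name, metadata):
    """Return (family_name, entry) for series_name, or None if unresolved.

    metadata maps a name to a list of {"type", "help", "unit"} dicts.
    """
    # An exact match covers gauges, summary quantile series, and targets
    # whose metadata already carries the full series name.
    if metadata.get(series_name):
        return series_name, metadata[series_name][0]
    # Otherwise strip a known suffix, but accept the family only if its
    # type is one that actually produces that suffix.
    for metric_type, suffixes in SUFFIXES_BY_TYPE.items():
        for suffix in suffixes:
            if not series_name.endswith(suffix):
                continue
            family = series_name[: -len(suffix)]
            for entry in metadata.get(family, []):
                if entry.get("type") == metric_type:
                    return family, entry
    return None


# Usage with the payload from the first post.
metadata = {
    "my_metric_count": [
        {"type": "counter", "help": "My helpful description", "unit": ""}
    ]
}
print(resolve_metadata("my_metric_count_total", metadata))
```

Checking the type before accepting a stripped name is a deliberate choice in this sketch: it keeps a gauge that merely happens to be named like a prefix of the series from being matched by accident.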