
Usage of Counter metrics

Open XiaosongWen opened this issue 3 years ago • 4 comments

Hi, I noticed that across my different batch jobs, if I create the same metric, the metric resets to 0 no matter whether I use pushadd_to_gateway or push_to_gateway. Is it supposed to be like that? If so, is my only option to keep the same metric name but give it a different label in every job?

Any suggestions are appreciated. Thanks.

XiaosongWen, Sep 14 '22 19:09

Hello,

I believe that is the correct behavior. If it is a completely different batch job, in other words one doing a different task, then you should use a different job label. If it is the same batch job just run at a different time, then Prometheus will pick up the counter reset, and functions like rate or increase will work properly.

csmarchbanks, Sep 16 '22 15:09

@csmarchbanks Thanks for your response. Are you saying that if I am running the same batch job at different times, and I am doing it right, I only need one counter? I am running a Tekton Pipeline, and in my task I have this code:

from prometheus_client import CollectorRegistry, Counter, pushadd_to_gateway

registry = CollectorRegistry()
counter = Counter(metric_name, metrics_help, labelnames=labels, registry=registry)
for _ in xxx:
    ...
    counter.labels(label1, label2).inc()
pushadd_to_gateway(push_gateway_url, job=job, registry=registry, timeout=20)

So I am keeping the two labels and the job the same, but it seems the counter is not incrementing across two runs. Is it because I am always creating a new CollectorRegistry() every time my job starts?

XiaosongWen, Sep 16 '22 21:09

I think you are doing it correctly; however, counters are a bit tricky. The Pushgateway explicitly does not add counters together across runs, which means that what you might be seeing is that you increment the same number of times in each run, so the same value is shown and no change is visible.
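To illustrate why identical values hide a second run, here is a rough pure-Python sketch of how a reset-aware increase behaves on scraped samples. It is a simplification for illustration only, not Prometheus's actual implementation (the real `increase` also extrapolates to the edges of the time window), and the sample values are made up.

```python
def increase_with_resets(samples):
    """Sum the increases between consecutive samples, treating any drop as a reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            # The counter went down, so assume it restarted from 0:
            # the whole current value counts as new increase.
            total += cur
        else:
            total += cur - prev
    return total


# Hypothetical scrapes: run 1 pushes 5, run 2 pushes 3 (a visible reset).
print(increase_with_resets([0, 5, 5, 3, 3]))  # 8.0

# Both runs push 5: the scraped series never changes, so run 2 is invisible.
print(increase_with_resets([0, 5, 5, 5, 5]))  # 5.0
```

In the second case nothing in the scraped series distinguishes the second run from "no new push", which matches the symptom described above.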

There are a couple of options. The first is to reset the counter manually at some point by sending a zero value. You would want to do this either well after the batch job has run or, if the batch job takes more than a few scrape intervals to run, at the beginning of the script. The second is to use something like the Weaveworks Aggregation Gateway to aggregate metrics from multiple runs.
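A minimal sketch of the first option, assuming the job runs long enough for Prometheus to scrape the zero before the real value arrives. The metric name, label names, and the `push_zero` helper are hypothetical; only `CollectorRegistry`, `Counter`, `generate_latest`, and `push_to_gateway` come from prometheus_client.

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest, push_to_gateway


def build_zeroed_registry(metric_name, help_text, label_values):
    """Return a registry whose counter exists for the given labels with value 0."""
    registry = CollectorRegistry()
    counter = Counter(metric_name, help_text,
                      labelnames=list(label_values), registry=registry)
    # Calling .labels() without .inc() creates the labelled series at 0.
    counter.labels(**label_values)
    return registry


# Hypothetical metric and labels, for illustration only.
registry = build_zeroed_registry(
    "batch_items_processed",
    "Items processed by the batch job",
    {"stage": "load", "status": "ok"},
)

# Inspect what would be pushed, without contacting any gateway.
print(generate_latest(registry).decode())


def push_zero(gateway_url, job):
    """Push the zeroed counter so the next scrape sees an explicit reset."""
    push_to_gateway(gateway_url, job=job, registry=registry, timeout=20)
```

Calling `push_zero(push_gateway_url, job)` at the start of the script, and pushing the real counts at the end as before, should give Prometheus a drop to 0 between runs even when two runs end at the same value.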

csmarchbanks, Sep 20 '22 15:09

My current solution is adding a grouping key, so for a batch job A, different runs will have the same job but a different instance name, e.g. {job="A", instance="run1..n"}, and I will do the aggregation in PromQL. One of the drawbacks is that it will create a lot of "groups" in the Pushgateway; I am not sure whether that will cause any problems.
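Roughly, the approach looks like the sketch below. The metric name and the `push_run` / `delete_run` helpers are hypothetical; `grouping_key`, `handler`, `pushadd_to_gateway`, and `delete_from_gateway` are existing prometheus_client parameters and functions. The cleanup helper is one possible way to keep the number of groups bounded, with the caveat that deleting a group also removes its value from any sum over the remaining groups.

```python
from prometheus_client import (CollectorRegistry, Counter,
                               delete_from_gateway, pushadd_to_gateway)
from prometheus_client.exposition import default_handler


def push_run(gateway_url, job, run_id, processed, handler=default_handler):
    """Push one run's count into its own group: {job=<job>, instance=<run_id>}."""
    registry = CollectorRegistry()
    # Hypothetical metric name, for illustration only.
    counter = Counter("batch_items_processed",
                      "Items processed by one run", registry=registry)
    counter.inc(processed)
    pushadd_to_gateway(gateway_url, job=job, registry=registry,
                       grouping_key={"instance": run_id},
                       timeout=20, handler=handler)


def delete_run(gateway_url, job, run_id, handler=default_handler):
    """Remove one run's group from the Pushgateway once it is no longer needed."""
    delete_from_gateway(gateway_url, job=job,
                        grouping_key={"instance": run_id},
                        timeout=20, handler=handler)


# Aggregation then happens on the query side, for example in PromQL:
#   sum by (job) (batch_items_processed_total)
```

For example, `push_run(push_gateway_url, "A", "run1", 3)` sends the metrics to the group with job "A" and instance "run1", leaving the groups of earlier runs untouched.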

XiaosongWen, Sep 20 '22 19:09