MAINT status should not return 0
When a frontend or backend is set to maintenance mode, it is down on purpose and should not report a failure state to Prometheus/Grafana by returning 0. Instead I suggest a return code of 2, so Grafana can give the status a special color in case a load balancer is equipped with such a setting.
We've tried changing the return value in the code but that didn't seem to do the trick.
To avoid the complexity of interpreting different status codes, I would propose renaming the up metric to something like:
status{status="up", backend="my_backend_name"} 42
status{status="down", backend="my_backend_name"} 5
status{status="maint", backend="my_backend_name"} 18
WDYT @grobie?
Strongly advise this change
waiting for @grobie acknowledgment before starting any patch
I wouldn't change the current up metric, it's being used in many dashboards and alert expressions. I'd be fine adding an additional metric, but I'm a bit worried about the label cardinality. I'm counting at least 9 different status values. Maybe only add that metric to backend and frontend lines?
@grobie the MAINT status is only available in the backend. At the very least, MAINT shouldn't return 0, because planned maintenance is by definition neither an error nor an unwanted condition. Normally the status page of HAProxy only has the statuses UP, DOWN, MAINT and DRAIN (see: Link).
DRAIN is also a forced action that the administrators of the proxy want, and thus should not be considered a failure either. Generally speaking, UP and DOWN are conditions that can occur without any action (e.g. in an error case), while DRAIN and MAINT are forced actions which should be treated differently. Hence my suggestion to return 2 and expand the HAProxy metrics.
ok I can add what I proposed
The up metric is a common pattern in Prometheus and is a boolean value, either 0 or 1. An instance or server which can't serve requests is not up; whether it's not up because of errors or because of maintenance is not answered by this metric. If that distinction is relevant to users, I'm happy to accept a PR which adds a new metric broken down by status type.
I would very much like the status to be exposed, particularly per server. A few thoughts (partly echoing what has already been said):
- A return value of 2 would be meaningless once you start summing across metrics.
- Not (numerically) counting MAINT as down is dangerous. Let's say I alert when I have less than 5 servers ready. I normally have 10 servers in the backend, I take 2 down for maintenance (MAINT), then 4 of the remaining servers fail. If MAINT isn't being counted as down I will not get an alert.
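To make the arithmetic in that scenario concrete, here is a small Go sketch. It is only an illustration under stated assumptions: `readyCount` is a hypothetical helper that mimics summing a per-server gauge in an alert expression, and `maintValue` stands for whatever number would be reported for a server in MAINT. It is not code from the exporter.

```go
package main

import "fmt"

// readyCount mimics summing a per-server gauge across a backend.
// maintValue is the (hypothetical) number reported for a server in MAINT;
// UP contributes 1 and every other status contributes 0.
func readyCount(statuses []string, maintValue float64) float64 {
	var total float64
	for _, s := range statuses {
		switch s {
		case "UP":
			total++
		case "MAINT":
			total += maintValue
		}
	}
	return total
}

func main() {
	// 10 servers: 2 taken into MAINT, 4 of the remaining 8 have failed.
	statuses := []string{
		"UP", "UP", "UP", "UP",
		"DOWN", "DOWN", "DOWN", "DOWN",
		"MAINT", "MAINT",
	}
	const threshold = 5 // alert when fewer than 5 servers are ready

	for _, v := range []float64{0, 2} {
		sum := readyCount(statuses, v)
		fmt.Printf("MAINT reported as %g: sum=%g, alert fires: %t\n", v, sum, sum < threshold)
	}
}
```

With MAINT reported as 0 the sum is 4 and the alert fires. With MAINT reported as 2 the sum is 8, so the alert stays silent even though only 4 servers can take traffic.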
I would support leaving the current up metric as is, and creating a new metric e.g:
haproxy_server_status{status="MAINT",backend="foo",instance="bar",job="haproxy",server="server1"} 1
Transitional statuses could perhaps be normalised, e.g "UP 1/3" and "UP 2/3" become "UP".
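A rough Go sketch of what I mean, in case it helps whoever picks this up. Everything here is an assumption for illustration: `normalizeStatus`, `statusSeries` and the `knownStatuses` list are hypothetical names, the list only covers the four statuses mentioned in this thread, and none of it is taken from the exporter's code.

```go
package main

import (
	"fmt"
	"strings"
)

// knownStatuses is an assumed, fixed set of label values. Keeping it small
// bounds the label cardinality of the new metric.
var knownStatuses = []string{"UP", "DOWN", "MAINT", "DRAIN"}

// normalizeStatus collapses transitional values such as "UP 1/3" or
// "UP 2/3" to their base status by keeping only the first field.
func normalizeStatus(raw string) string {
	fields := strings.Fields(raw)
	if len(fields) == 0 {
		return ""
	}
	return strings.ToUpper(fields[0])
}

// statusSeries returns one sample per known status: 1 for the current
// status and 0 for all others. A status outside knownStatuses yields
// all zeros.
func statusSeries(raw string) map[string]float64 {
	current := normalizeStatus(raw)
	series := make(map[string]float64, len(knownStatuses))
	for _, s := range knownStatuses {
		if s == current {
			series[s] = 1
		} else {
			series[s] = 0
		}
	}
	return series
}

func main() {
	series := statusSeries("UP 2/3")
	for _, s := range knownStatuses {
		fmt.Printf("haproxy_server_status{status=%q,server=\"server1\"} %g\n", s, series[s])
	}
}
```

Emitting a 0 sample for every status that does not apply keeps the series stable over time, at the cost of one series per status per server.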
So rather than changing the metric, adding a new attribute is the right way to go so users can decide whether to work with that new information being parsed or not without breaking anything they've created so far.
Unfortunately I'm not familiar with Go syntax, and although I understand parts of it, I think it's better if someone who knows what they're doing opens the pull request. The suggestion of @Tom-Fawcett seems to point in the right direction.
Any plans on implementing this one in the near future?
@jzielke-nli I have opened #101. Depending on feedback it may require some additional work.
@grobie Please be so kind and commit this change if ok.
With a server's MAINT status exposed as a new metric, as in this thread's example, how would that work in Grafana where you have a single panel for the status of a server?
Any update on this? @grobie
Dead end here?
Any update on this one?
Hi, I am closing this issue because we are retiring this exporter. We will not be implementing new features anymore.
Please use the Prometheus support in HAProxy directly. It may already support this; if not, please open an issue against the HAProxy repository.