nats-queue-worker

Feature: add healthcheck

Open alexellis opened this issue 6 years ago • 5 comments

Expected Behaviour

Healthcheck over HTTP or an exec probe which can be used by Kubernetes to check readiness and health
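
As a rough illustration only (not an agreed design), an HTTP endpoint in the queue-worker could be as small as the Go sketch below; the /healthz path and port 8080 are assumptions made up for the example.

package main

import (
	"log"
	"net/http"
)

// startHealthServer exposes a trivial endpoint that a Kubernetes
// liveness or readiness probe could call over HTTP.
// The path and port are illustrative, not part of any agreed design.
func startHealthServer() {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("OK"))
	})

	go func() {
		if err := http.ListenAndServe(":8080", mux); err != nil {
			log.Printf("health server stopped: %v", err)
		}
	}()
}

func main() {
	startHealthServer()
	select {} // block forever; the real worker would process messages here
}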

Current Behaviour

N/A

Possible Solution

Please suggest one of the options above, or see how other projects are doing this and report back.

Context

A health-check can help with robustness.

alexellis avatar Jul 06 '19 10:07 alexellis

See also: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/

https://github.com/openfaas/faas-netes/blob/master/chart/openfaas/templates/gateway-dep.yaml#L45

alexellis avatar Jul 06 '19 10:07 alexellis

@matthiashanel what are your thoughts on this?

alexellis avatar May 12 '20 12:05 alexellis

@alexellis, I can see how adding an HTTP endpoint makes sense if the service itself serves HTTP. There you'd get some feedback, e.g. if your service slows down, so would the health check endpoint. In the queue worker this would be largely unrelated, so I don't quite see a benefit that justifies the added complexity. As for readiness: if the connect fails, the program will exit, causing a restart. When this happens, messages will continue to be stored in NATS Streaming.

Did you run into a concrete problem where this could help?

matthiashanel avatar May 13 '20 06:05 matthiashanel

Most Kubernetes services should have a way to express health and readiness via an exec, TCP, or HTTP probe. This can be used for a number of things including decisions about scaling or recovery.

If we're fairly sure that this is not required when interacting with NATS then I'll close it out.

I wonder if there is any value in exploring metrics instrumentation of the queue-worker itself, or if the metrics in the gateway and NATS itself are enough to get a good picture of things?
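
Purely as a sketch of what that instrumentation could look like (the metric names below are made up for this example and do not exist in the project today), using the standard prometheus/client_golang client:

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metrics for the queue-worker: a counter of processed
// messages and a gauge of messages currently in flight.
var (
	messagesProcessed = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "queue_worker_messages_processed_total",
		Help: "Total messages handled by the queue-worker.",
	})
	messagesInFlight = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "queue_worker_messages_in_flight",
		Help: "Messages currently being processed.",
	})
)

func main() {
	prometheus.MustRegister(messagesProcessed, messagesInFlight)

	// The message handler would call messagesInFlight.Inc()/Dec() and
	// messagesProcessed.Inc() around each invocation.

	// Expose /metrics for Prometheus to scrape; the port is illustrative.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8081", nil)
}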

alexellis avatar May 13 '20 13:05 alexellis

Health probe: the most useful value I can imagine the queue worker producing is how many messages it is currently processing. A value of 5 says little about whether scaling is needed or not. Scaling is needed if there are too many messages the service has not yet seen.

Readiness probe: the queue worker does not open a port or serve HTTP, which makes a readiness probe a tough nut to crack. Ready for the queue worker essentially means the NATS connection got established. If that does not work, the queue worker exits. I can imagine conditions where the streaming client does not return from connect. Starting a web server just to protect against this by indicating readiness seems even more complex. Do I make sense here?
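
One lighter-weight alternative, purely as a sketch (the file path and helper name are made up, and this is not something the queue-worker does today): write a marker file once the streaming connection is established and point a Kubernetes exec probe, e.g. cat /tmp/.lock, at it.

package main

import (
	"log"
	"os"
)

// markReady would be called once stan.Connect (or equivalent) has
// returned successfully. An exec probe such as `cat /tmp/.lock` can
// then report readiness without the worker serving HTTP.
// The path and helper name are illustrative only.
func markReady() {
	if err := os.WriteFile("/tmp/.lock", []byte("ready"), 0600); err != nil {
		log.Printf("could not write readiness file: %v", err)
	}
}

func main() {
	// ... connect to NATS Streaming here ...
	markReady()
	select {} // keep the process alive for the sketch
}

The corresponding exec probe in the deployment would then only need to check that the file exists.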

We will get a lot more mileage out of the metrics NATS already provides. For the nats-streaming-server, what would have to happen is opening the monitoring port with -m <port>.

This example shows how to discover channels and inspect them via curl:

nats-streaming-server -m 8080
curl http://127.0.0.1:8080/streaming/channelsz
{
  "cluster_id": "test-cluster",
  "server_id": "ZAs0tFNCNAd5CZuEm0I0xA",
  "now": "2020-05-13T14:49:04.556034-04:00",
  "offset": 0,
  "limit": 1024,
  "count": 2,
  "total": 2,
  "names": [
    "queue",
    "foo"
  ]
}
curl "http://127.0.0.1:8080/streaming/channelsz?channel=queue"
{
  "name": "queue",
  "msgs": 1,
  "bytes": 22,
  "first_seq": 1,
  "last_seq": 1
}
# this one also returns information about subscribers
curl "http://127.0.0.1:8080/streaming/channelsz?channel=foo&subs=1"

https://docs.nats.io/nats-streaming-concepts/monitoring#monitoring-a-nats-streaming-channel-with-grafana-and-prometheus

matthiashanel avatar May 13 '20 18:05 matthiashanel