Add liveness probe to Celery workers

Open jedcunningham opened this issue 3 years ago • 2 comments

This adds a liveness probe to our workers, to help guard against the worker being "up" but not communicating with Celery.

Might help with #24731, though it'll be a pretty blunt solution.

jedcunningham avatar Aug 05 '22 23:08 jedcunningham
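
For readers following along, the kind of check described above ("up" but not communicating with Celery) can be sketched in Python. This is only an illustration under stated assumptions, not the probe the PR ships: the helper names `worker_is_alive`, `celery_ping`, and `main`, and the `BROKER_URL` environment variable, are hypothetical. The only Celery call used is the remote-control ping, `app.control.inspect(...).ping()`.

```python
import os
import socket
from typing import Callable, Mapping, Optional

# A ping function takes (node_name, timeout_seconds) and returns the replies
# keyed by node name, or None when nobody answered.
PingFn = Callable[[str, float], Optional[Mapping[str, object]]]


def worker_is_alive(ping: PingFn, node: str, timeout: float = 5.0) -> bool:
    """Return True only if `node` itself answered the ping."""
    try:
        replies = ping(node, timeout)
    except Exception:
        # A broker or connection error counts as "not communicating".
        return False
    return bool(replies) and node in replies


def celery_ping(node: str, timeout: float) -> Optional[Mapping[str, object]]:
    """Ping one worker through Celery's remote-control API."""
    # Imported lazily so worker_is_alive stays usable without Celery installed.
    from celery import Celery

    # Assumption: the broker URL is exposed as BROKER_URL (hypothetical name).
    app = Celery(broker=os.environ["BROKER_URL"])
    return app.control.inspect(destination=[node], timeout=timeout).ping()


def main() -> int:
    """Return 0 when the local worker answers, 1 otherwise."""
    # Celery's default node name is celery@<hostname>.
    node = f"celery@{socket.gethostname()}"
    return 0 if worker_is_alive(celery_ping, node) else 1
```

A script wrapping this would end with `sys.exit(main())`, since Kubernetes treats a non-zero exit code from an exec probe as a failure. Because the ping travels over Celery's remote-control channel, a check like this cannot succeed when remote control is turned off on the worker.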

@pingzh, very good call. Do you know of a better probe to use when it's disabled?

I'm tempted to just add an enabled flag around this feature so it can just be turned off. What do you think about that?

jedcunningham avatar Aug 09 '22 16:08 jedcunningham

> @pingzh, very good call. Do you know of a better probe to use when it's disabled?
>
> I'm tempted to just add an enabled flag around this feature so it can just be turned off. What do you think about that?

I am not aware of any better probe methods. We turn off worker_enable_remote_control because we use SQS as the message broker, and with SQS worker_enable_remote_control creates lots of pidbox queues. It should be OK for other cases.

I like the idea of adding an enabled flag.

pingzh avatar Aug 10 '22 18:08 pingzh
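
The enabled flag discussed above could look roughly like the following Python sketch, which builds a Kubernetes exec liveness probe only when the flag is on. It is an illustration under assumptions, not the chart's actual template: `build_liveness_probe`, `container_spec`, the default timings, and the `my_app` name in the usage note are all hypothetical. The dictionary keys are standard Kubernetes probe fields.

```python
from typing import Any, Dict, List, Optional


def build_liveness_probe(
    enabled: bool,
    command: List[str],
    period_seconds: int = 60,
    timeout_seconds: int = 20,
    failure_threshold: int = 5,
) -> Optional[Dict[str, Any]]:
    """Return an exec liveness probe, or None when the feature is disabled."""
    if not enabled:
        # Deployments with worker_enable_remote_control off would set
        # enabled=False, because a ping-based command cannot succeed there.
        return None
    return {
        "exec": {"command": list(command)},
        "periodSeconds": period_seconds,
        "timeoutSeconds": timeout_seconds,
        "failureThreshold": failure_threshold,
    }


def container_spec(
    name: str, image: str, probe: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Attach the probe to a container spec only when one was built."""
    spec: Dict[str, Any] = {"name": name, "image": image}
    if probe is not None:
        spec["livenessProbe"] = probe
    return spec
```

For example, `build_liveness_probe(True, ["sh", "-c", "celery --app my_app inspect ping -d celery@$(hostname)"])` yields a probe that runs Celery's `inspect ping` against the local node, while `enabled=False` leaves the container without any liveness probe.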

Is it worth adding a note somewhere about not enabling this with SQS?

SQS is not officially supported by Airflow. We discussed it, but the Amazon team's experience is that it has many more quirks, and the level of support in Celery is definitely not on par with Redis/RabbitMQ, so we should refrain from even stating that SQS can be used in Airflow (https://github.com/apache/airflow/pull/24019).

potiuk avatar Aug 26 '22 21:08 potiuk

@jedcunningham, I enabled health checks for the workers because they stop processing messages when communication between Redis and the workers breaks. After enabling the liveness checks, we ended up with high memory utilization on the worker pods. With the liveness checks disabled, memory utilization is fine. Could you please help with this issue?

The liveness checks are causing a memory leak.

anu251989 avatar Jan 05 '23 10:01 anu251989

> @jedcunningham, I enabled health checks for the workers because they stop processing messages when communication between Redis and the workers breaks. After enabling the liveness checks, we ended up with high memory utilization on the worker pods. With the liveness checks disabled, memory utilization is fine. Could you please help with this issue?
>
> The liveness checks are causing a memory leak.

I believe this is an issue with the K8S livenessprobe (https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/778). You can update K8S to the latest version and check that the CSI livenessprobe is the right version (https://github.com/kubernetes-csi/livenessprobe/pull/94).

Generally, upgrading whatever K8S you are using to the latest version is highly recommended.

Please double-check that, @anu251989, and if you observe the same issue with the latest version of K8S, please report it as a new issue.

potiuk avatar Jan 17 '23 10:01 potiuk