Add liveness probe to Celery workers
This adds a liveness probe to our workers, to help guard against the worker being "up" but not communicating with Celery.
Might help with #24731, though it'll be a pretty blunt solution.
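For context, a probe along these lines would typically shell out to `celery inspect ping` against the local worker. The snippet below is a minimal sketch only; the app module path, command wiring, and timing values are illustrative assumptions, not the exact template added by this PR:

```yaml
# Hypothetical worker liveness probe sketch. The celery app module path and
# all timing values are assumptions for illustration, not the chart's actual template.
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$(hostname)"'
  initialDelaySeconds: 10
  periodSeconds: 60
  timeoutSeconds: 20
  failureThreshold: 5
```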
@pingzh, very good call. Do you know of a better probe to use when it's disabled?
I'm tempted to just add an `enabled` flag around this feature so it can just be turned off. What do you think about that?
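A values-level switch could look roughly like the sketch below; the key names and defaults (`workers.livenessProbe.enabled`, etc.) are assumptions for illustration, not necessarily the final schema:

```yaml
# Hypothetical values.yaml layout for an on/off switch around the probe.
# Key names and defaults here are illustrative assumptions only.
workers:
  livenessProbe:
    enabled: true   # set to false to omit the probe entirely, e.g. for SQS users
    initialDelaySeconds: 10
    periodSeconds: 60
    timeoutSeconds: 20
    failureThreshold: 5
```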
I am not aware of any better probe method. For us, we turn off `worker_enable_remote_control` because we use SQS as the message broker, and with remote control enabled it creates lots of pidbox queues. It should be okay for other cases.
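For reference, if the probe is `celery inspect ping` based (as sketched above), it relies on remote control, so that setup would want the probe off. Disabling remote control itself goes through Airflow's standard `AIRFLOW__<SECTION>__<KEY>` environment-variable convention; where exactly the `env` entry lives in the chart values is an assumption here:

```yaml
# Sketch: turn off Celery remote control via Airflow's env-var config convention.
# The placement under `env` is illustrative; only the variable name/value matter.
env:
  - name: AIRFLOW__CELERY__WORKER_ENABLE_REMOTE_CONTROL
    value: "False"
```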
I like the idea of adding an `enabled` flag.
Is it worth adding a note somewhere about not enabling this with SQS?
SQS is not officially supported by Airflow. We discussed it, but the Amazon team's experience is that it has many more quirks, and the level of support in Celery is definitely not on par with Redis/RabbitMQ, so we should refrain from even stating that SQS can be used with Airflow: https://github.com/apache/airflow/pull/24019
@jedcunningham, I enabled the health checks for workers because workers stopped processing messages when the communication between Redis and the workers broke. After enabling the liveness checks, I ended up with high memory utilization on the worker pods. Once I disabled the liveness checks, memory utilization was fine again. Could you please help with this issue?
The liveness checks appear to be causing a memory leak.
I believe this is the known issue with the K8S livenessprobe, https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/778 - you can update K8S to the latest version and check that the CSI livenessprobe is of the right version: https://github.com/kubernetes-csi/livenessprobe/pull/94
Generally, upgrading whatever K8S you are using to the latest version is highly recommended.
Please double-check that, @anu251989, and if you still observe the same issue with the latest version of K8S, please report it as a new issue.