# Pods are still running even after receiving SIGTERM (Terminating subprocesses)
### Apache Airflow version

Other Airflow 2 version (please specify below)

### If "Other Airflow 2 version" selected, which one?

2.8.3

### What happened?

Airflow terminated the task and scheduled it for a retry. During the subsequent retry attempt, however, an error occurred indicating that a pod with identical labels still existed. Upon inspection, I found that the pod from the initial attempt was still active.
First attempt error log:

```
[2024-04-18, 01:30:40 IST] {local_task_job_runner.py:121} ERROR - Received SIGTERM. Terminating subprocesses
[2024-04-18, 01:30:40 IST] {process_utils.py:131} INFO - Sending 15 to group 125. PIDs of all processes in the group: [125]
[2024-04-18, 01:30:40 IST] {process_utils.py:86} INFO - Sending the signal 15 to group 125
[2024-04-18, 01:30:40 IST] {taskinstance.py:2483} ERROR - Received SIGTERM. Terminating subprocesses.
[2024-04-18, 01:30:40 IST] {taskinstance.py:2731} ERROR - Task failed with exception
Traceback (most recent call last):
File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 157, in execute
self.pod_manager.fetch_requested_container_logs(
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 580, in fetch_requested_container_logs
status = self.fetch_container_logs(pod=pod, container_name=c, follow=follow_logs)
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 503, in fetch_container_logs
last_log_time, exc = consume_logs(since_time=last_log_time)
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 450, in consume_logs
for raw_line in logs:
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 240, in __iter__
for data_chunk in self.response.stream(amt=None, decode_content=True):
File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 933, in stream
yield from self.read_chunked(amt, decode_content=decode_content)
File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 1073, in read_chunked
self._update_chunk_length()
File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 1001, in _update_chunk_length
line = self._fp.fp.readline() # type: ignore[union-attr]
File "/usr/local/lib/python3.10/socket.py", line 705, in readinto
return self._sock.recv_into(b)
File "/usr/local/lib/python3.10/ssl.py", line 1307, in recv_into
return self.read(nbytes, buffer)
File "/usr/local/lib/python3.10/ssl.py", line 1163, in read
return self._sslobj.read(len, buffer)
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 2485, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
result = _execute_callable(context=context, **execute_callable_kwargs)
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
return execute_callable(context=context, **execute_callable_kwargs)
File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 184, in execute
raise AirflowException(f'Pod Launching failed: {ex}')
airflow.exceptions.AirflowException: Pod Launching failed: Task received SIGTERM signal
[2024-04-18, 01:30:40 IST] {taskinstance.py:527} DEBUG - Task Duration set to 10.25225
[2024-04-18, 01:30:40 IST] {taskinstance.py:549} DEBUG - Clearing next_method and next_kwargs.
[2024-04-18, 01:30:40 IST] {taskinstance.py:1149} INFO - Marking task as UP_FOR_RETRY.
```
Second attempt error log:

```
[2024-04-18, 01:32:20 IST] {pod.py:1109} ERROR - 'NoneType' object has no attribute 'metadata'
Traceback (most recent call last):
File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 153, in execute
self.remote_pod = self.find_pod(self.pod.metadata.namespace, context=context)
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 523, in find_pod
raise AirflowException(f"More than one pod running with labels {label_selector}")
airflow.exceptions.AirflowException: More than one pod running with labels {**** our labels *****}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 184, in execute
raise AirflowException(f'Pod Launching failed: {ex}')
airflow.exceptions.AirflowException: Pod Launching failed: More than one pod running with labels {**** our labels *****}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 937, in patch_already_checked
name=pod.metadata.name,
AttributeError: 'NoneType' object has no attribute 'metadata'
[2024-04-18, 01:32:20 IST] {taskinstance.py:2731} ERROR - Task failed with exception
Traceback (most recent call last):
File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 153, in execute
self.remote_pod = self.find_pod(self.pod.metadata.namespace, context=context)
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 523, in find_pod
raise AirflowException(f"More than one pod running with labels {label_selector}")
airflow.exceptions.AirflowException: More than one pod running with labels {**** our labels *****}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 184, in execute
raise AirflowException(f'Pod Launching failed: {ex}')
airflow.exceptions.AirflowException: Pod Launching failed: {**** our labels *****}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
result = _execute_callable(context=context, **execute_callable_kwargs)
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
return execute_callable(context=context, **execute_callable_kwargs)
File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 186, in execute
self.cleanup(
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 839, in cleanup
raise AirflowException(
airflow.exceptions.AirflowException: Pod { *** our pod name ***} returned a failure.
remote_pod: None
[2024-04-18, 01:32:20 IST] {taskinstance.py:527} DEBUG - Task Duration set to 1.162314
[2024-04-18, 01:32:20 IST] {taskinstance.py:549} DEBUG - Clearing next_method and next_kwargs.
[2024-04-18, 01:32:20 IST] {taskinstance.py:1149} INFO - Marking task as UP_FOR_RETRY.
[2024-04-18, 01:32:20 IST] {plugins.py:178} INFO - Getting Refined Message
[2024-04-18, 01:32:20 IST] {plugins.py:180} INFO - Message Payload Not Provided
[2024-04-18, 01:32:20 IST] {logging_mixin.py:188} INFO - {'Airflow Exception': 'Pod Launching failed'}
[2024-04-18, 01:32:20 IST] {plugins.py:183} INFO - Task message: {'Airflow Exception': 'Pod Launching failed'}
```
### What you think should happen instead?

Once SIGTERM is issued to the task and its subprocesses are terminated, the operator should properly delete the pod.

### How to reproduce

Let Airflow kill your task with SIGTERM; on the next retry you'll hit the "more than one pod running with the same labels" error.
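A minimal sketch of the kind of task this bites, assuming the stock `KubernetesPodOperator`; the DAG id, image, and command here are hypothetical. Run it, then kill the task instance (or restart the scheduler) so the local task job receives SIGTERM while logs are streaming:

```python
# Hypothetical repro DAG; ids and image are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="sigterm_repro",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    long_task = KubernetesPodOperator(
        task_id="long_running_pod",
        name="long-running-pod",
        image="busybox",
        cmds=["sh", "-c", "sleep 3600"],
        retries=2,
        get_logs=True,  # the first traceback above fails inside log streaming
    )
```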
### Operating System

Amazon Linux 2

### Versions of Apache Airflow Providers

```
pytest>=6.2.5
docker>=5.0.0
crypto>=1.4.1
cryptography>=3.4.7
pyOpenSSL>=20.0.1
ndg-httpsclient>=0.5.1
boto3>=1.34.0
sqlalchemy
redis>=3.5.3
requests>=2.26.0
pysftp>=0.2.9
werkzeug>=1.0.1
apache-airflow-providers-cncf-kubernetes==8.0.0
apache-airflow-providers-amazon>=8.13.0
psycopg2>=2.8.5
grpcio>=1.37.1
grpcio-tools>=1.37.1
protobuf>=3.15.8,<=3.21
python-dateutil>=2.8.2
jira>=3.1.1
confluent_kafka>=1.7.0
pyarrow>=10.0.1,<10.1.0
```
### Deployment

Official Apache Airflow Helm Chart

### Deployment details

Official Helm chart deployment

### Anything else?

No response

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's Code of Conduct
After debugging for a while, I also found that one of our schedulers failed its liveness probe and was restarted at the same time. So I'm guessing the scheduler restart caused the SIGTERM, and the task wasn't adopted by any scheduler.

Found these logs in the Airflow scheduler:

```
[2024-05-03T12:09:06.476+0000] {process_utils.py:100} INFO - Sending the signal Signals.SIGTERM to process 75 as process group is missing.
[2024-05-03T12:09:05.937+0000] {process_utils.py:131} INFO - Sending Signals.SIGTERM to group 75. PIDs of all processes in the group: [42184, 42242, 42243, 42244, 42245, 42265, 75]
```
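One way to check the restart hypothesis, assuming the official Helm chart's `component=scheduler` label and an `airflow` namespace, is to read the schedulers' container restart counts:

```python
# Sketch: list scheduler pods and their restart counts via the kubernetes
# Python client. Namespace and label follow the official Helm chart defaults;
# adjust for your deployment.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("airflow", label_selector="component=scheduler").items:
    for cs in pod.status.container_statuses or []:
        print(pod.metadata.name, cs.name, "restarts:", cs.restart_count)
```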
I believe I have identified the cause of the issue:

We are using AWS Spot EC2 instances for the workloads in Airflow. When a spot instance is terminated, the pod enters a Terminating state for around 2 minutes. During the retry, a new pod is scheduled while the old one is still terminating, and the find_pod method, which retrieves the pod by its labels, matches both and fails with the following error:

```
[2024-04-18, 01:32:20 IST] {pod.py:1109} ERROR - 'NoneType' object has no attribute 'metadata'
Traceback (most recent call last):
File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 153, in execute
self.remote_pod = self.find_pod(self.pod.metadata.namespace, context=context)
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 523, in find_pod
raise AirflowException(f"More than one pod running with labels {label_selector}")
airflow.exceptions.AirflowException: More than one pod running with labels {**** our labels *****}
```
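At that moment both pods match the operator's label selector. As a sanity check (the namespace and label values below are placeholders for your own), listing the pods shows the old one still counted as running:

```python
# Sketch: list pods matching the task's label selector. During the spot
# termination window, both the old (Terminating) and the new pod appear:
# the old one still reports phase "Running" but has deletion_timestamp set.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="airflow",  # placeholder
    label_selector="dag_id=my_dag,task_id=my_task,run_id=my_run",  # placeholders
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase, pod.metadata.deletion_timestamp)
```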
At this point, we have a pod in a Terminating state and a new pod created by the retry. When the cleanup method is then called, it encounters another error, because find_pod raised instead of returning a pod, leaving the pod reference as None:

```
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 937, in patch_already_checked
name=pod.metadata.name,
AttributeError: 'NoneType' object has no attribute 'metadata'
```
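For illustration only, and not the actual code of any fix: the secondary AttributeError could be avoided by a guard of roughly this shape, since this path can run before any pod object was resolved:

```python
# Hypothetical defensive guard, sketched against the provider's
# patch_already_checked(); the signature is approximated from the traceback.
def patch_already_checked(self, pod, *, reattach=True):
    if pod is None:
        # find_pod() raised before a pod was resolved; there is nothing to
        # patch, and failing here only masks the "more than one pod" error.
        return
    pod.metadata.labels["already_checked"] = "True"
    body = {"metadata": {"labels": {"already_checked": "True"}}}
    self.client.patch_namespaced_pod(
        name=pod.metadata.name,
        namespace=pod.metadata.namespace,
        body=body,
    )
```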
After every retry a new pod is created and never cleaned up, so this loops forever.
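As a stopgap, these `KubernetesPodOperator` parameters (present on recent cncf.kubernetes provider versions) may narrow the window; whether they fully avoid the spot-termination race is untested here:

```python
# Sketch of possible mitigations; task_id, image, and values are placeholders.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

task = KubernetesPodOperator(
    task_id="my_task",
    name="my-task",
    image="busybox",
    cmds=["sh", "-c", "sleep 3600"],
    reattach_on_restart=True,       # try to adopt a surviving pod on retry
    on_finish_action="delete_pod",  # delete the pod whenever the task finishes
    termination_grace_period=30,    # seconds; shortens the Terminating window
)
```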
I hope this is solved in the latest version by this PR: https://github.com/apache/airflow/pull/37671/files. @dirrao, can you confirm that the above PR addresses this issue?
@paramjeet01 The Kubernetes executor and pod operator have gone through multiple improvements in the past. I would request you to try the latest versions and re-confirm the issue.
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.

This issue has been closed because it has not received a response from the issue author.