
Pods are still running even after receiving SIGTERM Terminating subprocesses

Open paramjeet01 opened this issue 1 year ago • 5 comments

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.8.3

What happened?

Airflow terminated the task and scheduled it for a retry. However, during the subsequent retry attempt, an error occurred indicating that a pod with identical labels still persisted. Upon inspection, I found the pods from the initial attempt were still active.

First attempt error log:

[2024-04-18, 01:30:40 IST] {local_task_job_runner.py:121} ERROR - Received SIGTERM. Terminating subprocesses
[2024-04-18, 01:30:40 IST] {process_utils.py:131} INFO - Sending 15 to group 125. PIDs of all processes in the group: [125]
[2024-04-18, 01:30:40 IST] {process_utils.py:86} INFO - Sending the signal 15 to group 125
[2024-04-18, 01:30:40 IST] {taskinstance.py:2483} ERROR - Received SIGTERM. Terminating subprocesses.
[2024-04-18, 01:30:40 IST] {taskinstance.py:2731} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 157, in execute
    self.pod_manager.fetch_requested_container_logs(
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 580, in fetch_requested_container_logs
    status = self.fetch_container_logs(pod=pod, container_name=c, follow=follow_logs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 503, in fetch_container_logs
    last_log_time, exc = consume_logs(since_time=last_log_time)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 450, in consume_logs
    for raw_line in logs:
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 240, in __iter__
    for data_chunk in self.response.stream(amt=None, decode_content=True):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 933, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 1073, in read_chunked
    self._update_chunk_length()
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 1001, in _update_chunk_length
    line = self._fp.fp.readline()  # type: ignore[union-attr]
  File "/usr/local/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.10/ssl.py", line 1307, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.10/ssl.py", line 1163, in read
    return self._sslobj.read(len, buffer)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 2485, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 184, in execute
    raise AirflowException(f'Pod Launching failed: {ex}')
airflow.exceptions.AirflowException: Pod Launching failed: Task received SIGTERM signal
[2024-04-18, 01:30:40 IST] {taskinstance.py:527} DEBUG - Task Duration set to 10.25225
[2024-04-18, 01:30:40 IST] {taskinstance.py:549} DEBUG - Clearing next_method and next_kwargs.
[2024-04-18, 01:30:40 IST] {taskinstance.py:1149} INFO - Marking task as UP_FOR_RETRY. 

Second attempt error log:

[2024-04-18, 01:32:20 IST] {pod.py:1109} ERROR - 'NoneType' object has no attribute 'metadata'
Traceback (most recent call last):
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 153, in execute
    self.remote_pod = self.find_pod(self.pod.metadata.namespace, context=context)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 523, in find_pod
    raise AirflowException(f"More than one pod running with labels {label_selector}")
airflow.exceptions.AirflowException: More than one pod running with labels {**** our labels *****}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 184, in execute
    raise AirflowException(f'Pod Launching failed: {ex}')
airflow.exceptions.AirflowException: Pod Launching failed: More than one pod running with labels {**** our labels *****}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 937, in patch_already_checked
    name=pod.metadata.name,
AttributeError: 'NoneType' object has no attribute 'metadata'
[2024-04-18, 01:32:20 IST] {taskinstance.py:2731} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 153, in execute
    self.remote_pod = self.find_pod(self.pod.metadata.namespace, context=context)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 523, in find_pod
    raise AirflowException(f"More than one pod running with labels {label_selector}")
airflow.exceptions.AirflowException: More than one pod running with labels {**** our labels *****}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 184, in execute
    raise AirflowException(f'Pod Launching failed: {ex}')
airflow.exceptions.AirflowException: Pod Launching failed: {**** our labels *****}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 186, in execute
    self.cleanup(
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 839, in cleanup
    raise AirflowException(
airflow.exceptions.AirflowException: Pod { *** our pod name ***} returned a failure.
remote_pod: None
[2024-04-18, 01:32:20 IST] {taskinstance.py:527} DEBUG - Task Duration set to 1.162314
[2024-04-18, 01:32:20 IST] {taskinstance.py:549} DEBUG - Clearing next_method and next_kwargs.
[2024-04-18, 01:32:20 IST] {taskinstance.py:1149} INFO - Marking task as UP_FOR_RETRY. 
[2024-04-18, 01:32:20 IST] {plugins.py:178} INFO - Getting Refined Message
[2024-04-18, 01:32:20 IST] {plugins.py:180} INFO - Message Payload Not Provided
[2024-04-18, 01:32:20 IST] {logging_mixin.py:188} INFO - {'Airflow Exception': 'Pod Launching failed'}
[2024-04-18, 01:32:20 IST] {plugins.py:183} INFO - Task message: {'Airflow Exception': 'Pod Launching failed'}

What you think should happen instead?

Once SIGTERM is issued to the task and its subprocesses are terminated, the operator should properly delete the pod.
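For illustration only, a minimal sketch of the behaviour we expect (SigtermSafePodOperator is a made-up name, and our real plugin code differs): catch the SIGTERM-raised exception in execute and delete the launched pod before re-raising, so the retry does not find a leftover pod with the same labels.

# Hypothetical sketch, not the actual plugin code.
from airflow.exceptions import AirflowException
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

class SigtermSafePodOperator(KubernetesPodOperator):
    def execute(self, context):
        try:
            return super().execute(context)
        except AirflowException:
            # "Task received SIGTERM signal" surfaces here as an AirflowException;
            # delete the launched pod (if any) before re-raising so the next
            # attempt does not see a second pod with identical labels.
            if getattr(self, "pod", None) is not None:
                self.pod_manager.delete_pod(self.pod)
            raise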

How to reproduce

Let Airflow kill your task with SIGTERM; on the next retry you will hit an error that a pod with the same labels already exists.
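A hypothetical minimal DAG to trigger it (the DAG id, image, and timings are made up): run a long pod with log streaming enabled, then send SIGTERM to the local task process, for example by restarting the scheduler.

from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG("sigterm_repro", start_date=datetime(2024, 1, 1), schedule=None, catchup=False):
    KubernetesPodOperator(
        task_id="long_running_pod",
        name="long-running-pod",
        namespace="default",
        image="busybox",
        cmds=["sh", "-c", "sleep 3600"],
        get_logs=True,  # SIGTERM arrives while fetch_container_logs is streaming
        retries=1,      # the retry is where "More than one pod running with labels" appears
    )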

Operating System

Amazon Linux 2

Versions of Apache Airflow Providers

pytest>=6.2.5
docker>=5.0.0
crypto>=1.4.1
cryptography>=3.4.7
pyOpenSSL>=20.0.1
ndg-httpsclient>=0.5.1
boto3>=1.34.0
sqlalchemy
redis>=3.5.3
requests>=2.26.0
pysftp>=0.2.9
werkzeug>=1.0.1
apache-airflow-providers-cncf-kubernetes==8.0.0
apache-airflow-providers-amazon>=8.13.0
psycopg2>=2.8.5
grpcio>=1.37.1
grpcio-tools>=1.37.1
protobuf>=3.15.8,<=3.21
python-dateutil>=2.8.2
jira>=3.1.1
confluent_kafka>=1.7.0
pyarrow>=10.0.1,<10.1.0

Deployment

Official Apache Airflow Helm Chart

Deployment details

Official helm chart deployment

Anything else?

No response

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

paramjeet01 avatar Apr 17 '24 22:04 paramjeet01

After debugging for a while, I have also found that one of our schedulers failed its liveness probe and restarted at the same time. So I'm guessing that the scheduler restart caused the SIGTERM kill and the task wasn't adopted by any scheduler.

paramjeet01 avatar Apr 21 '24 15:04 paramjeet01

Found this log in airflow scheduler:

[2024-05-03T12:09:06.476+0000] {process_utils.py:100} INFO - Sending the signal Signals.SIGTERM to process 75 as process group is missing.
[2024-05-03T12:09:05.937+0000] {process_utils.py:131} INFO - Sending Signals.SIGTERM to group 75. PIDs of all processes in the group: [42184, 42242, 42243, 42244, 42245, 42265, 75]

paramjeet01 avatar May 03 '24 15:05 paramjeet01

I believe I have identified the cause of the issue:

We are using AWS Spot EC2 instances for the workloads in Airflow. When a spot instance is terminated, the pod enters a Terminating state for around 2 minutes. During the second retry, a new pod is scheduled, and the find_pod method is used to retrieve the pod based on its labels, which results in the following error:

[2024-04-18, 01:32:20 IST] {pod.py:1109} ERROR - 'NoneType' object has no attribute 'metadata'
Traceback (most recent call last):
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 153, in execute
    self.remote_pod = self.find_pod(self.pod.metadata.namespace, context=context)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 523, in find_pod
    raise AirflowException(f"More than one pod running with labels {label_selector}")
airflow.exceptions.AirflowException: More than one pod running with labels {**** our labels *****}

At this point, we have a pod in a terminating state and a new pod created by the second retry. When the cleanup method is called, it encounters another error because the find_pod method did not return anything due to the exception:

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 937, in patch_already_checked
    name=pod.metadata.name,
AttributeError: 'NoneType' object has no attribute 'metadata'

After every retry a new pod is created and never cleaned up, so this loops forever.
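As an illustration of a possible workaround (not an official provider change, and find_active_pods is a made-up helper), the stuck pod can be told apart from the freshly created one by metadata.deletion_timestamp, so a label-based lookup could skip pods that are already being deleted:

# Illustrative sketch using the standard Kubernetes Python client: list pods by
# the same label selector, but ignore pods that are already being deleted
# (e.g. stuck in Terminating after a spot-instance interruption).
from kubernetes import client, config

def find_active_pods(namespace: str, label_selector: str) -> list:
    config.load_incluster_config()  # use load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace=namespace, label_selector=label_selector)
    # A pod in the Terminating state has metadata.deletion_timestamp set.
    return [p for p in pods.items if p.metadata.deletion_timestamp is None]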

paramjeet01 avatar May 18 '24 09:05 paramjeet01

I hope this is solved in the latest version by this PR: https://github.com/apache/airflow/pull/37671/files. @dirrao, can you confirm whether the above PR solves this issue?

paramjeet01 avatar May 18 '24 13:05 paramjeet01

@paramjeet01 The Kubernetes executor and pod operator have gone through multiple improvements in the past. I would request you to try the latest versions and re-confirm the issue.

dirrao avatar May 19 '24 05:05 dirrao

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

github-actions[bot] avatar Jun 18 '24 00:06 github-actions[bot]

This issue has been closed because it has not received response from the issue author.

github-actions[bot] avatar Jun 28 '24 00:06 github-actions[bot]