
EmrContainerOperator in Async mode doesn't respect default "infinite" polling number

Open akomisarek opened this issue 1 year ago • 2 comments

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

apache-airflow-providers-amazon[aiobotocore]==8.24.0

Apache Airflow version

2.7.3

Operating System

"Debian GNU/Linux 11 (bullseye)"

Deployment

Official Apache Airflow Helm Chart

Deployment details

Deployment to EKS

What happened

An EMR on EKS job timed out unexpectedly with the following error when EmrContainerOperator was used in deferrable mode:

  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/utils/waiter_with_logging.py", line 133, in async_wait
    raise AirflowException("Waiter error: max attempts reached")
airflow.exceptions.AirflowException: Waiter error: max attempts reached

This happened even though no max_attempts value was provided.

What you think should happen instead

The job should be polled until it reaches a FAILED or SUCCESSFUL state.

How to reproduce

Trigger a long-running job (over 5 hours) using EmrContainerOperator in async/deferrable mode.

Anything else

I believe it's caused by the defaults defined here: https://github.com/apache/airflow/blob/6c12744dd8656e1d8b066c7edc8f0ab60ac124d2/airflow/providers/amazon/aws/triggers/emr.py#L185-L186

This contradicts the documentation: https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/operators/emr/index.html#airflow.providers.amazon.aws.operators.emr.EmrContainerOperator

The documentation states: max_polling_attempts (int | None) – Maximum number of times to wait for the job run to finish. Defaults to None, which will poll until the job is not in a pending, submitted, or running state.

That doesn't seem to be the case in deferrable mode, hence this issue.
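To make the mismatch concrete, here is a minimal, self-contained sketch of the two behaviours. It is not the provider's code: `poll_until_terminal`, `effective_timeout_seconds`, and `TERMINAL_STATES` are hypothetical names, and the 30-second delay and 600 attempts used in the usage note are my assumption about the linked defaults (they would give exactly the 5-hour cutoff seen in the reproduction).

```python
import itertools
from typing import Callable, Optional

# Assumed terminal states, for illustration only.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def poll_until_terminal(
    get_state: Callable[[], str],
    max_attempts: Optional[int] = None,
    delay: float = 0.0,
    sleep: Callable[[float], None] = lambda seconds: None,
) -> str:
    """Poll get_state() until it returns a terminal state.

    max_attempts=None means "poll forever", which is what the operator
    documentation describes. An integer caps the number of polls.
    """
    if max_attempts is None:
        attempts = itertools.count(1)  # unbounded
    else:
        attempts = range(1, max_attempts + 1)  # bounded
    for _ in attempts:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(delay)
    raise RuntimeError("Waiter error: max attempts reached")


def effective_timeout_seconds(delay: float, max_attempts: Optional[int]) -> Optional[float]:
    """Upper bound on total wait time, or None when polling is unbounded."""
    return None if max_attempts is None else delay * max_attempts
```

Under the assumed defaults, `effective_timeout_seconds(30, 600)` is 18000 seconds, i.e. 5 hours, whereas the documented `None` default would correspond to no upper bound at all.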

Are you willing to submit PR?

  • [X] Yes I am willing to submit a PR!

Code of Conduct

akomisarek, Jun 28 '24 18:06

Hey @akomisarek, are you working on this issue, or is it open for a PR?

STAR-173, Jun 30 '24 19:06

Hi @STAR-173 - no, I haven't started working on this yet, and I would only be able to pick it up Tuesday/Wednesday, so if you can work on it earlier, feel free to pick it up. Thanks! :)

akomisarek, Jun 30 '24 20:06

Sorry, I should probably have added a follow-up comment: for the time being I have ditched the idea of the async operator, as I also hit this problem: https://github.com/apache/airflow/issues/36090

So for now I have moved back to the sync approach, and noticed it works even better because it prints out logs, which is often helpful. So I won't contribute for the time being :(

akomisarek, Jul 24 '24 09:07