EmrContainerOperator in Async mode doesn't respect default "infinite" polling number
Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
apache-airflow-providers-amazon[aiobotocore]==8.24.0
Apache Airflow version
2.7.3
Operating System
"Debian GNU/Linux 11 (bullseye)"
Deployment
Official Apache Airflow Helm Chart
Deployment details
Deployment to EKS
What happened
An EMR on EKS job timed out unexpectedly with the following error when EmrContainerOperator was used in deferrable mode:
File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/utils/waiter_with_logging.py", line 133, in async_wait
raise AirflowException("Waiter error: max attempts reached")
airflow.exceptions.AirflowException: Waiter error: max attempts reached
This happened even though I did not provide any max_attempts value.
What you think should happen instead
The operator should keep polling until the job run either fails or succeeds.
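As a rough illustration of the expected behaviour (a sketch only, not the provider's actual waiter code): when no maximum is given, the poll loop should be unbounded and stop only on a terminal state. The names poll_until_terminal, make_fake_job and MaxAttemptsReached are hypothetical, and the set of terminal states is an assumption made for this example.

```python
import asyncio
import itertools
from typing import Awaitable, Callable, Optional

# Assumed terminal states for this sketch; not taken from the provider source.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


class MaxAttemptsReached(Exception):
    """Raised when a bounded poll loop runs out of attempts."""


async def poll_until_terminal(
    get_state: Callable[[], Awaitable[str]],
    delay: float = 0.0,
    max_attempts: Optional[int] = None,
) -> str:
    # None means "no limit": count upwards forever instead of using a finite range.
    attempts = itertools.count(1) if max_attempts is None else range(1, max_attempts + 1)
    for _ in attempts:
        state = await get_state()
        if state in TERMINAL_STATES:
            return state
        await asyncio.sleep(delay)
    raise MaxAttemptsReached("max attempts reached")


def make_fake_job(running_polls: int) -> Callable[[], Awaitable[str]]:
    # Hypothetical stand-in for the real status call: reports RUNNING
    # for `running_polls` polls, then COMPLETED.
    states = iter(["RUNNING"] * running_polls + ["COMPLETED"])

    async def get_state() -> str:
        return next(states)

    return get_state


if __name__ == "__main__":
    # With no limit, a job that needs 1000 polls still finishes.
    print(asyncio.run(poll_until_terminal(make_fake_job(1000))))
```

Passing max_attempts=600 to the same call raises MaxAttemptsReached instead, which mirrors the failure reported above.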
How to reproduce
Trigger a long-running job (over 5 hours) using EmrContainerOperator in async/deferrable mode.
Anything else
I believe it's caused by the defaults defined here: https://github.com/apache/airflow/blob/6c12744dd8656e1d8b066c7edc8f0ab60ac124d2/airflow/providers/amazon/aws/triggers/emr.py#L185-L186
This contradicts the documentation: https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/operators/emr/index.html#airflow.providers.amazon.aws.operators.emr.EmrContainerOperator
which states: max_polling_attempts (int | None) – Maximum number of times to wait for the job run to finish. Defaults to None, which will poll until the job is not in a pending, submitted, or running state.
That doesn't seem to be the case in deferrable mode, hence this issue.
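To show why the failure would appear at around the 5 hour mark, here is a quick back-of-the-envelope calculation. It assumes the trigger defaults at the linked lines are a 30 second delay and 600 attempts; those values are not quoted in this issue, so treat them as an assumption and check them against the linked source. max_wait is a hypothetical helper.

```python
from datetime import timedelta

# Assumed trigger defaults (not quoted in this issue; verify against the linked source).
WAITER_DELAY_SECONDS = 30
WAITER_MAX_ATTEMPTS = 600


def max_wait(delay_seconds: int, max_attempts: int) -> timedelta:
    # Upper bound on how long a bounded waiter polls before giving up.
    return timedelta(seconds=delay_seconds * max_attempts)


if __name__ == "__main__":
    print(max_wait(WAITER_DELAY_SECONDS, WAITER_MAX_ATTEMPTS))  # 5:00:00
```

Under that assumption, any job running longer than 5 hours would hit "max attempts reached" unless the limit is raised explicitly.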
Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Hey @akomisarek, are you working on this issue, or is it open for a PR?
Hi @STAR-173 - no, I haven't started working on this yet and would only be able to pick it up Tuesday/Wednesday, so if you can work on it earlier, feel free to pick it up. Thanks! :)
Sorry, I should probably have added a follow-up comment: for the time being I've dropped the idea of using the async operator, as I also hit this problem: https://github.com/apache/airflow/issues/36090
So for now I've moved back to the sync approach, and noticed it works even better because it prints out the logs, which is often useful. So I won't be contributing a fix for the time being :(