Fetch served logs also when task attempt is up for retry and no remote logs available
This PR is a continuation of #39177; while testing it, I noticed that when the task's state is TaskInstanceState.UP_FOR_RETRY, the logs for all attempts are unavailable when the log source is served logs.
I also noticed that the tests previously only covered the TaskInstanceState.RUNNING case, whereas TaskInstanceState.DEFERRED (and now TaskInstanceState.UP_FOR_RETRY) also needed coverage. I changed the tests to cover all three cases.
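The gist of the change can be sketched as follows. This is a minimal illustration, not the actual Airflow log-handler code: the `should_fetch_served_logs` helper and the `EXECUTOR_STATES` set are hypothetical names, and the assumption is that served logs are a fallback used only when no remote logs exist and the task may still be colocated with its worker.

```python
from enum import Enum


class TaskInstanceState(str, Enum):
    # Subset of Airflow's task instance states relevant here
    RUNNING = "running"
    DEFERRED = "deferred"
    UP_FOR_RETRY = "up_for_retry"
    SUCCESS = "success"


# States in which the logs may still only exist on the worker that ran
# the task, so the webserver should fall back to fetching served logs.
# Before this PR, UP_FOR_RETRY was missing from the equivalent check.
EXECUTOR_STATES = {
    TaskInstanceState.RUNNING,
    TaskInstanceState.DEFERRED,
    TaskInstanceState.UP_FOR_RETRY,  # added by this PR
}


def should_fetch_served_logs(state: TaskInstanceState, remote_logs_available: bool) -> bool:
    """Hypothetical helper: fetch served logs only when no remote logs
    exist and the task may still be colocated with its worker."""
    return not remote_logs_available and state in EXECUTOR_STATES
```

With this change, viewing a task that is waiting to retry (and has no remote logs) falls back to served logs instead of showing nothing.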
@kahlstrm can you fix the failing tests?
Sure, didn't realize the tests failing are related ones, one sec.
@eladkal fixed the tests. It seems the odd behavior with TaskInstanceState.DEFERRED was fixed, and my workaround for it in these tests broke as a result.
I tried to get this into 2.9.2 but the test was failing https://github.com/apache/airflow/actions/runs/9383343196/job/25837180680?pr=40050#step:7:3547
Hmm yeah, I'd guess this has something to do with the flakiness of the TaskInstanceState.DEFERRED state, as I encountered something similar while writing these tests. I can try to take a second look at some point.
Filed a PR that simply removes the inconsistent test case for TaskInstanceState.DEFERRED, it can be found here https://github.com/apache/airflow/pull/40257 @ephraimbuddy
@kahlstrm there are some issues with this code now that i'm looking into.
But here's a question for you. Why should we check served logs when the task is up for retry? I think it's reasonable to assume that when the task is in a running state, that the logs are where they are going to be -- either shared volume, or remote storage.
The issue we encountered is that when the task instance's state was up for retry, i.e. the previous attempt had failed, the logs for the previous attempt were unavailable for the entire period the task instance spent in the up-for-retry state.
Example:
Attempt 1 fails for some reason that is most easily diagnosed from the logs. The task instance is marked up for retry. While the task instance is in the up-for-retry state, the logs for the previous attempts are not fetched. The only way to access the logs during that period is to SSH into the worker and find the log file.
Does this use case make sense? We started only observing these problems after upgrading to 2.7.0 and higher.
This doesn't sound right. When a task fails (whether there are remaining retries or not), the logs should already be wherever they are going to be. Does your webserver have access to the shared volume?