airflow icon indicating copy to clipboard operation
airflow copied to clipboard

Fetch served logs also when task attempt is up for retry and no remote logs available

Open kahlstrm opened this issue 1 year ago • 3 comments

This PR is a continuation of #39177, as while testing it noticed that when the task's state is in TaskInstanceState.UP_FOR_RETRY, the logs for all attempts is unavailable, when the source is served logs.

Also noticed that the tests previously only covered the TaskInstanceState.RUNNING situation, whereas there was also TaskInstanceState.DEFERRED (and now TaskInstanceState.UP_FOR_RETRY). Changed the test to cover all three cases.

^ Add meaningful description above Read the Pull Request Guidelines for more information. In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed. In case of a new dependency, check compliance with the ASF 3rd Party License Policy. In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

kahlstrm avatar May 08 '24 17:05 kahlstrm

@kahlstrm can you fix the failing tests?

eladkal avatar May 20 '24 15:05 eladkal

@kahlstrm can you fix the failing tests?

Sure, didn't realize the tests failing are related ones, one sec.

kahlstrm avatar May 20 '24 16:05 kahlstrm

@eladkal fixed the tests, seems like the odd behavior with TaskInstanceState.DEFERRED was fixed, and me working around it in these tests were broken due to that.

kahlstrm avatar May 20 '24 19:05 kahlstrm

I tried to get this into 2.9.2 but the test was failing https://github.com/apache/airflow/actions/runs/9383343196/job/25837180680?pr=40050#step:7:3547

ephraimbuddy avatar Jun 05 '24 12:06 ephraimbuddy

I tried to get this into 2.9.2 but the test was failing https://github.com/apache/airflow/actions/runs/9383343196/job/25837180680?pr=40050#step:7:3547

Hmm yeah, I'd guess this has something to do with the flakiness with the TaskInstanceState.DEFERRED-state, as I encountered something similar while writing these tests. I can try to have a second look at some point.

kahlstrm avatar Jun 05 '24 12:06 kahlstrm

Filed a PR that simply removes the inconsistent test case for TaskInstanceState.DEFERRED, it can be found here https://github.com/apache/airflow/pull/40257 @ephraimbuddy

kahlstrm avatar Jun 15 '24 10:06 kahlstrm

@kahlstrm there are some issues with this code now that i'm looking into.

But here's a question for you. Why should we check served logs when the task is up for retry? I think it's reasonable to assume that when the task is in a running state, that the logs are where they are going to be -- either shared volume, or remote storage.

dstandish avatar Jul 31 '24 22:07 dstandish

@kahlstrm there are some issues with this code now that i'm looking into.

But here's a question for you. Why should we check served logs when the task is up for retry? I think it's reasonable to assume that when the task is in a running state, that the logs are where they are going to be -- either shared volume, or remote storage.

The issue we encountered is that when the state of the task instance state was up for retry, i.e. previous attempt had failed, the logs for the previous attempt weren't available for the time period when the task instance was in up for retry state.

Example:

Attempt 1 fails for some reason which is easiest visible in the logs. New Task instance is marked up for retry. During the time task instance is in up for retry state, the previous logs for the previous task attempts are not being fetched. The way to access the logs for that period of time is to ssh into the worker and find the log file.

Does this use case make sense? We started only observing these problems after upgrading to 2.7.0 and higher.

kahlstrm avatar Jul 31 '24 23:07 kahlstrm

The issue we encountered is that when the state of the task instance state was up for retry, i.e. previous attempt had failed, the logs for the previous attempt weren't available for the time period when the task instance was in up for retry state.

This doesn't sound right. When a task fails (whether there are remaing retries or not), the logs should be wherever they are going to be. Does your webserver have access to the shared volume?

dstandish avatar Aug 01 '24 21:08 dstandish