
WIP: [Resolve OOM When Reading Large Logs in Webserver] Refactor to Use K-Way Merge for Log Streams Instead of Sorting Entire Log Records

Open jason810496 opened this issue 1 year ago • 1 comments

related: #45079



jason810496 avatar Dec 21 '24 06:12 jason810496

Rebased after we fixed main issue

potiuk avatar Dec 21 '24 08:12 potiuk

CI is failing due to: Please ask the maintainer to assign the 'legacy api' label to the PR in order to continue.

Since the get_log endpoint in both the legacy API and FastAPI uses the read_log_chunks method, it’s necessary to fix the endpoints and their corresponding tests.

jason810496 avatar Dec 23 '24 12:12 jason810496

Applied and closed/reopened to trigger the build

potiuk avatar Dec 23 '24 13:12 potiuk

Fix the provider tests that explicitly use the read or _read methods.

jason810496 avatar Dec 25 '24 04:12 jason810496

Finally fixed the tests!

This is the first (and likely the largest) PR for resolving OOM issues when reading large logs in the webserver. Further PRs will only focus on refactoring each provider, as listed in the TODO tasks in #45079.
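For readers unfamiliar with the K-way merge named in the PR title, here is a minimal sketch of the idea, not the code from this PR. It assumes each log source is already sorted by timestamp and yields hypothetical `(timestamp, message)` tuples; `merge_log_streams`, `worker_logs`, and `trigger_logs` are illustrative names only.

```python
import heapq
from datetime import datetime
from typing import Iterable, Iterator, Tuple

# Hypothetical record shape for illustration: (timestamp, message).
LogRecord = Tuple[datetime, str]


def merge_log_streams(*streams: Iterable[LogRecord]) -> Iterator[LogRecord]:
    """Lazily merge streams that are each already sorted by timestamp."""
    # heapq.merge holds roughly one pending record per stream,
    # instead of loading and sorting every record at once.
    return heapq.merge(*streams, key=lambda record: record[0])


def worker_logs() -> Iterator[LogRecord]:
    yield (datetime(2025, 1, 1, 0, 0, 1), "worker: task started")
    yield (datetime(2025, 1, 1, 0, 0, 3), "worker: task finished")


def trigger_logs() -> Iterator[LogRecord]:
    yield (datetime(2025, 1, 1, 0, 0, 2), "triggerer: waiting")


# Collected into a list here only to show the resulting order.
messages = [message for _, message in merge_log_streams(worker_logs(), trigger_logs())]
```

Memory use then scales with the number of streams rather than the total number of log lines, which is the property that matters for the OOM case.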

Even though the providers haven't yet been refactored to support stream-based log reading, the compatibility utility will transform the old read log method (which returns the entire list of logs) into a stream-based approach. Once all providers are refactored to use stream-based reading, the compatibility utility can be removed.
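As a rough illustration of what such a compatibility utility could look like (an assumption about its shape, not the actual helper in this PR), a legacy reader that returns the whole list can be wrapped in a generator so callers only ever see a stream. `as_log_stream` and `legacy_read` are hypothetical names.

```python
from typing import Callable, Iterator, List


def as_log_stream(read_all: Callable[[], List[str]]) -> Iterator[str]:
    """Adapt a legacy reader that returns a full list into a line-by-line stream."""
    # The legacy reader still builds the whole list in memory; this adapter
    # only unifies the interface until the provider is refactored.
    for line in read_all():
        yield line


def legacy_read() -> List[str]:
    # Stand-in for an old provider method that returns every log line at once.
    return ["line 1", "line 2", "line 3"]


stream = as_log_stream(legacy_read)
first_line = next(stream)
```

Because the adapter is a generator, the legacy reader is not called until the first line is requested, and the adapter can be dropped once a provider yields lines natively.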

For the testing part:
Since the CI will run provider compatibility tests for versions 2.9.3 and 2.10.3, my approach is to copy the old test cases related to log reading into new stream-based tests. I’ve added the mark_test_for_old_read_log_method and mark_test_for_stream_based_read_log_method pytest decorators to selectively skip the corresponding test runs. From my perspective, this approach is simpler and minimizes changes to the original test logic. Additionally, tests marked with mark_test_for_old_read_log_method can be safely removed once all providers migrate to stream-based reading.
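A sketch of how the two markers might be built on `pytest.mark.skipif` follows. The marker names come from the comment above, but the bodies are an assumption: `STREAM_BASED_READ_SUPPORTED` is a hypothetical flag that a real suite would derive from the installed Airflow version.

```python
import pytest

# Hypothetical flag; a real suite would compute this from the Airflow version under test.
STREAM_BASED_READ_SUPPORTED = True

# Skip legacy tests when the stream-based reader is available.
mark_test_for_old_read_log_method = pytest.mark.skipif(
    STREAM_BASED_READ_SUPPORTED,
    reason="legacy list-based read is only exercised on older Airflow versions",
)

# Skip stream-based tests when the reader is not available.
mark_test_for_stream_based_read_log_method = pytest.mark.skipif(
    not STREAM_BASED_READ_SUPPORTED,
    reason="stream-based read is not available on this Airflow version",
)


@mark_test_for_old_read_log_method
def test_read_returns_list():
    logs = ["line 1", "line 2"]
    assert isinstance(logs, list)


@mark_test_for_stream_based_read_log_method
def test_read_returns_stream():
    stream = iter(["line 1", "line 2"])
    assert next(stream) == "line 1"
```

With this layout, deleting the legacy tests later amounts to removing everything carrying the first marker.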

jason810496 avatar Dec 26 '24 06:12 jason810496

Rebased onto the latest main; waiting for review.

jason810496 avatar Jan 01 '25 16:01 jason810496

@jason810496 I rebased it. We found an issue with @jscheffl in the new caching scheme, fixed in https://github.com/apache/airflow/pull/45347, that would run the "main" version of the tests.

potiuk avatar Jan 02 '25 12:01 potiuk

Hi @dstandish,

Hope you're doing well, and Happy New Year! Would you mind taking a look at this PR when you have a moment? Thanks!

jason810496 avatar Jan 05 '25 07:01 jason810496

Just rebased onto the latest main; no other changes.

jason810496 avatar Jan 13 '25 11:01 jason810496

@dstandish @ashb - can we merge it? This one seems like a good candidate for 2.10.5.

potiuk avatar Jan 25 '25 21:01 potiuk

It is really only waiting for your review/approval, and it solves a real issue.

potiuk avatar Jan 25 '25 21:01 potiuk

Hi @potiuk, may I ask whether this PR is considered a refactor for 3.0 or 2.10? I saw the 3.0 feature freeze mentioned in the dev list, so I’m not sure which version this PR will be counted for.

Here is the related discussion on Slack: https://apache-airflow.slack.com/archives/CCZRF2U5A/p1736767159693839

jason810496 avatar Feb 08 '25 12:02 jason810496

Hi @dstandish @ashb, hope you're doing well! Could you please review this PR when you have some time? Thanks! 🙏

jason810496 avatar Feb 17 '25 01:02 jason810496

I'm still halfway through reviewing it. I left a few nitpicks, but the PR is great, to be honest.

Thanks, @Lee-W, for reviewing! I’ve just resolved those nits.

The CI failure is due to a flaky test:

FAILED tests/operators/test_trigger_dagrun.py::TestDagRunOperator::test_trigger_dagrun - AssertionError: assert equals failed
  '2025-02-18T08:18:13'  '2025-02-18T08:18:14'

jason810496 avatar Feb 18 '25 09:02 jason810496

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Apr 05 '25 00:04 github-actions[bot]

Since the TaskHandler logger is being migrated to structlog, I will create another PR for the refactor instead of resolving conflicts on this one (there have been too many code changes and conflicts on this path recently).

jason810496 avatar Apr 07 '25 10:04 jason810496

Since the TaskHandler logger is being migrated to structlog, I will create another PR for the refactor instead of resolving conflicts on this one (there have been too many code changes and conflicts on this path recently).

If that's the case, maybe we could mark this as draft or close and create a new one instead?

Lee-W avatar Apr 08 '25 09:04 Lee-W

Closing this PR since it has been superseded by:

  • #49470 (targets 3.0+)
  • #45914 (targets 2.11)

jason810496 avatar Apr 29 '25 12:04 jason810496