arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-34235: [Python] Add `join_asof` binding

Open judahrand opened this issue 2 years ago • 25 comments

  • Closes: #34235

judahrand avatar Feb 17 '23 12:02 judahrand

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

github-actions[bot] avatar Feb 17 '23 12:02 github-actions[bot]

  • Closes: #34235

github-actions[bot] avatar Feb 17 '23 12:02 github-actions[bot]

:warning: GitHub issue #34235 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar Feb 17 '23 12:02 github-actions[bot]

I'm struggling to run the tests locally as I can't get Arrow to build on an M1 Mac.

Undefined symbols for architecture arm64:
  "testing::Matcher<std::__1::basic_string_view<char, std::__1::char_traits<char> > const&>::Matcher(char const*)", referenced from:
      testing::Matcher<std::__1::basic_string_view<char, std::__1::char_traits<char> > const&> testing::internal::MatcherCastImpl<std::__1::basic_string_view<char, std::__1::char_traits<char> > const&, char const*>::CastImpl<true>(char const* const&, std::__1::integral_constant<bool, true>, std::__1::integral_constant<bool, true>) in array_test.cc.o
      testing::Matcher<std::__1::basic_string_view<char, std::__1::char_traits<char> > const&> testing::internal::MatcherCastImpl<std::__1::basic_string_view<char, std::__1::char_traits<char> > const&, char const*>::CastImpl<true>(char const* const&, std::__1::integral_constant<bool, true>, std::__1::integral_constant<bool, true>) in array_binary_test.cc.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [src/arrow/CMakeFiles/arrow-array-test.dir/build.make:208: build/debug/arrow-array-test] Error 1
make[1]: *** [CMakeFiles/Makefile2:1714: src/arrow/CMakeFiles/arrow-array-test.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 54%] Built target arrow-public-api-test

judahrand avatar Feb 17 '23 12:02 judahrand

I'm struggling to run the tests locally as I can't get Arrow to build on an M1 Mac.

Can you try adding -DGTest_SOURCE="BUNDLED" CMake flag? See https://github.com/apache/arrow/issues/14917

AlenkaF avatar Feb 20 '23 05:02 AlenkaF

@AlenkaF I believe that this is ready for a round of review when you have time.

judahrand avatar Feb 20 '23 11:02 judahrand

Hi @judahrand , thank you for your contribution! ⭐

The code LGTM 👍 I would maybe add tests to test_table.py also, as in https://github.com/apache/arrow/pull/12452/files#diff-72bd29ba764c85f18fbf9e74898b72d47ed8a1b04be879c1a5b2a59382e2eaef

AlenkaF avatar Feb 20 '23 15:02 AlenkaF

Hi @judahrand , thank you for your contribution! ⭐

The code LGTM 👍 I would maybe add tests to test_table.py also, as in https://github.com/apache/arrow/pull/12452/files#diff-72bd29ba764c85f18fbf9e74898b72d47ed8a1b04be879c1a5b2a59382e2eaef

Done

judahrand avatar Feb 20 '23 16:02 judahrand

The Appveyor failure looks like a timeout?

judahrand avatar Feb 21 '23 09:02 judahrand

Yeah, I see the Appveyor build is failing on other PRs also.

AlenkaF avatar Feb 21 '23 09:02 AlenkaF

@AlenkaF Do you need me to do anything else on this PR?

judahrand avatar Feb 27 '23 10:02 judahrand

Just one more thing: could you rebase to latest master to get AppVeyor working? (related issue was closed: https://github.com/apache/arrow/issues/34296)

AlenkaF avatar Feb 27 '23 10:02 AlenkaF

@judahrand thanks for working on this!

Before diving into the details, I have two general comments:

  • We are in the middle of refactoring how we expose the Acero / ExecPlan features, and I have a PR that exposes the Declaration object and ExecNodeOptions subclasses in Python (https://github.com/apache/arrow/pull/34102). Once that is merged, it should be the goal that also the asof join could be exposed by adding an AsofJoinNodeOptions class in pyarrow. Ideally, I would prefer that we can do this directly, but I know that the mentioned PR isn't merged yet (I hope it can be merged in one of the coming days, though). Anyway, most of the work you are doing in this PR is needed anyway (public API, docs, tests, etc), it's only the part in _exec_plan.pyx that would change a bit.

  • I think we should try to do a better job of explaining what the "asof" exactly is and does in the docstring (I also noted that AsOfJoin is missing in the C++ user guide (will open an issue about that), although it has some reference docs), since I think this is generally not a very well known join type: what is the difference with a normal join? What is the difference between the "on" and "by" join keys?

jorisvandenbossche avatar Feb 27 '23 11:02 jorisvandenbossche

@judahrand I merged https://github.com/apache/arrow/pull/34102, so you should be able to add a AsofJoinNodeOptions subclass, and then update _perform_join_asof to use that options class together with Declaration (similar as I am doing in https://github.com/apache/arrow/pull/3440)

jorisvandenbossche avatar Mar 03 '23 12:03 jorisvandenbossche

Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍

amol- avatar Mar 30 '23 17:03 amol-

@judahrand to be clear, there is still interest in this!

jorisvandenbossche avatar Mar 30 '23 17:03 jorisvandenbossche

I'll try and find the time to push this through 🙏

judahrand avatar Apr 05 '23 21:04 judahrand

Hey @judahrand! would you consider moving this PR to draft mode to signify changes are still needed?

danepitkin avatar Jun 06 '23 15:06 danepitkin

@jorisvandenbossche Sorry this took me so long to revisit! I think I've dealt with all of your comments?

judahrand avatar Sep 08 '23 14:09 judahrand

@jorisvandenbossche Sorry this took me so long to revisit! I think I've dealt with all of your comments?

Merged main in which should fix the appveyor test failure.

judahrand avatar Sep 20 '23 07:09 judahrand

Thanks!

AppVeyor failure is not connected - seems to be failing elsewhere also. Opened https://github.com/apache/arrow/issues/38431

I have re-run C GLib & Ruby job, don't think the failure is connected though.

Otherwise this looks good to me 👍 @jorisvandenbossche would you like to have a look before we merge?

AlenkaF avatar Oct 24 '23 06:10 AlenkaF

@jorisvandenbossche Do you think you might be able to take another look at this? I know I've been quite slow at pushing this PR forward but I think it potentially ready to be merged now.

judahrand avatar Nov 13 '23 11:11 judahrand

@jorisvandenbossche @AlenkaF ping?

judahrand avatar Jan 05 '24 15:01 judahrand

@jorisvandenbossche @AlenkaF ping?

judahrand avatar Feb 28 '24 16:02 judahrand

@jorisvandenbossche I believe I've dealt will all the feedback 😄

judahrand avatar Mar 04 '24 12:03 judahrand

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 681be03cfc63fbdda1e3a445d38f20b0434cb1c2.

There were 9 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.