GH-34235: [Python] Add `join_asof` binding
- Closes: #34235
Thanks for opening a pull request!
If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.
Then could you also rename the pull request title in the following format?
GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
or
MINOR: [${COMPONENT}] ${SUMMARY}
In the case of PARQUET issues on JIRA the title also supports:
PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
See also:
- Closes: #34235
:warning: GitHub issue #34235 has been automatically assigned in GitHub to PR creator.
I'm struggling to run the tests locally as I can't get Arrow to build on an M1 Mac.
Undefined symbols for architecture arm64:
"testing::Matcher<std::__1::basic_string_view<char, std::__1::char_traits<char> > const&>::Matcher(char const*)", referenced from:
testing::Matcher<std::__1::basic_string_view<char, std::__1::char_traits<char> > const&> testing::internal::MatcherCastImpl<std::__1::basic_string_view<char, std::__1::char_traits<char> > const&, char const*>::CastImpl<true>(char const* const&, std::__1::integral_constant<bool, true>, std::__1::integral_constant<bool, true>) in array_test.cc.o
testing::Matcher<std::__1::basic_string_view<char, std::__1::char_traits<char> > const&> testing::internal::MatcherCastImpl<std::__1::basic_string_view<char, std::__1::char_traits<char> > const&, char const*>::CastImpl<true>(char const* const&, std::__1::integral_constant<bool, true>, std::__1::integral_constant<bool, true>) in array_binary_test.cc.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [src/arrow/CMakeFiles/arrow-array-test.dir/build.make:208: build/debug/arrow-array-test] Error 1
make[1]: *** [CMakeFiles/Makefile2:1714: src/arrow/CMakeFiles/arrow-array-test.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 54%] Built target arrow-public-api-test
I'm struggling to run the tests locally as I can't get Arrow to build on an M1 Mac.
Can you try adding -DGTest_SOURCE="BUNDLED" CMake flag? See https://github.com/apache/arrow/issues/14917
@AlenkaF I believe that this is ready for a round of review when you have time.
Hi @judahrand , thank you for your contribution! ⭐
The code LGTM 👍 I would maybe add tests to test_table.py also, as in https://github.com/apache/arrow/pull/12452/files#diff-72bd29ba764c85f18fbf9e74898b72d47ed8a1b04be879c1a5b2a59382e2eaef
Hi @judahrand , thank you for your contribution! ⭐
The code LGTM 👍 I would maybe add tests to
test_table.pyalso, as in https://github.com/apache/arrow/pull/12452/files#diff-72bd29ba764c85f18fbf9e74898b72d47ed8a1b04be879c1a5b2a59382e2eaef
Done
The Appveyor failure looks like a timeout?
Yeah, I see the Appveyor build is failing on other PRs also.
@AlenkaF Do you need me to do anything else on this PR?
Just one more thing: could you rebase to latest master to get AppVeyor working? (related issue was closed: https://github.com/apache/arrow/issues/34296)
@judahrand thanks for working on this!
Before diving into the details, I have two general comments:
-
We are in the middle of refactoring how we expose the Acero / ExecPlan features, and I have a PR that exposes the Declaration object and ExecNodeOptions subclasses in Python (https://github.com/apache/arrow/pull/34102). Once that is merged, it should be the goal that also the asof join could be exposed by adding an
AsofJoinNodeOptionsclass in pyarrow. Ideally, I would prefer that we can do this directly, but I know that the mentioned PR isn't merged yet (I hope it can be merged in one of the coming days, though). Anyway, most of the work you are doing in this PR is needed anyway (public API, docs, tests, etc), it's only the part in _exec_plan.pyx that would change a bit. -
I think we should try to do a better job of explaining what the "asof" exactly is and does in the docstring (I also noted that AsOfJoin is missing in the C++ user guide (will open an issue about that), although it has some reference docs), since I think this is generally not a very well known join type: what is the difference with a normal join? What is the difference between the "on" and "by" join keys?
@judahrand I merged https://github.com/apache/arrow/pull/34102, so you should be able to add a AsofJoinNodeOptions subclass, and then update _perform_join_asof to use that options class together with Declaration (similar as I am doing in https://github.com/apache/arrow/pull/3440)
Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍
@judahrand to be clear, there is still interest in this!
I'll try and find the time to push this through 🙏
Hey @judahrand! would you consider moving this PR to draft mode to signify changes are still needed?
@jorisvandenbossche Sorry this took me so long to revisit! I think I've dealt with all of your comments?
@jorisvandenbossche Sorry this took me so long to revisit! I think I've dealt with all of your comments?
Merged main in which should fix the appveyor test failure.
Thanks!
AppVeyor failure is not connected - seems to be failing elsewhere also. Opened https://github.com/apache/arrow/issues/38431
I have re-run C GLib & Ruby job, don't think the failure is connected though.
Otherwise this looks good to me 👍 @jorisvandenbossche would you like to have a look before we merge?
@jorisvandenbossche Do you think you might be able to take another look at this? I know I've been quite slow at pushing this PR forward but I think it potentially ready to be merged now.
@jorisvandenbossche @AlenkaF ping?
@jorisvandenbossche @AlenkaF ping?
@jorisvandenbossche I believe I've dealt will all the feedback 😄
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 681be03cfc63fbdda1e3a445d38f20b0434cb1c2.
There were 9 benchmark results indicating a performance regression:
- Commit Run on
ursa-i9-9960xat 2024-03-16 07:11:19Z - and 7 more (see the report linked below)
The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.