Join behaves randomly (needs hash join/auto-sort)
For a reproducibility: I'm using zq Version: v1.0.0 and this data: https://gist.github.com/fyyying/4aa5b471860321d7b47fd881898162b7
If you run the code with the following query
fork (
=> tot:=count() by Pclass
=> cnt:=count() by Pclass, Survived
)
| right join on Pclass=Pclass tot
zq behaves randomly, at each invocation joining different rows, on subsequent calls returning outputs like:
{Pclass:3.,Survived:0.,cnt:372(uint64),tot:491(uint64)}
{Pclass:1.,Survived:1.,cnt:136(uint64),tot:216(uint64)}
{Pclass:3.,Survived:1.,cnt:119(uint64)}
{Pclass:1.,Survived:0.,cnt:80(uint64),tot:216(uint64)}
{Pclass:2.,Survived:1.,cnt:87(uint64)}
{Pclass:2.,Survived:0.,cnt:97(uint64)}
or
{Pclass:3.,Survived:1.,cnt:119(uint64),tot:491(uint64)}
{Pclass:1.,Survived:0.,cnt:80(uint64),tot:216(uint64)}
{Pclass:2.,Survived:1.,cnt:87(uint64),tot:184(uint64)}
{Pclass:2.,Survived:0.,cnt:97(uint64),tot:184(uint64)}
{Pclass:3.,Survived:0.,cnt:372(uint64)}
{Pclass:1.,Survived:1.,cnt:136(uint64)}
Thanks @twolodzko
The join operator currently requires its inputs to be sorted on the join key. It's still a bit of a work in progress and the docs should mark this more clearly. Yes, this is clunky and we've intended for a while to fix it.
The aggregation outputs are (by design) not deterministic and require an explicit sort to force a deterministic order.
So if you add the sorts you should get correct join behavior, e.g.,
fork (
=> tot:=count() by Pclass | sort Pclass
=> cnt:=count() by Pclass, Survived | sort Pclass
)
| right join on Pclass=Pclass tot | sort tot,cnt
I'll leave this issue open for now as we should go through the backlog of issues relating to join, clean them up, and figure out our plan.
It's time we fix this stuff. The first step will be fix join so it doesn't require an explicit sort and uses a hash join by default then switches to merge join when a mem limit is hit.
A community zync user recently pointed out that the lack of auto-sort on join "blocks everyone [on their team] when they learn Zed".
Update: Another new member of the team of the community zync user cited in the prior comment just bumped into this problem yet again.
The linked PR #4770 has addressed the "auto-sort" portion of this issue. Below is verification of this as of Zed commit d938fba.
Using the CSV data from the Gist linked above and the following Zed program join.zed:
fork (
=> tot:=count() by Pclass
=> cnt:=count() by Pclass, Survived
)
| right join on Pclass=Pclass tot
The output of the following is now correct & consistent every time instead of returning random different results as it did before.
$ zq -version
Version: v1.9.0-26-gd938fba1
$ zq -i csv -I join.zed '| sort tot,cnt' titanic_dataset.csv
{Pclass:2.,Survived:1.,cnt:87(uint64),tot:184(uint64)}
{Pclass:2.,Survived:0.,cnt:97(uint64),tot:184(uint64)}
{Pclass:1.,Survived:0.,cnt:80(uint64),tot:216(uint64)}
{Pclass:1.,Survived:1.,cnt:136(uint64),tot:216(uint64)}
{Pclass:3.,Survived:1.,cnt:119(uint64),tot:491(uint64)}
{Pclass:3.,Survived:0.,cnt:372(uint64),tot:491(uint64)}
I've also confirmed with the Dev team that the enhancement works such that if a branch of input data was pulled from a Zed lake such that the data is known to already have been sorted in the appropriate direction, no redundant sort is performed.
The updated Join Tutorial docs show all the examples now without the explicit sort that used to be necessary (contrast with the Join Tutorial for Zed v1.9.0 when they were still necessary). All of these examples are run as part of CI, so this is further reflection that the enhancement is working as intended.
This issue will remain open to track when we add the additional enhancement to do hash join as well.
Thanks @mattnibs!