CubeStore: wrong query result when filtering two left-joined tables
Describe the bug CubeStore returns wrong result set when filtering on two left-joined tables.
To Reproduce
create table dev_pre_aggregations.employee (name varchar);
insert into dev_pre_aggregations.employee (name) values ('John');
insert into dev_pre_aggregations.employee (name) values ('Jim');
create table dev_pre_aggregations.employee_department_bridge (employee_name varchar, department_name varchar);
insert into dev_pre_aggregations.employee_department_bridge (employee_name,department_name) values ('John','Marketing');
insert into dev_pre_aggregations.employee_department_bridge (employee_name,department_name) values ('Jim','Marketing');
select *
from dev_pre_aggregations.employee AS e
LEFT JOIN dev_pre_aggregations.employee_department_bridge b on (b.employee_name=e.name)
where b.department_name='Non existing';
--Empty result set expected, but we get:
+------+---------------+-----------------+
| name | employee_name | department_name |
+------+---------------+-----------------+
| Jim | NULL | NULL |
| John | NULL | NULL |
+------+---------------+-----------------+
2 rows in set (0.00 sec)
The example above is enough to reproduce this bug. To give a full picture - we found this bug when joining 3 tables but then were able to simplify the scenario. Add this code to see our original scenario:
create table dev_pre_aggregations.department (name varchar);
insert into dev_pre_aggregations.department (name) values ('Marketing');
insert into dev_pre_aggregations.department (name) values ('Finance');
insert into dev_pre_aggregations.department (name) values ('IT');
select * from
dev_pre_aggregations.employee AS e
LEFT JOIN dev_pre_aggregations.employee_department_bridge b on (b.employee_name=e.name)
LEFT JOIN dev_pre_aggregations.department d on (b.department_name=d.name) where d.name = 'Non existing';
--Empty result set expected, but we get:
+------+---------------+-----------------+------+
| name | employee_name | department_name | name |
+------+---------------+-----------------+------+
| Jim | NULL | NULL | NULL |
| John | NULL | NULL | NULL |
+------+---------------+-----------------+------+
Expected behavior 0 rows returned
Screenshots None
Version:
v0.28.64
Additional context Discussed in Slack https://cube-js.slack.com/archives/CC0403RRR/p1638799039492100 Confirmed there by @igorlukanin
See this comment in arrow-datafusion: https://github.com/apache/arrow-datafusion/pull/1339#issuecomment-982930856
Hey @kaja78 ! Thanks for posting this comment! This is helpful. Yeah. It can be related to this optimization.
It appears that the underlying Arrow issue has been resolved ... are there plans to update the packaged arrow codebase in cube to include this fix? https://github.com/apache/arrow-datafusion/pull/1618
@paveltiunov has this been incorporated into the cube js fork?