cube icon indicating copy to clipboard operation
cube copied to clipboard

CubeStore: wrong query result when filtering two left-joined tables

Open antoninkrotky opened this issue 4 years ago • 4 comments

Describe the bug CubeStore returns wrong result set when filtering on two left-joined tables.

To Reproduce

create table dev_pre_aggregations.employee (name varchar);
insert into dev_pre_aggregations.employee (name) values ('John');
insert into dev_pre_aggregations.employee (name) values ('Jim');

create table dev_pre_aggregations.employee_department_bridge (employee_name varchar, department_name varchar);
insert into dev_pre_aggregations.employee_department_bridge (employee_name,department_name) values ('John','Marketing');
insert into dev_pre_aggregations.employee_department_bridge (employee_name,department_name) values ('Jim','Marketing');

select * 
from dev_pre_aggregations.employee AS e 
LEFT JOIN dev_pre_aggregations.employee_department_bridge b on (b.employee_name=e.name)
where b.department_name='Non existing';

--Empty result set expected, but we get:
+------+---------------+-----------------+
| name | employee_name | department_name |
+------+---------------+-----------------+
| Jim  | NULL          | NULL            |
| John | NULL          | NULL            |
+------+---------------+-----------------+
2 rows in set (0.00 sec)

The example above is enough to reproduce this bug. To give a full picture - we found this bug when joining 3 tables but then were able to simplify the scenario. Add this code to see our original scenario:

create table dev_pre_aggregations.department (name varchar);
insert into dev_pre_aggregations.department (name) values ('Marketing');
insert into dev_pre_aggregations.department (name) values ('Finance');
insert into dev_pre_aggregations.department (name) values ('IT');

select * from
    dev_pre_aggregations.employee AS e
        LEFT JOIN dev_pre_aggregations.employee_department_bridge b on (b.employee_name=e.name)
        LEFT JOIN dev_pre_aggregations.department d on (b.department_name=d.name) where d.name = 'Non existing';

--Empty result set expected, but we get:
+------+---------------+-----------------+------+
| name | employee_name | department_name | name |
+------+---------------+-----------------+------+
| Jim  | NULL          | NULL            | NULL |
| John | NULL          | NULL            | NULL |
+------+---------------+-----------------+------+

Expected behavior 0 rows returned

Screenshots None

Version: v0.28.64

Additional context Discussed in Slack https://cube-js.slack.com/archives/CC0403RRR/p1638799039492100 Confirmed there by @igorlukanin

antoninkrotky avatar Dec 06 '21 21:12 antoninkrotky

See this comment in arrow-datafusion: https://github.com/apache/arrow-datafusion/pull/1339#issuecomment-982930856

kaja78 avatar Dec 08 '21 07:12 kaja78

Hey @kaja78 ! Thanks for posting this comment! This is helpful. Yeah. It can be related to this optimization.

paveltiunov avatar Dec 09 '21 04:12 paveltiunov

It appears that the underlying Arrow issue has been resolved ... are there plans to update the packaged arrow codebase in cube to include this fix? https://github.com/apache/arrow-datafusion/pull/1618

mjstephenson72 avatar Jun 21 '22 14:06 mjstephenson72

@paveltiunov has this been incorporated into the cube js fork?

javiramos1 avatar Aug 01 '22 12:08 javiramos1