Optimization: Redundant join does not work if projects are inconsistently absorbed by reduces.

Open wangandi opened this issue 4 years ago • 1 comments

What version of Materialize are you using?

latest main

How did you install Materialize?

Linux release tarball

What was the issue?

Discovered by @andrioni. See the Slack thread here.

materialize=> explain select * from t1 where t1.f1 in (select f1 from t2);
          Optimized Plan
-----------------------------------
 %0 =                             +
 | Get materialize.public.t1 (u80)+
 | ArrangeBy (#0)                 +
                                  +
 %1 =                             +
 | Get materialize.public.t1 (u80)+
 | Filter !(isnull(#0))           +
 | Project (#0)                   +
 | Distinct group=(#0)            +
 | ArrangeBy (#0)                 +
                                  +
 %2 =                             +
 | Get materialize.public.t2 (u73)+
 | Filter !(isnull(#0))           +
 | Distinct group=(#0)            +
 | ArrangeBy (#0)                 +
                                  +
 %3 =                             +
 | Join %0 %1 %2 (= #0 #2 #3)     +
 | | implementation = DeltaQuery  +
 | |   delta %0 %1.(#0) %2.(#0)   +
 | |   delta %1 %2.(#0) %0.(#0)   +
 | |   delta %2 %1.(#0) %0.(#0)   +
 | Project (#0, #1)               +

I wrote a unit test confirming that RedundantJoin cannot detect that | Project (#0) | Distinct (#0) is equivalent to | Distinct (#0).

If you want to run it for yourself, copy it into the transform unit tests.

build apply=RedundantJoin
(join
    [(get x)
    (reduce
        (project
            (filter
                (get x)
                [(call_unary not (call_unary is_null #0))]
            )
            [0]
        )
        [#0] []
    )
    (reduce
        (filter
            (get x)
            [(call_unary not (call_unary is_null #0))]
        )
        [#0] [])]
    [[#0 #2 #3]]
)
----

Rather than teaching RedundantJoin that | Project (#0) | Distinct (#0) is equivalent to | Distinct (#0), we should probably:

  1. ensure that all projects get absorbed by reduces in logical planning.
  2. spawn projects out of reduce during physical planning.
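
To illustrate step 1, here is a minimal sketch of absorbing a project into the reduce above it. It uses a toy `Expr` enum, not Materialize's actual relation expression type, and the names `absorb_projects`, `plan_with_project`, and `plan_without_project` are hypothetical helpers made up for this example. The idea is that the distinct's group key is remapped through the projection, after which the two join inputs from the unit test become structurally identical and a plain equality check would suffice.

```rust
// Toy relation expression for illustration only; this is NOT Materialize's
// real plan representation.
#[derive(Clone, Debug, PartialEq, Eq)]
enum Expr {
    Get(String),
    // Keeps rows where `column` is not null.
    FilterNotNull { input: Box<Expr>, column: usize },
    // Emits the input columns listed in `outputs`, in that order.
    Project { input: Box<Expr>, outputs: Vec<usize> },
    // A reduce with no aggregates: distinct over the `group_key` columns.
    Distinct { input: Box<Expr>, group_key: Vec<usize> },
}

/// Rewrites `Distinct(Project(inner))` into `Distinct(inner)` by mapping each
/// group key column through the projection. Other nodes are only recursed into.
fn absorb_projects(expr: Expr) -> Expr {
    match expr {
        Expr::Distinct { input, group_key } => match absorb_projects(*input) {
            Expr::Project { input: inner, outputs } => Expr::Distinct {
                input: inner,
                // Column k of the project's output is column outputs[k] of `inner`.
                group_key: group_key.iter().map(|&k| outputs[k]).collect(),
            },
            other => Expr::Distinct { input: Box::new(other), group_key },
        },
        Expr::Project { input, outputs } => Expr::Project {
            input: Box::new(absorb_projects(*input)),
            outputs,
        },
        Expr::FilterNotNull { input, column } => Expr::FilterNotNull {
            input: Box::new(absorb_projects(*input)),
            column,
        },
        get @ Expr::Get(_) => get,
    }
}

// Shared subplan: (filter (get x) [not (is_null #0)]).
fn filtered_x() -> Expr {
    Expr::FilterNotNull { input: Box::new(Expr::Get("x".to_string())), column: 0 }
}

// Second join input from the unit test: | Project (#0) | Distinct (#0).
fn plan_with_project() -> Expr {
    Expr::Distinct {
        input: Box::new(Expr::Project { input: Box::new(filtered_x()), outputs: vec![0] }),
        group_key: vec![0],
    }
}

// Third join input from the unit test: | Distinct (#0).
fn plan_without_project() -> Expr {
    Expr::Distinct { input: Box::new(filtered_x()), group_key: vec![0] }
}
```

Before the rewrite, `plan_with_project()` and `plan_without_project()` compare unequal. After `absorb_projects`, both reduce to the same expression. Step 2 (spawning projects back out of reduces during physical planning) is not covered by this sketch.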

Relevant log output

No response

wangandi · Mar 25 '22 23:03