datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Optimize set-associative nested function with Eq Kernel

Open jayzhan211 opened this issue 1 year ago • 1 comments

Is your feature request related to a problem or challenge?

We found that computing with Eq Kernel is much faster than RowConverter for array_has

There are other functions that have potential to further speedup with Eq Kernel.

  • [ ] array_has_all
  • [ ] array_has_any
  • [ ] array_intersect
  • [ ] array_distinct (Maybe 🤔 ?)
  • [ ] array_except (Maybe 🤔 ?)

Describe the solution you'd like

The overall idea is that we flatten the left hand side of the list and iterate right hand side elements, apply Eq kernel for each element. #12062 We could know whether the element is in left hand side of the list by checking the true_count (and null_count for null handling). The eq kernel is vectorized, that is the key of speedup.

Example

array_has_all

array_has_all([1,2,3], [1,2]) -> true Iterate [1,2], compare [1,2,3] with 1 and [1,2,3] with 2. We will get [true,false,false] and [false,true,false]. Both boolean array contains true, therefore return true

array_has_any

array_has_any([1,2,3], [1,4]) -> true Iterate [1,2], compare [1,2,3] with 1 and [1,2,3] with 4. We will get [true,false,false] and [false,false,false]. Since first boolean array contains true, therefore return true

array_has_intersect

array_has_intersect([1,2,3], [1,2]) -> [1,2] The same idea like above, we know that both element is contained in the list. The difference is that we expect to return Array. We could get the expected array with MutableArrayData

I'm not pretty sure about distinct and except, but worth to figure out as well

For more detail implementation could take array_has as reference

Describe alternatives you've considered

No response

Additional context

No response

jayzhan211 avatar Aug 26 '24 00:08 jayzhan211

Interesting. taking a look

dharanad avatar Aug 28 '24 19:08 dharanad