vortex icon indicating copy to clipboard operation
vortex copied to clipboard

[blocked] Switch to Utf8View for TPC-H

Open a10y opened this issue 1 year ago • 9 comments

Blocked on https://github.com/apache/datafusion/pull/11518 and https://github.com/apache/arrow-rs/pull/6077

a10y avatar Jul 17 '24 15:07 a10y

FWIW this is unblocked if you were to use master of datafusion

robert3005 avatar Jul 30 '24 13:07 robert3005

Initial TPC-H benchmarks comparison:

aduffy@DuffyProBook ~/c/vortex (aduffy/utf8view) [1]> critcmp develop-vortex utf8view-vortex
group                                develop-vortex                         utf8view-vortex
-----                                --------------                         ---------------
tpch_q1/vortex-pushdown-disabled     1.00    339.2±0.83ms        ? ?/sec    1.24    421.0±1.13ms        ? ?/sec
tpch_q10/vortex-pushdown-disabled    1.00    165.3±5.40ms        ? ?/sec    1.34    222.0±3.07ms        ? ?/sec
tpch_q11/vortex-pushdown-disabled    1.00    105.7±3.73ms        ? ?/sec    1.26   133.3±17.73ms        ? ?/sec
tpch_q12/vortex-pushdown-disabled    1.00    119.1±3.35ms        ? ?/sec    2.41    287.2±8.59ms        ? ?/sec
tpch_q13/vortex-pushdown-disabled    1.00    156.1±2.57ms        ? ?/sec    1.59   248.4±21.85ms        ? ?/sec
tpch_q14/vortex-pushdown-disabled    1.00     21.1±0.64ms        ? ?/sec    1.58     33.3±1.74ms        ? ?/sec
tpch_q16/vortex-pushdown-disabled    1.00     78.0±2.25ms        ? ?/sec    1.37   107.2±13.18ms        ? ?/sec
tpch_q17/vortex-pushdown-disabled    1.00    307.4±8.72ms        ? ?/sec    1.12   344.4±10.19ms        ? ?/sec
tpch_q18/vortex-pushdown-disabled    1.00   588.3±48.01ms        ? ?/sec    1.04   611.9±23.91ms        ? ?/sec
tpch_q19/vortex-pushdown-disabled    1.00     99.8±0.56ms        ? ?/sec    3.84    383.4±3.93ms        ? ?/sec
tpch_q2/vortex-pushdown-disabled     1.00     77.6±1.22ms        ? ?/sec    1.12     86.8±0.90ms        ? ?/sec
tpch_q20/vortex-pushdown-disabled    1.00    134.3±7.21ms        ? ?/sec    1.10    147.5±5.04ms        ? ?/sec
tpch_q21/vortex-pushdown-disabled    1.00   543.7±33.73ms        ? ?/sec    1.10   598.0±25.40ms        ? ?/sec
tpch_q22/vortex-pushdown-disabled    1.00     66.3±1.76ms        ? ?/sec    1.10     72.6±1.95ms        ? ?/sec
tpch_q3/vortex-pushdown-disabled     1.00     98.0±2.36ms        ? ?/sec    1.05    103.1±4.27ms        ? ?/sec
tpch_q4/vortex-pushdown-disabled     1.00     69.5±1.13ms        ? ?/sec    1.51    104.7±2.85ms        ? ?/sec
tpch_q5/vortex-pushdown-disabled     1.00    153.9±6.25ms        ? ?/sec    1.01    156.0±6.22ms        ? ?/sec
tpch_q6/vortex-pushdown-disabled     1.00     15.7±0.17ms        ? ?/sec    1.01     15.9±0.35ms        ? ?/sec
tpch_q7/vortex-pushdown-disabled     1.00    297.5±1.56ms        ? ?/sec    1.08    321.1±9.67ms        ? ?/sec
tpch_q8/vortex-pushdown-disabled     1.00    135.5±2.94ms        ? ?/sec    1.08    146.7±3.22ms        ? ?/sec
tpch_q9/vortex-pushdown-disabled     1.00   278.7±19.86ms        ? ?/sec    1.14    319.0±6.45ms        ? ?/sec

Going to profile some of the big boys, starting with q19

a10y avatar Jul 30 '24 18:07 a10y

Oh right, I forgot that into_canonical currently copies the world 🤦

image

should be an easy fix

a10y avatar Jul 30 '24 19:07 a10y

Alright, a bit warmer now:

$ critcmp develop-vortex utf8view-vortex-fixed-faster

group                                develop-vortex                         utf8view-vortex-fixed-faster
-----                                --------------                         ----------------------------
tpch_q1/vortex-pushdown-disabled     1.00    339.2±0.83ms        ? ?/sec    1.13    382.8±3.05ms        ? ?/sec
tpch_q10/vortex-pushdown-disabled    1.00    165.3±5.40ms        ? ?/sec    1.50    247.8±0.57ms        ? ?/sec
tpch_q11/vortex-pushdown-disabled    1.00    105.7±3.73ms        ? ?/sec    1.11    117.3±7.19ms        ? ?/sec
tpch_q12/vortex-pushdown-disabled    1.00    119.1±3.35ms        ? ?/sec    1.27    151.5±1.06ms        ? ?/sec
tpch_q13/vortex-pushdown-disabled    1.00    156.1±2.57ms        ? ?/sec    1.01    157.7±2.25ms        ? ?/sec
tpch_q14/vortex-pushdown-disabled    1.00     21.1±0.64ms        ? ?/sec    1.34     28.2±0.72ms        ? ?/sec
tpch_q16/vortex-pushdown-disabled    1.00     78.0±2.25ms        ? ?/sec    1.25     97.6±1.14ms        ? ?/sec
tpch_q17/vortex-pushdown-disabled    1.00    307.4±8.72ms        ? ?/sec    1.12    342.9±3.81ms        ? ?/sec
tpch_q18/vortex-pushdown-disabled    1.00   588.3±48.01ms        ? ?/sec    1.35   795.7±81.62ms        ? ?/sec
tpch_q19/vortex-pushdown-disabled    1.00     99.8±0.56ms        ? ?/sec    1.37    136.9±3.13ms        ? ?/sec
tpch_q2/vortex-pushdown-disabled     1.00     77.6±1.22ms        ? ?/sec    1.07     83.0±2.37ms        ? ?/sec
tpch_q20/vortex-pushdown-disabled    1.00    134.3±7.21ms        ? ?/sec    1.05    141.1±2.10ms        ? ?/sec
tpch_q21/vortex-pushdown-disabled    1.00   543.7±33.73ms        ? ?/sec    1.46   793.8±27.10ms        ? ?/sec
tpch_q22/vortex-pushdown-disabled    1.00     66.3±1.76ms        ? ?/sec    1.06     70.5±1.44ms        ? ?/sec
tpch_q3/vortex-pushdown-disabled     1.00     98.0±2.36ms        ? ?/sec    1.01     99.5±1.49ms        ? ?/sec
tpch_q4/vortex-pushdown-disabled     1.00     69.5±1.13ms        ? ?/sec    1.04     72.6±0.81ms        ? ?/sec
tpch_q5/vortex-pushdown-disabled     1.00    153.9±6.25ms        ? ?/sec    1.02    157.7±2.30ms        ? ?/sec
tpch_q6/vortex-pushdown-disabled     1.00     15.7±0.17ms        ? ?/sec    1.01     15.9±0.18ms        ? ?/sec
tpch_q7/vortex-pushdown-disabled     1.00    297.5±1.56ms        ? ?/sec    1.15    342.7±8.28ms        ? ?/sec
tpch_q8/vortex-pushdown-disabled     1.00    135.5±2.94ms        ? ?/sec    1.05    142.6±3.56ms        ? ?/sec
tpch_q9/vortex-pushdown-disabled     1.00   278.7±19.86ms        ? ?/sec    1.14   317.8±11.65ms        ? ?/sec

Digging in on q10 and q21, it looks like Arrow's take implementation for StringView does a full utf8 validation, something that the take implementation for normal StringArray doesn't do. If we addressed that upstream, I suspect this would put us at parity or above the Utf8 implementation.

image

a10y avatar Aug 01 '24 00:08 a10y

PR upstream for better take kernel: https://github.com/apache/arrow-rs/pull/6168

a10y avatar Aug 01 '24 00:08 a10y

Looks like this still need C Data interface fixes upstream

robert3005 avatar Aug 01 '24 09:08 robert3005

https://github.com/apache/arrow-rs/pull/6171

a10y avatar Aug 01 '24 14:08 a10y

CI won't succeed until this is merged:

  • https://github.com/apache/arrow-rs/pull/6171

Benches will continue to be slower than regular Utf8 until this merges:

  • https://github.com/apache/arrow-rs/pull/6168

We may also want to consider how we handle arrays that canonicalize to something that doesn't fit in Arrow. It's not crazy to build a dictionary-encoded array that has >2GB of string data in a single buffer. That would add complexity to either our internal logic, or the into_arrow logic.

Discussion upstream on i32/u32 for BinaryView happening at https://github.com/apache/arrow-rs/issues/6172

a10y avatar Aug 01 '24 14:08 a10y

Alright, the above 2 PRs have merged into arrow-rs, which means we now need to wait for them to make their way into DataFusion to get the pytests passing.

Looks like there's some amount of agreement on the discussion ticket that i32 should actually be used for BinaryView.

We basically have a few options on our end for how we want our VarBinView array to work:

  • Use i32 internally
  • Continue to use u32 internally, but only support arrays with blocks of size <=2GB
  • Continue to use u32 internally, and when we do into_arrow we can zero-copy it to arrow if all blocks are <=2GB, else we repack into multiple blocks

a10y avatar Aug 02 '24 16:08 a10y

Superceded by #757

a10y avatar Sep 05 '24 21:09 a10y