[blocked] Switch to Utf8View for TPC-H
Blocked on https://github.com/apache/datafusion/pull/11518 and https://github.com/apache/arrow-rs/pull/6077
FWIW this is unblocked if you were to use master of datafusion
Initial TPC-H benchmarks comparison:
aduffy@DuffyProBook ~/c/vortex (aduffy/utf8view) [1]> critcmp develop-vortex utf8view-vortex
group develop-vortex utf8view-vortex
----- -------------- ---------------
tpch_q1/vortex-pushdown-disabled 1.00 339.2±0.83ms ? ?/sec 1.24 421.0±1.13ms ? ?/sec
tpch_q10/vortex-pushdown-disabled 1.00 165.3±5.40ms ? ?/sec 1.34 222.0±3.07ms ? ?/sec
tpch_q11/vortex-pushdown-disabled 1.00 105.7±3.73ms ? ?/sec 1.26 133.3±17.73ms ? ?/sec
tpch_q12/vortex-pushdown-disabled 1.00 119.1±3.35ms ? ?/sec 2.41 287.2±8.59ms ? ?/sec
tpch_q13/vortex-pushdown-disabled 1.00 156.1±2.57ms ? ?/sec 1.59 248.4±21.85ms ? ?/sec
tpch_q14/vortex-pushdown-disabled 1.00 21.1±0.64ms ? ?/sec 1.58 33.3±1.74ms ? ?/sec
tpch_q16/vortex-pushdown-disabled 1.00 78.0±2.25ms ? ?/sec 1.37 107.2±13.18ms ? ?/sec
tpch_q17/vortex-pushdown-disabled 1.00 307.4±8.72ms ? ?/sec 1.12 344.4±10.19ms ? ?/sec
tpch_q18/vortex-pushdown-disabled 1.00 588.3±48.01ms ? ?/sec 1.04 611.9±23.91ms ? ?/sec
tpch_q19/vortex-pushdown-disabled 1.00 99.8±0.56ms ? ?/sec 3.84 383.4±3.93ms ? ?/sec
tpch_q2/vortex-pushdown-disabled 1.00 77.6±1.22ms ? ?/sec 1.12 86.8±0.90ms ? ?/sec
tpch_q20/vortex-pushdown-disabled 1.00 134.3±7.21ms ? ?/sec 1.10 147.5±5.04ms ? ?/sec
tpch_q21/vortex-pushdown-disabled 1.00 543.7±33.73ms ? ?/sec 1.10 598.0±25.40ms ? ?/sec
tpch_q22/vortex-pushdown-disabled 1.00 66.3±1.76ms ? ?/sec 1.10 72.6±1.95ms ? ?/sec
tpch_q3/vortex-pushdown-disabled 1.00 98.0±2.36ms ? ?/sec 1.05 103.1±4.27ms ? ?/sec
tpch_q4/vortex-pushdown-disabled 1.00 69.5±1.13ms ? ?/sec 1.51 104.7±2.85ms ? ?/sec
tpch_q5/vortex-pushdown-disabled 1.00 153.9±6.25ms ? ?/sec 1.01 156.0±6.22ms ? ?/sec
tpch_q6/vortex-pushdown-disabled 1.00 15.7±0.17ms ? ?/sec 1.01 15.9±0.35ms ? ?/sec
tpch_q7/vortex-pushdown-disabled 1.00 297.5±1.56ms ? ?/sec 1.08 321.1±9.67ms ? ?/sec
tpch_q8/vortex-pushdown-disabled 1.00 135.5±2.94ms ? ?/sec 1.08 146.7±3.22ms ? ?/sec
tpch_q9/vortex-pushdown-disabled 1.00 278.7±19.86ms ? ?/sec 1.14 319.0±6.45ms ? ?/sec
Going to profile some of the big boys, starting with q19
Oh right, I forgot that into_canonical currently copies the world 🤦
should be an easy fix
Alright, a bit warmer now:
$ critcmp develop-vortex utf8view-vortex-fixed-faster
group develop-vortex utf8view-vortex-fixed-faster
----- -------------- ----------------------------
tpch_q1/vortex-pushdown-disabled 1.00 339.2±0.83ms ? ?/sec 1.13 382.8±3.05ms ? ?/sec
tpch_q10/vortex-pushdown-disabled 1.00 165.3±5.40ms ? ?/sec 1.50 247.8±0.57ms ? ?/sec
tpch_q11/vortex-pushdown-disabled 1.00 105.7±3.73ms ? ?/sec 1.11 117.3±7.19ms ? ?/sec
tpch_q12/vortex-pushdown-disabled 1.00 119.1±3.35ms ? ?/sec 1.27 151.5±1.06ms ? ?/sec
tpch_q13/vortex-pushdown-disabled 1.00 156.1±2.57ms ? ?/sec 1.01 157.7±2.25ms ? ?/sec
tpch_q14/vortex-pushdown-disabled 1.00 21.1±0.64ms ? ?/sec 1.34 28.2±0.72ms ? ?/sec
tpch_q16/vortex-pushdown-disabled 1.00 78.0±2.25ms ? ?/sec 1.25 97.6±1.14ms ? ?/sec
tpch_q17/vortex-pushdown-disabled 1.00 307.4±8.72ms ? ?/sec 1.12 342.9±3.81ms ? ?/sec
tpch_q18/vortex-pushdown-disabled 1.00 588.3±48.01ms ? ?/sec 1.35 795.7±81.62ms ? ?/sec
tpch_q19/vortex-pushdown-disabled 1.00 99.8±0.56ms ? ?/sec 1.37 136.9±3.13ms ? ?/sec
tpch_q2/vortex-pushdown-disabled 1.00 77.6±1.22ms ? ?/sec 1.07 83.0±2.37ms ? ?/sec
tpch_q20/vortex-pushdown-disabled 1.00 134.3±7.21ms ? ?/sec 1.05 141.1±2.10ms ? ?/sec
tpch_q21/vortex-pushdown-disabled 1.00 543.7±33.73ms ? ?/sec 1.46 793.8±27.10ms ? ?/sec
tpch_q22/vortex-pushdown-disabled 1.00 66.3±1.76ms ? ?/sec 1.06 70.5±1.44ms ? ?/sec
tpch_q3/vortex-pushdown-disabled 1.00 98.0±2.36ms ? ?/sec 1.01 99.5±1.49ms ? ?/sec
tpch_q4/vortex-pushdown-disabled 1.00 69.5±1.13ms ? ?/sec 1.04 72.6±0.81ms ? ?/sec
tpch_q5/vortex-pushdown-disabled 1.00 153.9±6.25ms ? ?/sec 1.02 157.7±2.30ms ? ?/sec
tpch_q6/vortex-pushdown-disabled 1.00 15.7±0.17ms ? ?/sec 1.01 15.9±0.18ms ? ?/sec
tpch_q7/vortex-pushdown-disabled 1.00 297.5±1.56ms ? ?/sec 1.15 342.7±8.28ms ? ?/sec
tpch_q8/vortex-pushdown-disabled 1.00 135.5±2.94ms ? ?/sec 1.05 142.6±3.56ms ? ?/sec
tpch_q9/vortex-pushdown-disabled 1.00 278.7±19.86ms ? ?/sec 1.14 317.8±11.65ms ? ?/sec
Digging in on q10 and q21, it looks like Arrow's take implementation for StringView does a full utf8 validation, something that the take implementation for normal StringArray doesn't do. If we addressed that upstream, I suspect this would put us at parity or above the Utf8 implementation.
PR upstream for better take kernel: https://github.com/apache/arrow-rs/pull/6168
Looks like this still need C Data interface fixes upstream
https://github.com/apache/arrow-rs/pull/6171
CI won't succeed until this is merged:
- https://github.com/apache/arrow-rs/pull/6171
Benches will continue to be slower than regular Utf8 until this merges:
- https://github.com/apache/arrow-rs/pull/6168
We may also want to consider how we handle arrays that canonicalize to something that doesn't fit in Arrow. It's not crazy to build a dictionary-encoded array that has >2GB of string data in a single buffer. That would add complexity to either our internal logic, or the into_arrow logic.
Discussion upstream on i32/u32 for BinaryView happening at https://github.com/apache/arrow-rs/issues/6172
Alright, the above 2 PRs have merged into arrow-rs, which means we now need to wait for them to make their way into DataFusion to get the pytests passing.
Looks like there's some amount of agreement on the discussion ticket that i32 should actually be used for BinaryView.
We basically have a few options on our end for how we want our VarBinView array to work:
- Use i32 internally
- Continue to use u32 internally, but only support arrays with blocks of size <=2GB
- Continue to use u32 internally, and when we do into_arrow we can zero-copy it to arrow if all blocks are <=2GB, else we repack into multiple blocks
Superceded by #757