Spark: support hilbert curve when rewrite
The data aggregation of Hilbert curve is better, this pr will support.
The data aggregation of Hilbert curve is better, this pr will support.
Can we take a look at it,thanks! @ajantha-bhat @RussellSpitzer @rdblue @kbendick
hi @linfey90 Overall implementation is good.
@RussellSpitzer: Would you like to take a look at this?
Ok, I'll try to do that later. Thank you for your reply.
Hi, @linfey90 I have a question want to know, here how I use zorder in my code
actions().rewriteDataFiles(table).zOrder(columns).option("xxx","xxx")
if you add hilbert, it only to change zOrder to hilbert to use it like
actions().rewriteDataFiles(table).HilbertCurve(columns).options("xxx","xxx") or something else?
yes,that's right.
Hilbert has better data aggregation.Here is a simple performance test. 1: prepare a parquet table which has One hundred million rows, and 11 columns. and has two column name c1 and c2.the values is range from 0 to 500000. the flinksql like, CREATE TABLE default_catalog.default_database.dg ( c1 INT, c2 bigint c3 VARCHAR, c4 VARCHAR, c5 TINYINT, c6 SMALLINT, c7 FLOAT, c8 double, c9 char, c10 boolean, c11 AS localtimestamp ) WITH ( 'connector' = 'datagen', 'fields.c3.length' = '10', 'fields.c4.length' = '10', 'fields.c1.min' = '0', 'fields.c1.max' = '1000000', 'fields.c2.min' = '0', 'fields.c2.max' = '1000000', 'rows-per-second' = '30000', 'number-of-rows' ='100000000' ); 2: Create two tables, test_zorder and test_hilbert, and copy the above data. 3: rewrite the table by sort c1,c2 with zorder and hilbert. 4: Write code to view the number of file skips, and execute the sql like select count.
| query condition | table | file skip | total Files | file Skip percentage | query time |
|---|---|---|---|---|---|
| c1 <500000 and c2 < 500000 | hilbert | 97 | 171 | 56.7% | 1.018s |
| zorder | 82 | 180 | 45.56% | 1.353s | |
| c1 >500000 and c2 > 500000 | hilbert | 28 | 171 | 16.37% | 3.337s |
| zorder | 18 | 180 | 10% | 3.37s |
note:The query time depends on the cluster environment and is for reference only. But file skip is stable.