
Spark: support Hilbert curve when rewriting data files

Open linfey90 opened this issue 3 years ago • 2 comments

The Hilbert curve gives better data clustering; this PR will add support for it.

linfey90 avatar Sep 22 '22 07:09 linfey90


Could you take a look? Thanks! @ajantha-bhat @RussellSpitzer @rdblue @kbendick

linfey90 avatar Sep 22 '22 07:09 linfey90

Hi @linfey90, the overall implementation looks good.

XuQianJin-Stars avatar Sep 23 '22 01:09 XuQianJin-Stars

@RussellSpitzer: Would you like to take a look at this?

ajantha-bhat avatar Nov 14 '22 05:11 ajantha-bhat

Ok, I'll try to do that later. Thank you for your reply.

linfey90 avatar Nov 15 '22 06:11 linfey90

Hi @linfey90, I have a question. Currently I use z-order in my code like `actions().rewriteDataFiles(table).zOrder(columns).option("xxx","xxx")`. If you add Hilbert, do I only need to change `zOrder` to the Hilbert variant, e.g. `actions().rewriteDataFiles(table).HilbertCurve(columns).options("xxx","xxx")`, or something else?

LevisBale0824 avatar Nov 26 '22 19:11 LevisBale0824

Yes, that's right.

linfey90 avatar Nov 30 '22 02:11 linfey90

Hilbert has better data clustering. Here is a simple performance test:

1. Prepare a Parquet table with one hundred million rows and 11 columns, including two columns `c1` and `c2` whose values range from 0 to 1,000,000. The Flink SQL:

```sql
CREATE TABLE default_catalog.default_database.dg (
    c1 INT,
    c2 BIGINT,
    c3 VARCHAR,
    c4 VARCHAR,
    c5 TINYINT,
    c6 SMALLINT,
    c7 FLOAT,
    c8 DOUBLE,
    c9 CHAR,
    c10 BOOLEAN,
    c11 AS localtimestamp
) WITH (
    'connector' = 'datagen',
    'fields.c3.length' = '10',
    'fields.c4.length' = '10',
    'fields.c1.min' = '0',
    'fields.c1.max' = '1000000',
    'fields.c2.min' = '0',
    'fields.c2.max' = '1000000',
    'rows-per-second' = '30000',
    'number-of-rows' = '100000000'
);
```

2. Create two tables, `test_zorder` and `test_hilbert`, and copy the above data into both.
3. Rewrite each table, sorting on `c1`, `c2` with z-order and Hilbert respectively.
4. Run a query such as `select count(*)` and record the number of files skipped.

| query condition | table | files skipped | total files | file skip percentage | query time |
|---|---|---|---|---|---|
| c1 < 500000 and c2 < 500000 | hilbert | 97 | 171 | 56.7% | 1.018s |
| c1 < 500000 and c2 < 500000 | zorder | 82 | 180 | 45.56% | 1.353s |
| c1 > 500000 and c2 > 500000 | hilbert | 28 | 171 | 16.37% | 3.337s |
| c1 > 500000 and c2 > 500000 | zorder | 18 | 180 | 10% | 3.37s |
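(Editorial note, not from the original thread.) Reading the table, the skip percentage column appears to be simply files skipped divided by total files; a quick check of the reported numbers under that assumption:

```python
# Assumed metric: file skip percentage = files skipped / total files.
# Numbers are taken from the table above.
rows = [
    ("hilbert, c1/c2 < 500000", 97, 171),
    ("zorder,  c1/c2 < 500000", 82, 180),
    ("hilbert, c1/c2 > 500000", 28, 171),
    ("zorder,  c1/c2 > 500000", 18, 180),
]
for name, skipped, total in rows:
    print(f"{name}: {skipped / total:.2%}")
```

The computed ratios match the table (56.7%, 45.56%, 16.37%, 10%) up to rounding.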

Note: the query time depends on the cluster environment and is for reference only, but the file skip counts are stable.
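(Editorial note.) The "better clustering" claim can be illustrated outside Iceberg. The sketch below is not Iceberg code; the grid size and function names are ours. It walks an 8x8 grid in Z-order (bit interleaving) and in Hilbert order (the classic iterative index-to-point algorithm) and measures how far each curve jumps between consecutive index values; smaller jumps mean rows that are close in the sort order are also close in (c1, c2) space, which is what lets more files be skipped:

```python
# Standalone illustration: Z-order vs Hilbert locality on an 8x8 grid.

def z_index(x, y, bits=3):
    """Z-order (Morton) index: interleave the bits of x and y."""
    d = 0
    for i in range(bits):
        d |= ((x >> i) & 1) << (2 * i)
        d |= ((y >> i) & 1) << (2 * i + 1)
    return d

def hilbert_point(n, d):
    """Map a Hilbert index d to (x, y) on an n x n grid
    (classic iterative algorithm; n must be a power of two)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

n = 8
# Visit every cell in curve order and record the Manhattan distance
# between consecutive cells.
z_cells = sorted(((x, y) for x in range(n) for y in range(n)),
                 key=lambda p: z_index(*p))
z_jumps = [abs(a[0] - b[0]) + abs(a[1] - b[1])
           for a, b in zip(z_cells, z_cells[1:])]

h_cells = [hilbert_point(n, d) for d in range(n * n)]
h_jumps = [abs(a[0] - b[0]) + abs(a[1] - b[1])
           for a, b in zip(h_cells, h_cells[1:])]

print("max jump, z-order:", max(z_jumps))  # Z-order makes long jumps
print("max jump, hilbert:", max(h_jumps))  # Hilbert always moves 1 cell
```

The Hilbert curve moves exactly one cell per step, while the Z-curve periodically jumps across the grid, which is one intuition for why the Hilbert rewrite skips more files in the benchmark above.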

linfey90 avatar Dec 28 '22 06:12 linfey90