
Spark: support Hilbert curve when rewriting data files

Open linfey90 opened this issue 3 years ago • 2 comments

The Hilbert curve gives better data clustering; this PR will add support for it.

linfey90 avatar Sep 22 '22 07:09 linfey90


Could you take a look? Thanks! @ajantha-bhat @RussellSpitzer @rdblue @kbendick

linfey90 avatar Sep 22 '22 07:09 linfey90

Hi @linfey90, the overall implementation looks good.

XuQianJin-Stars avatar Sep 23 '22 01:09 XuQianJin-Stars

@RussellSpitzer: Would you like to take a look at this?

ajantha-bhat avatar Nov 14 '22 05:11 ajantha-bhat

Ok, I'll try to do that later. Thank you for your reply.

linfey90 avatar Nov 15 '22 06:11 linfey90

Hi @linfey90, I have a question. Currently I use z-order in my code like `actions().rewriteDataFiles(table).zOrder(columns).option("xxx","xxx")`. If you add Hilbert, do I only need to change `zOrder` to the Hilbert variant, e.g. `actions().rewriteDataFiles(table).HilbertCurve(columns).options("xxx","xxx")`, or something else?

LevisBale0824 avatar Nov 26 '22 19:11 LevisBale0824

Yes, that's right.

linfey90 avatar Nov 30 '22 02:11 linfey90

Hilbert has better data clustering. Here is a simple performance test:

1. Prepare a Parquet table with one hundred million rows and 11 columns, including two columns `c1` and `c2` whose values range from 0 to 1,000,000. The Flink SQL:

```sql
CREATE TABLE default_catalog.default_database.dg (
    c1 INT,
    c2 BIGINT,
    c3 VARCHAR,
    c4 VARCHAR,
    c5 TINYINT,
    c6 SMALLINT,
    c7 FLOAT,
    c8 DOUBLE,
    c9 CHAR,
    c10 BOOLEAN,
    c11 AS localtimestamp
) WITH (
    'connector' = 'datagen',
    'fields.c3.length' = '10',
    'fields.c4.length' = '10',
    'fields.c1.min' = '0',
    'fields.c1.max' = '1000000',
    'fields.c2.min' = '0',
    'fields.c2.max' = '1000000',
    'rows-per-second' = '30000',
    'number-of-rows' = '100000000'
);
```

2. Create two tables, `test_zorder` and `test_hilbert`, and copy the above data into both.
3. Rewrite each table, sorting on `c1`, `c2` with z-order and Hilbert respectively.
4. Run a query such as `select count(*)` and record the number of files skipped.

| query condition | table | files skipped | total files | file skip percentage | query time |
|---|---|---|---|---|---|
| c1 < 500000 and c2 < 500000 | hilbert | 97 | 171 | 56.7% | 1.018s |
| c1 < 500000 and c2 < 500000 | zorder | 82 | 180 | 45.56% | 1.353s |
| c1 > 500000 and c2 > 500000 | hilbert | 28 | 171 | 16.37% | 3.337s |
| c1 > 500000 and c2 > 500000 | zorder | 18 | 180 | 10% | 3.37s |
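(Editorial note, not from the original thread.) Reading the table, the skip percentage column appears to be simply files skipped divided by total files; a quick check of the reported numbers under that assumption:

```python
# Assumed metric: file skip percentage = files skipped / total files.
# Numbers are taken from the table above.
rows = [
    ("hilbert, c1/c2 < 500000", 97, 171),
    ("zorder,  c1/c2 < 500000", 82, 180),
    ("hilbert, c1/c2 > 500000", 28, 171),
    ("zorder,  c1/c2 > 500000", 18, 180),
]
for name, skipped, total in rows:
    print(f"{name}: {skipped / total:.2%}")
```

The computed ratios match the table (56.7%, 45.56%, 16.37%, 10%) up to rounding.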

Note: the query time depends on the cluster environment and is for reference only, but the file skip counts are stable.
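(Editorial note.) The "better clustering" claim can be illustrated outside Iceberg. The sketch below is not Iceberg code; the grid size and function names are ours. It walks an 8x8 grid in Z-order (bit interleaving) and in Hilbert order (the classic iterative index-to-point algorithm) and measures how far each curve jumps between consecutive index values; smaller jumps mean rows that are close in the sort order are also close in (c1, c2) space, which is what lets more files be skipped:

```python
# Standalone illustration: Z-order vs Hilbert locality on an 8x8 grid.

def z_index(x, y, bits=3):
    """Z-order (Morton) index: interleave the bits of x and y."""
    d = 0
    for i in range(bits):
        d |= ((x >> i) & 1) << (2 * i)
        d |= ((y >> i) & 1) << (2 * i + 1)
    return d

def hilbert_point(n, d):
    """Map a Hilbert index d to (x, y) on an n x n grid
    (classic iterative algorithm; n must be a power of two)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

n = 8
# Visit every cell in curve order and record the Manhattan distance
# between consecutive cells.
z_cells = sorted(((x, y) for x in range(n) for y in range(n)),
                 key=lambda p: z_index(*p))
z_jumps = [abs(a[0] - b[0]) + abs(a[1] - b[1])
           for a, b in zip(z_cells, z_cells[1:])]

h_cells = [hilbert_point(n, d) for d in range(n * n)]
h_jumps = [abs(a[0] - b[0]) + abs(a[1] - b[1])
           for a, b in zip(h_cells, h_cells[1:])]

print("max jump, z-order:", max(z_jumps))  # Z-order makes long jumps
print("max jump, hilbert:", max(h_jumps))  # Hilbert always moves 1 cell
```

The Hilbert curve moves exactly one cell per step, while the Z-curve periodically jumps across the grid, which is one intuition for why the Hilbert rewrite skips more files in the benchmark above.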

linfey90 avatar Dec 28 '22 06:12 linfey90