Iceberg truncate partitioning underflows, leading to wrong results

Apache Iceberg version

1.4.3 (latest release)

Query engine

Spark, but probably all of them.

Please describe the bug 🐞

The truncate partition transform can underflow for all numeric types (decimal, int, long), so it is no longer order-preserving. That is, a very small value gets assigned to a very large bucket; partition pruning then incorrectly prunes this bucket, and queries return wrong results.

Also, due to the underflow, the resulting bucket is not even a valid bucket, as it is not a multiple of the truncate width.

For example, integer value -2147483648 with truncate width 10000 gets put into bucket 2147477296.
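The arithmetic behind this number can be reproduced outside of Iceberg. The following is a minimal sketch, assuming the transform is computed with plain 32-bit int arithmetic as v - (((v % W) + W) % W); the class and method names are hypothetical and this is not Iceberg's actual code.

```java
public class TruncateUnderflowDemo {
    // Floor-style truncation using plain int arithmetic (assumed formula, see lead-in).
    static int truncateWrapping(int value, int width) {
        // Force the remainder to be non-negative, since Java's % follows the sign of the dividend.
        int remainder = ((value % width) + width) % width;
        // The subtraction wraps around silently when value is close to Integer.MIN_VALUE.
        return value - remainder;
    }

    public static void main(String[] args) {
        // No underflow: -25 truncated to width 10 is -30.
        System.out.println(truncateWrapping(-25, 10));
        // Underflow: the remainder is 6352, and -2147483648 - 6352 wraps to 2147477296.
        System.out.println(truncateWrapping(-2147483648, 10000));
    }
}
```

The second result is positive and ends in 7296, so it is neither below the input value nor a multiple of 10000.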

Spark SQL repro:

CREATE TABLE iceberg.sample (x integer) PARTITIONED BY(truncate(x,10000))
TBLPROPERTIES(`format-version`=2);

INSERT INTO iceberg.sample VALUES(-2147483648);

-- Returns empty result, since the value is put into bucket 2147477296
SELECT * FROM iceberg.sample WHERE x < 0;

I guess truncate partitioning should check for underflow and in this case put the value into the lowest possible bucket. The transform needs to be clarified in the specification as well.
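One possible shape of such a check is sketched below. It assumes that "lowest possible bucket" means clamping to Integer.MIN_VALUE, which is only one interpretation and is itself not a multiple of the width; the class and method names are hypothetical, and the actual behavior would have to be settled in the specification.

```java
public class TruncateClampSketch {
    // Truncation for int values that detects underflow by computing in 64-bit arithmetic.
    static int truncateClamped(int value, int width) {
        // Non-negative remainder, computed as long so no intermediate step can wrap.
        long remainder = ((value % (long) width) + width) % width;
        long truncated = value - remainder;
        // Assumption: Integer.MIN_VALUE stands in for the "lowest possible bucket".
        return truncated < Integer.MIN_VALUE ? Integer.MIN_VALUE : (int) truncated;
    }

    public static void main(String[] args) {
        // Clamped instead of wrapping to a large positive bucket.
        System.out.println(truncateClamped(-2147483648, 10000));
        // Values that do not underflow are unaffected.
        System.out.println(truncateClamped(-25, 10));
    }
}
```

With this variant the result never exceeds the input, so the transform stays monotonic, at the cost of one irregular bucket at the bottom of the range. The long case would need a different overflow check (for example `Math.subtractExact`), since there is no wider primitive type to compute in.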

Feb 20 '24 19:02 JFinis

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

Oct 21 '24 00:10 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

Nov 05 '24 00:11 github-actions[bot]