
The truncate partition transform is underspecified

Open JFinis opened this issue 2 years ago • 1 comments

Apache Iceberg version

1.4.3 (latest release)

Query engine

None; it's a Spec issue

Please describe the bug 🐞

The spec does not clearly define how the truncate partition transform should behave when truncating the value would lead to an underflow. Right now, the implementation just underflows and thus yields wrong query results, which I reported in a separate issue.
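To make the underflow concrete, here is a minimal sketch. It assumes the integer truncation formula `v - (((v % W) + W) % W)` and plain 32-bit `int` arithmetic; the class and method names are made up for illustration and are not taken from the Iceberg code base.

```java
// Hypothetical demo class, not part of Iceberg.
class TruncateUnderflowDemo {
    // Assumed formula: subtract the non-negative remainder of v modulo width.
    static int truncate(int v, int width) {
        return v - (((v % width) + width) % width);
    }

    public static void main(String[] args) {
        // Ordinary negative value: lands in the bucket below it.
        System.out.println(truncate(-1, 10)); // prints -10

        // Near the lower bound the subtraction wraps around silently,
        // so the "truncated" value ends up larger than the input.
        System.out.println(truncate(Integer.MIN_VALUE, 100)); // prints 2147483596
    }
}
```

With this arithmetic, the smallest int values end up in a bucket near the top of the int range, which would explain wrong results when a reader prunes on the partition value.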

I am reporting the fact that the transform is underspecified as a separate issue, as it is not enough to just fix the implementation. The Iceberg specification also has to be updated to precisely specify what should happen. Otherwise, different Iceberg implementations will do different things here.

The best fix is probably to define that, if truncation underflows, the value lands in the smallest possible bucket. What that bucket is isn't clearly defined either, so it would need to be defined as well. IMHO, it would make sense for this bucket to be the minimum value of the type (Integer.MIN_VALUE, Long.MIN_VALUE, and whatever the equivalent is for 128-bit decimals). Note that this value might not be a correctly truncated bucket, but it is still the only sensible value to choose.

For example, Integer.MIN_VALUE = -2147483648 with truncate width 100 is not a "correctly truncated" bucket, since only -2147483600 and -2147483700 would be. However, we cannot choose -2147483700, as it underflows and is not representable in an int. We also shouldn't choose -2147483600, as this is simply not the correct bucket for the value: the value is smaller than it, and a bucket should never contain values smaller than itself (otherwise it's not a truncation). Thus, the only possible value to choose in this circumstance is indeed Integer.MIN_VALUE, even though it's not a full truncation.
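The proposed rule could be sketched as follows. This is only an illustration of the suggested behavior for `int`, not existing Iceberg code: the class and method names are hypothetical, and computing in `long` to detect the underflow is just one possible way to do it.

```java
// Hypothetical sketch of the proposed rule, not part of Iceberg.
class SaturatingTruncate {
    // Truncate v down to a multiple of width; if that multiple is not
    // representable as an int, clamp to Integer.MIN_VALUE instead of wrapping.
    static int truncateSaturating(int v, int width) {
        // Do the arithmetic in 64 bits so the subtraction cannot wrap.
        long remainder = Math.floorMod((long) v, (long) width);
        long truncated = (long) v - remainder;
        return truncated < Integer.MIN_VALUE ? Integer.MIN_VALUE : (int) truncated;
    }

    public static void main(String[] args) {
        System.out.println(truncateSaturating(Integer.MIN_VALUE, 100)); // prints -2147483648
        System.out.println(truncateSaturating(-2147483600, 100));       // prints -2147483600
        System.out.println(truncateSaturating(-1, 10));                 // prints -10
    }
}
```

Under this rule, every value from -2147483648 to -2147483601 would share the Integer.MIN_VALUE bucket for width 100, and all other values keep their usual bucket.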

JFinis, Feb 20 '24 19:02

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot], Oct 21 '24 00:10

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot], Nov 05 '24 00:11