iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

Spark: Support truncate in FunctionCatalog

Open kbendick opened this issue 3 years ago • 5 comments

This is an offshoot of https://github.com/apache/iceberg/pull/5305 and partially closes https://github.com/apache/iceberg/issues/5349.

Adds a system.truncate function that can be used in Spark SQL, as well as can be used as a FunctionCatalog function that can be turned into a transform for storage partitioned joins.

This also breaks the definition of Truncate out into utility functions inside of TruncateUtil. Because different usages validate input at different times, all of the functions in TruncateUtil do not validate their input and instead assume that the input is validated by the calling code. This allows for the Truncate transforms to validate their width one time (on instantiation), and for the Spark truncate function to skip input validation for faster generated code.

kbendick avatar Aug 04 '22 00:08 kbendick

cc @rdblue PTAL when you have time. I've addressed all of your feedback. Thanks!

kbendick avatar Aug 09 '22 21:08 kbendick

cc @nastra and @aokolnychyi as well as you reviewed the FunctionCatalog PR itself 🙂

kbendick avatar Aug 10 '22 04:08 kbendick

I'd love to take a look as well today.

aokolnychyi avatar Aug 11 '22 15:08 aokolnychyi

@aokolnychyi @rdblue I addressed all of your feedback.

Please take a look when you get a chance. Also, on the subject of nullability for the width column, I was able to make it work but reverted it based on this comment thread: https://github.com/apache/iceberg/pull/5431#discussion_r943914954

Knowing that Spark isn't the best at keeping track of nullability, I think this is better and adheres to the contract laid out in ScalarFunction as quoted by @aokolnychyi.

But take a look at the commit that added nullability-checking on the width field if you'd like: https://github.com/apache/iceberg/pull/5431/commits/ad5343facb561750f24e59e8cdba4da882b0313c

kbendick avatar Aug 11 '22 20:08 kbendick

Looks good to me. @aokolnychyi, do you want to take another look?

rdblue avatar Aug 11 '22 23:08 rdblue

@rdblue, let me take a quick now.

aokolnychyi avatar Aug 12 '22 15:08 aokolnychyi

Thanks, @kbendick! Could you cherry-pick this to 3.2?

aokolnychyi avatar Aug 12 '22 15:08 aokolnychyi

Here's the PR for bucket. All the feedback from this PR was more or less applied there as well - https://github.com/apache/iceberg/pull/5513

kbendick avatar Aug 12 '22 18:08 kbendick

Thanks, @kbendick! Could you cherry-pick this to 3.2?

Sure thing!

kbendick avatar Aug 12 '22 18:08 kbendick