spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-40296] Error class for DISTINCT function not found

Open amaliujia opened this issue 3 years ago • 5 comments

What changes were proposed in this pull request?

When users have a query like SELECT SUM(a), DISTINCT(a, b) FROM table, the DISTINCT is treated as a function but cannot be found with the following message:

Undefined function: distinct. This function is neither a built-in/temporary function, nor a persistent function that is qualified as hive_metastore.default.distinct

We can standardize this error message for the case of distinct as users might want to use distinct to deduplicate arguments but thought it is an aggregation functions. The new error message should remind proper actions for fix distinct syntax issue.

Why are the changes needed?

Improve user experience.

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT

amaliujia avatar Sep 01 '22 01:09 amaliujia

R: @cloud-fan

amaliujia avatar Sep 01 '22 01:09 amaliujia

cc @MaxGekk

cloud-fan avatar Sep 01 '22 13:09 cloud-fan

R @MaxGekk

amaliujia avatar Sep 01 '22 17:09 amaliujia

Can one of the admins verify this patch?

AmplabJenkins avatar Sep 02 '22 18:09 AmplabJenkins

Converting to draft because we just realize SELECT DISTINCT(c1, c2) works but it converts c1 and c2 to a struct then distinct on the struct and the output is a one column (so not two columns). We need more discussion for how to deal with this situation.

amaliujia avatar Sep 02 '22 18:09 amaliujia

Because Spark supports SELECT distinct(col1, col2) (and the return is a struct of co1 and col2), which makes this error message proposal complicated.

Because now we cannot say that SELECT SUM(a1), distinct(c1, c2) is not a valid usage, even though this is not supported by the parser, because users might have been implied that distinct(col list) is a valid aggregate like usage. We at least need a better way to distinguish 1) select distinct col_list 2) select distinct(col_list) 3) select agg1, agg2, distinct(col_list)`.

Given the reason above, I will close this PR and defer the handling for the distinct error described in this PR.

amaliujia avatar Sep 26 '22 19:09 amaliujia