hive icon indicating copy to clipboard operation
hive copied to clipboard

HIVE-28483: Fix for casting invalid date give wrong result instead of…

Open zratkai opened this issue 1 year ago • 1 comments

Fix for casting invalid date give wrong result instead of null

Change-Id: Iffee3e070fadbbfca83adb98561e95f216f7722e

What changes were proposed in this pull request?

Why are the changes needed?

Casting invalid date may gave back strange date instead of null.

Does this PR introduce any user-facing change?

Yes, it gives back null if casting an invalid date, instead of a strange result.

Is the change a dependency upgrade?

No.

How was this patch tested?

Unit test.

zratkai avatar Aug 29 '24 10:08 zratkai

@zabetak @okumin I replied in the ticket, here I copy it as well:

I can undestand this is documented,but I have doubts that what a data analyst would expect when giving an input like:  "03-08-2024" and get a result "0003-08-20", and say just drop the "24" from the string, and say it must be some garbage.  This issue comes from a real user who is complaining it is a bug, and I totally agree with it.

As a human I can say the "03-08-2024" is a reverse of "2024-08-03" and it meant 3rd of August 2024, and not 20th of August the year of 3, and 24 must be some kind of garbage. Saying, we just skip the 24 and act like there is nothing is absolutely misleading, I think there is no user who would expect that.

If a string do not conform with the default pattern the best is give back null, to make it clear that it does not conform with the pattern, instead giving back some strange date is absolutely misleading, and error-prone.

Is there any other system which acts like that? I have never seen any.

zratkai avatar Sep 09 '24 11:09 zratkai

I personally second @zratkai 's opinion. We have three options when the CAST function meets unreasonable inputs.

  1. Produce unreasonable output
  2. Produce NULL
  3. Let it crush

I think (3) is not Hive's option since it is known that Hive's CAST is very generous. A strict choice is inconsistent with CAST between other types. Comparing (1) and (2), I think (2) is more user-friendly.

Thanks

okumin avatar Sep 09 '24 12:09 okumin

The ticket which is referred : "HIVE-27586: Parse dates from strings ignoring trailing (potentially) invalid chars" is about ignoring trailing (potentially) characters, and not about truncating valid, but reversely formatted dates, which is a bug here reported by a customer. In the PR description it is even not mentioning this behavior.

Could you please be more specific what you mean by this? "The changes lead to behavior changes in various SQL functions."

All the test are green, so I do not see what SQL functions would break.

If a user likes to convert from the not standard format, it can be done with CAST( AS DATE FORMAT ) command, but the default one should be not something which is not now across the tech industry. Could you please mention any language which supports this format as default?

zratkai avatar Sep 10 '24 08:09 zratkai

@zabetak @okumin Please check the updated ticket description. I made a research on other DB systems (Postgre, Oracle, Mysql). None of them give back error-prone result, instead exception or null.

zratkai avatar Sep 10 '24 10:09 zratkai

In the PR description it is even not mentioning this behavior.

In HIVE-27586, there is an explicit explanation about dates of the form 03-08-2023 and documents how these are treated after the changes there. There were also explicit unit tests added to guard against accidental changes.

Could you please be more specific what you mean by this? "The changes lead to behavior changes in various SQL functions."

The changes in the parser incurred by this PR will affect at least the same SQL functions that are mentioned under HIVE-27586. Some test cases were modified by this PR and new limits were introduced so there are behavior changes.

I made a research on other DB systems (Postgre, Oracle, Mysql). None of them give back error-prone result, instead exception or null.

Thanks for doing that! I fully agree with you that the default date format should be consistent across SQL engines. Ideally, I would like the parser to be very strict and only accept dates in the form 2023-08-03 (YYYY-MM-DD) but this behavior was not appreciated by all users and this is why HIVE-27586 came to the picture.

zabetak avatar Sep 10 '24 12:09 zabetak

@zabetak Is there anything else apart from the debate here and in the JIRA ticket holding off this PR? I want to get a feeling what is pending to finalize this work to see how to advance.

zratkai avatar Sep 17 '24 15:09 zratkai

@zratkai It seems that we are close in reaching consensus. I suggest to wait 48 hours to allow people to review all new elements and then merge it.

zabetak avatar Sep 19 '24 09:09 zabetak