[Coral-Trino] Fix redundant transforms/casts from GenericProjectTransformer
What changes are proposed in this pull request, and why are they necessary?
In GenericProjectTransformer we compare fromDataType vs toDataType to convert the Coral IR function generic_project to its Trino compatible representation.
fromDataType is derived using Coral IR while toDataType is derived entirely through Calcite, both are of type RelDataType. However, #449 introduced a change where Coral IR (RelNodes) are generated using StructKind.PEEK_FIELDS_NO_EXPAND instead of StructKind.FULLY_QUALIFIED.
The problem now is that the comparison now reports 2 structs that are actually the same (same fields) as not equal. For the purposes of GenericProjectTransformer, we should consider the two StructKinds as equivalent since we only care about the struct fields (and whether or not they should be casted/transformed). Once the "inequivalent" structs are reported, the transformer will apply a redundant transform or cast.
For example, before #449, for a view in hive with a column of type users:array<struct<id:int,name:string>>, would simply be selected in the final translation SQL as users.
After #449, the same column in the same view would be rewritten as:
transform(users, x -> cast(row("x"."id", "x"."name") as row("id" int, "name" varchar)))
A transform was never needed in the first place (the generic_project never required it), now a transform is added and all it's doing is transforming a column into it's own original type.
How was this patch tested?
- Selectively tested on production views that were regressing for this reason
- i-tested, expected regressions (back to the state before 449)
In
GenericProjectTransformerwe compare ...
Can we always introduce which part of the pipeline this happens? Example: In LHS, when converting from Hive SqlNode to Coral SqlNode, etc. Heads up for others: @aastha25 @ljfgem @rzhang10
As discussed offline, please check if it is possible to use StructKind.PEEK_FIELDS_NO_EXPAND during all RelNode creation steps in the translation pipeline so that we dont have StructKind.FULLY_QUALIFIED type struct fields anywhere and hence no mismatch in the first place. String comparison for for different StructKind looks fragile. We could probably reuse CoralJavaTypeFactoryImpl and HiveRelBuilder.
As discussed offline, please check if it is possible to use StructKind.PEEK_FIELDS_NO_EXPAND during all RelNode creation steps
What are the "all RelNode creation steps"? How to identify them?
As discussed offline, please check if it is possible to use StructKind.PEEK_FIELDS_NO_EXPAND during all RelNode creation steps
What are the "all RelNode creation steps"? How to identify them?
There's 2 places off the top of my head: a) on the LHS: Coral SqlNode1 ---> CoralRelNode. b) During type derivation .
I suspect this is the reason why fromDataType and toDataType don't match here https://github.com/linkedin/coral/pull/512/files#diff-5bd7b847ded0e80b31c380003182e5f91b9e8c2e202ac25d3ae2dff0a76c5b48R206
I think if we want to do this, we might want to always leverage a RelBuilder like this (which actually triggers the PEEK_FIELDS_NO_EXPAND path).
Also if we do it, I think we should do it only if it make sense semantically in the first place, with plausible explanation (as opposed to doing it solely to address the incompatibility check).
The issue has less to do with RelNode creation (which consistently uses the same TypeFactory across Coral). Issue lies in how we create the expected data type or toDataType, updated PR description to explain.