[draft] Add `LogicalType`, try to support user-defined types
Which issue does this PR close?
Closes #7923.
This pull request is an experimental demo for validating the feasibility of logical types.
Rationale for this change
What changes are included in this PR?
Features
- Create User-Defined Types (UDTs) through SQL, specifying the field types as UDTs during table creation.
- Support the use of `UDT` as a function signature in `udf`/`udaf`.
- Register extension types through the `register_data_type` function in the `SessionContext`.
New Additions
- `LogicalType` struct.
- `ExtensionType` trait. Abstraction for extension types.
- `TypeSignature` struct. Uniquely identifies a data type.
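To make the relationship between the three additions concrete, here is a minimal, self-contained sketch. It is not the code in this PR: the field names, the `PhysicalType` enum (a stand-in for arrow `DataType`), the `from_extension` helper, and `LineStringType` are all assumptions made for illustration.

```rust
use std::fmt::Debug;

/// Hypothetical: a type name plus parameters, used as a unique key for a data type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TypeSignature {
    pub name: String,
    pub params: Vec<String>,
}

impl TypeSignature {
    pub fn new(name: &str) -> Self {
        Self { name: name.to_string(), params: Vec::new() }
    }
}

/// Hypothetical stand-in for the arrow `DataType` used for physical storage.
#[derive(Debug, Clone, PartialEq)]
pub enum PhysicalType {
    Int64,
    Utf8,
    Binary,
}

/// Hypothetical extension-type abstraction: a signature plus a storage type.
pub trait ExtensionType: Debug {
    fn type_signature(&self) -> TypeSignature;
    fn physical_type(&self) -> PhysicalType;
}

/// Hypothetical logical type: what the planner sees, independent of storage details.
#[derive(Debug, Clone, PartialEq)]
pub struct LogicalType {
    pub signature: TypeSignature,
    pub physical: PhysicalType,
}

impl LogicalType {
    /// Build a logical type from any `ExtensionType` implementation.
    pub fn from_extension(ext: &dyn ExtensionType) -> Self {
        Self {
            signature: ext.type_signature(),
            physical: ext.physical_type(),
        }
    }
}

/// Example extension type (assumed): a geometry stored as binary.
#[derive(Debug)]
pub struct LineStringType;

impl ExtensionType for LineStringType {
    fn type_signature(&self) -> TypeSignature {
        TypeSignature::new("LineString")
    }

    fn physical_type(&self) -> PhysicalType {
        PhysicalType::Binary
    }
}
```

In this sketch, `LogicalType::from_extension(&LineStringType)` yields a logical type keyed by the `LineString` signature and backed by binary storage.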
Major Changes
- Added `get_data_type(&self, _name: &TypeSignature) -> Option<LogicalType>` function to the `ContextProvider` trait.
- In `DFSchema`, `DFField` now uses `LogicalType`, removing arrow `Field` and retaining only `data_type`, `nullable`, `metadata` since `dict_id`, `dict_is_ordered` are not necessary at the logical stage.
- `ExprSchemable` and `ExprSchema` now use `LogicalType`.
- `ast` to logical plan conversion now uses `LogicalType`.
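The lookup path described above can be sketched roughly as follows. Only the `get_data_type` signature and the three retained `DFField` attributes come from this description; `ToyContext`, `resolve_field`, the simplified `TypeSignature` and `LogicalType`, the `name` field, and the `register_data_type` parameters are assumptions for illustration and do not reflect the actual DataFusion types.

```rust
use std::collections::HashMap;

/// Simplified stand-in: identifies a type by name only.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TypeSignature(pub String);

/// Simplified stand-in for the logical type.
#[derive(Debug, Clone, PartialEq)]
pub struct LogicalType {
    pub name: String,
}

/// Toy version of the trait, with the lookup method from the description.
pub trait ContextProvider {
    fn get_data_type(&self, name: &TypeSignature) -> Option<LogicalType>;
}

/// Hypothetical context holding the registered types.
#[derive(Default)]
pub struct ToyContext {
    types: HashMap<TypeSignature, LogicalType>,
}

impl ToyContext {
    /// Assumed shape of type registration; the real parameters may differ.
    pub fn register_data_type(&mut self, sig: TypeSignature, ty: LogicalType) {
        self.types.insert(sig, ty);
    }
}

impl ContextProvider for ToyContext {
    fn get_data_type(&self, name: &TypeSignature) -> Option<LogicalType> {
        self.types.get(name).cloned()
    }
}

/// Logical field keeping `data_type`, `nullable` and `metadata`
/// (the `name` field is an assumption); no `dict_id` or `dict_is_ordered`.
#[derive(Debug, Clone, PartialEq)]
pub struct DFField {
    pub name: String,
    pub data_type: LogicalType,
    pub nullable: bool,
    pub metadata: HashMap<String, String>,
}

/// Hypothetical helper: resolve a column's declared type name while planning.
/// Returns `None` when the type name was never registered.
pub fn resolve_field(
    ctx: &dyn ContextProvider,
    column: &str,
    type_name: &str,
) -> Option<DFField> {
    let data_type = ctx.get_data_type(&TypeSignature(type_name.to_string()))?;
    Some(DFField {
        name: column.to_string(),
        data_type,
        nullable: true,
        metadata: HashMap::new(),
    })
}
```

The point of routing the lookup through the provider is that the SQL planner never needs to know how a user-defined type was registered; it only asks for a `LogicalType` by signature.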
To Be Implemented
- `TypeCoercionRewriter` in the analyze stage uses logical types. For example, functions like `comparison_coercion`, `get_input_types`, `get_valid_types`, etc.
- Function signatures for `udf`/`udaf` use `TypeSignature` instead of the existing `DataType` for ease of use in `udf`/`udaf`.
To Be Determined
- Should `ScalarValue` use `LogicalType` or arrow `DataType`?
  - [ ] `LogicalType`.
  - [ ] `DataType`
- Should `TableSource` return `DFSchema` or arrow `Schema`?
  - [ ] `Schema`.
  - [ ] `DFSchema`
- Conversion between physical types and logical types (in Datafusion, type conversion is achieved through the conversion of `DFSchema` to `Schema`; logical plans use `DFSchema`, physical plans use `Schema`).
- Conversion between `Schema` and `DFSchema`
  - When to convert `Schema` to `DFSchema`?
    - [ ] During the construction of the logical `TableScan` node, obtain arrow `Schema` through `TableSource`/`TableProvider` and then convert it to `DFSchema`.
    - [ ] `TableSource`/`TableProvider` returns `DFSchema` instead of `Schema`.
  - When to convert `DFSchema` to `Schema`?
    - [ ] Directly obtain arrow `Schema` from `TableSource` in physical planner, no need for conversion.
    - [ ] Convert the `DFSchema` returned by `TableSource` to `Schema` in the physical planner stage.
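As a rough illustration of the `DFSchema` to `Schema` direction discussed above, the sketch below lowers a logical schema to a physical one by replacing each logical type with its storage type. All names here (`ToyDFSchema`, `ToySchema`, `LogicalField`, `PhysicalField`, `PhysicalType`) are hypothetical stand-ins and not DataFusion or arrow APIs.

```rust
/// Hypothetical stand-in for the arrow `DataType`.
#[derive(Debug, Clone, PartialEq)]
pub enum PhysicalType {
    Int64,
    Utf8,
    Binary,
}

/// Hypothetical logical type: a name plus the physical type that stores it.
#[derive(Debug, Clone, PartialEq)]
pub struct LogicalType {
    pub name: String,
    pub physical: PhysicalType,
}

/// Field as seen by logical plans.
#[derive(Debug, Clone, PartialEq)]
pub struct LogicalField {
    pub name: String,
    pub data_type: LogicalType,
    pub nullable: bool,
}

/// Field as seen by physical plans.
#[derive(Debug, Clone, PartialEq)]
pub struct PhysicalField {
    pub name: String,
    pub data_type: PhysicalType,
    pub nullable: bool,
}

/// Toy counterpart of `DFSchema` (logical side).
#[derive(Debug, Clone, PartialEq)]
pub struct ToyDFSchema {
    pub fields: Vec<LogicalField>,
}

/// Toy counterpart of arrow `Schema` (physical side).
#[derive(Debug, Clone, PartialEq)]
pub struct ToySchema {
    pub fields: Vec<PhysicalField>,
}

/// One-way lowering: logical type information is dropped and only the
/// storage type survives, so the reverse conversion would be lossy.
impl From<&ToyDFSchema> for ToySchema {
    fn from(schema: &ToyDFSchema) -> Self {
        let fields = schema
            .fields
            .iter()
            .map(|f| PhysicalField {
                name: f.name.clone(),
                data_type: f.data_type.physical.clone(),
                nullable: f.nullable,
            })
            .collect();
        ToySchema { fields }
    }
}
```

Because the lowering discards the logical name, a `LineString` column and a plain binary column become indistinguishable on the physical side, which is one reason the open questions above focus on where each conversion should happen.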
Some Thoughts
- In this comment, the use case of converting from `dyn Array` to `LineStringArray` or `MultiPointArray` was raised. In my view, assuming there is a function specifically designed for handling `LineString` data, the function signature can be defined as `LineString`, ensuring that the input data must be of a type acceptable by `LineStringArray`.
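The idea of declaring a function signature in terms of a logical type can be sketched as a plan-time check. This is only an illustration under assumed names: `FunctionSignature`, `check_args`, and the simplified `LogicalType` below are hypothetical and are not part of this PR or of DataFusion.

```rust
/// Simplified stand-in for a logical type, identified by name only.
#[derive(Debug, Clone, PartialEq)]
pub struct LogicalType {
    pub name: String,
}

/// Hypothetical function signature declared in terms of logical type names.
#[derive(Debug, Clone)]
pub struct FunctionSignature {
    pub name: String,
    pub arg_types: Vec<String>,
}

/// Reject calls whose argument types do not match the declared logical types.
/// Running this during planning means the function body can assume its input
/// is already of the declared type when it executes.
pub fn check_args(sig: &FunctionSignature, args: &[LogicalType]) -> Result<(), String> {
    if sig.arg_types.len() != args.len() {
        return Err(format!(
            "{} expects {} argument(s), got {}",
            sig.name,
            sig.arg_types.len(),
            args.len()
        ));
    }
    for (i, (expected, actual)) in sig.arg_types.iter().zip(args).enumerate() {
        if *expected != actual.name {
            return Err(format!(
                "{}: argument {} must be {}, got {}",
                sig.name, i, expected, actual.name
            ));
        }
    }
    Ok(())
}
```

With a check like this, a function declared over `LineString` would fail at planning time when given a `MultiPoint` column, rather than failing on a downcast at execution time.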
Are these changes tested?
Are there any user-facing changes?
This PR has some unresolved issues that need discussion. Once the team reaches consensus on all of them, I will reorganize the PR accordingly.
I've organized the logic for the mutual conversion between DFSchema and Schema in datafusion. In theory, there should be no conversion logic from Schema to DFSchema. I've outlined all the modifications below.
DFSchema to Schema
No need to change
DefaultPhysicalPlanner
- DescribeTable
- Values -> ValuesExec
- EmptyRelation -> EmptyExec
- Unnest -> UnnestExec
- CopyTo
- Explain
- Analyze
To be changed
- [ ] TableProvider::schema
  - [ ] ViewTable
  - [ ] ListingTable
  - [ ] EmptyTable
  - [ ] MemTable
  - [ ] StreamingTable
- [ ] DataFrame
  - [x] write_table: replace with DFSchema
  - [ ] cache: build MemTable
Schema to DFSchema (To be changed)
- [x] LogicalPlanBuilder::insert_into: can directly use DFSchema
- [x] LogicalPlanBuilder::explain: can directly use DFSchema
- [x] ConstEvaluator: construct DFSchema then to Schema
- [x] SqlToRel::explain_to_plan: output schema can directly use DFSchema
- [x] SqlToRel::describe_table_to_plan: output schema can directly use DFSchema
- [ ] SqlToRel::insert_to_plan: depends on `table_source.schema()`
- [ ] SqlToRel::delete_to_plan: depends on `table_source.schema()`
- [ ] ListingTable::scan: used to create_physical_expr
Thanks @yukkit -- I plan to give this a look, but probably will not have time until tomorrow
What's the status of this pr? This should be a very useful feature.
I think this PR is stalled and I don't have any update
Please accept my apologies for the delay. Due to personal circumstances, I have been unable to attend to any work. I will now proceed to resume work on this PR.
No worries at all -- I hope all is well and we look forward to this work
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Hello, sorry if this is a redundant question. What is the status of this PR?
> Hello, sorry if this is a redundant question. What is the status of this PR?
I think it is stale and on track to be closed from what I can see
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
FYI https://github.com/apache/datafusion/pull/11160 tracks a new proposal for this feature. It seems to be gaining traction