Enable schema evolution for nested structs via adapt_column and custom adapter support in ListingTable
Which issue does this PR close?
This is the last of a series of PRs re-implementing #15295 to close #14757 by adding schema-evolution support for listing-based tables with nested structs in DataFusion.
Closes #14757.
Rationale for this change
Current schema evolution support in DataFusion does not handle nested struct fields, limiting the flexibility of reading evolving data formats like Parquet and JSON. This PR introduces a robust mechanism to adapt nested structures between file and table schemas, enabling safer and more dynamic schema handling.
What changes are included in this PR?
- New `adapt_column` utility in `datafusion_common::nested_struct`:
  - Recursively adapts struct arrays to match target field types, used across adapters for nested schemas.
- New `nested_schema_adapter` module in `datafusion_datasource`:
  - `NestedStructSchemaAdapter` and `NestedStructSchemaAdapterFactory` provide tailored support for struct field evolution.
- Integrated `SchemaAdapterFactory` into:
  - `ListingTableConfig` and `ListingTable`, allowing custom adapters to be injected at config/build time.
  - Execution and statistics collection paths to support schema adaptation during scan planning and file listing.
- Added detailed tests:
  - Validate `NestedStructSchemaAdapter` with evolving schemas and statistics mapping.
  - Verify fallback and error propagation behaviors for incompatible schemas.
  - Ensure coverage of adapter selection and `map_batch` / `map_column_statistics` paths.
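To make the recursive adaptation idea concrete, here is a minimal, std-only sketch. It is a toy model, not the `adapt_column` implementation from this PR: `Type`, `Value`, `adapt`, and `example` are hypothetical names, values stand in for Arrow arrays, and the rules it applies (fill missing target fields with null, drop file-only fields, reject mismatched leaf types) are assumptions about what a nested adapter typically does.

```rust
// Toy stand-ins for Arrow data types and values; NOT the Arrow or DataFusion types.
#[derive(Debug, Clone, PartialEq)]
enum Type {
    Int,
    Text,
    Struct(Vec<(String, Type)>), // ordered (field name, field type) pairs
}

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
    Text(String),
    Struct(Vec<(String, Value)>),
}

/// Reshape `value` (as read from a file) to match `target` (the table type).
fn adapt(value: &Value, target: &Type) -> Result<Value, String> {
    match (value, target) {
        // A null stays null under any target type.
        (Value::Null, _) => Ok(Value::Null),
        (Value::Int(i), Type::Int) => Ok(Value::Int(*i)),
        (Value::Text(s), Type::Text) => Ok(Value::Text(s.clone())),
        (Value::Struct(fields), Type::Struct(target_fields)) => {
            let mut out = Vec::with_capacity(target_fields.len());
            // Walk the TARGET fields: file-only fields are dropped implicitly.
            for (name, ty) in target_fields {
                let adapted = match fields.iter().find(|(n, _)| n == name) {
                    // Field exists in the file: recurse into it.
                    Some((_, child)) => adapt(child, ty)?,
                    // Field was added to the table later: fill with null.
                    None => Value::Null,
                };
                out.push((name.clone(), adapted));
            }
            Ok(Value::Struct(out))
        }
        // Anything else is treated as incompatible, and the error propagates.
        (other, ty) => Err(format!("cannot adapt {:?} to {:?}", other, ty)),
    }
}

/// Hypothetical data: the file has a `legacy` field the table dropped,
/// and lacks the `address.zip` field the table added.
fn example() -> Result<Value, String> {
    let file_row = Value::Struct(vec![
        ("id".to_string(), Value::Int(7)),
        ("address".to_string(), Value::Struct(vec![
            ("city".to_string(), Value::Text("Oslo".to_string())),
        ])),
        ("legacy".to_string(), Value::Text("x".to_string())),
    ]);
    let table_type = Type::Struct(vec![
        ("id".to_string(), Type::Int),
        ("address".to_string(), Type::Struct(vec![
            ("city".to_string(), Type::Text),
            ("zip".to_string(), Type::Text),
        ])),
    ]);
    adapt(&file_row, &table_type)
}
```

Calling `example()` yields a struct with `id`, and an `address` whose `zip` is null, while `legacy` is gone. Iterating over the target fields rather than the file fields keeps the output in table-schema order.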
Are these changes tested?
Yes, thoroughly:
- Unit tests for `adapt_column` covering deeply nested struct adaptation.
- Adapter selection logic tests ensuring appropriate fallback to default or nested adapter.
- Integration tests for `ListingTable` using JSON with schema drift.
- Robust testing of error propagation and column statistics correctness.
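The adapter selection mentioned above could look roughly like the following std-only sketch. The selection criterion (use the nested adapter only when the table schema contains a struct somewhere, otherwise fall back to the default) is an assumption, and `Type`, `AdapterKind`, `contains_struct`, and `select_adapter` are hypothetical names rather than DataFusion APIs.

```rust
// Toy schema types; NOT the Arrow `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum Type {
    Int,
    Text,
    List(Box<Type>),
    Struct(Vec<(String, Type)>),
}

#[derive(Debug, PartialEq)]
enum AdapterKind {
    Default,
    NestedStruct,
}

/// True if `ty` is a struct or wraps one (for example, a list of structs).
fn contains_struct(ty: &Type) -> bool {
    match ty {
        Type::Struct(_) => true,
        Type::List(inner) => contains_struct(inner),
        Type::Int | Type::Text => false,
    }
}

/// Pick the nested adapter only when some column needs it;
/// flat schemas fall back to the default adapter.
fn select_adapter(table_schema: &[(String, Type)]) -> AdapterKind {
    if table_schema.iter().any(|(_, ty)| contains_struct(ty)) {
        AdapterKind::NestedStruct
    } else {
        AdapterKind::Default
    }
}
```

Under this assumption, a schema with only scalar and list-of-scalar columns keeps using the default path, so flat tables are unaffected by the nested logic.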
Are there any user-facing changes?
Yes:
- Users can now provide a custom `SchemaAdapterFactory` when constructing a `ListingTable` using `with_schema_adapter_factory`.
- `datafusion_common::nested_struct::adapt_column` is now available as a utility to help with custom schema adaptation logic.
- Native support for nested struct schema evolution is enabled by default via `NestedStructSchemaAdapterFactory`.
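The injection pattern described above can be sketched with std-only toy types. Only the method name `with_schema_adapter_factory` comes from this PR; `ToyAdapter`, `ToyAdapterFactory`, `ToyTableConfig`, and both factories are hypothetical stand-ins, and the real DataFusion traits have different signatures.

```rust
use std::sync::Arc;

/// Toy adapter: projects a row of named file values onto the table columns.
trait ToyAdapter {
    fn map_row(&self, file_row: &[(&str, i64)]) -> Vec<Option<i64>>;
}

/// Toy factory: builds an adapter for a given list of table columns.
trait ToyAdapterFactory {
    fn create(&self, table_columns: Vec<String>) -> Box<dyn ToyAdapter>;
}

/// Shared adapter; `fill` is the value used for columns missing from the file.
struct FillAdapter {
    table_columns: Vec<String>,
    fill: Option<i64>,
}

impl ToyAdapter for FillAdapter {
    fn map_row(&self, file_row: &[(&str, i64)]) -> Vec<Option<i64>> {
        self.table_columns
            .iter()
            .map(|col| {
                file_row
                    .iter()
                    .find(|(name, _)| *name == col.as_str())
                    .map(|(_, v)| *v)
                    .or(self.fill)
            })
            .collect()
    }
}

/// Built-in behaviour in this sketch: missing columns become null (`None`).
struct NullFillingFactory;
impl ToyAdapterFactory for NullFillingFactory {
    fn create(&self, table_columns: Vec<String>) -> Box<dyn ToyAdapter> {
        Box::new(FillAdapter { table_columns, fill: None })
    }
}

/// A user-supplied alternative: missing columns become zero.
struct ZeroFillingFactory;
impl ToyAdapterFactory for ZeroFillingFactory {
    fn create(&self, table_columns: Vec<String>) -> Box<dyn ToyAdapter> {
        Box::new(FillAdapter { table_columns, fill: Some(0) })
    }
}

struct ToyTableConfig {
    table_columns: Vec<String>,
    factory: Option<Arc<dyn ToyAdapterFactory>>,
}

impl ToyTableConfig {
    fn new(table_columns: Vec<String>) -> Self {
        Self { table_columns, factory: None }
    }

    /// Builder-style injection point, mirroring the method name from the PR.
    fn with_schema_adapter_factory(mut self, factory: Arc<dyn ToyAdapterFactory>) -> Self {
        self.factory = Some(factory);
        self
    }

    /// Adapter used at scan time: the injected factory if any, else the built-in one.
    fn adapter(&self) -> Box<dyn ToyAdapter> {
        match &self.factory {
            Some(f) => f.create(self.table_columns.clone()),
            None => NullFillingFactory.create(self.table_columns.clone()),
        }
    }
}
```

Storing the factory as an optional trait object keeps the default path unchanged for callers who never set one, which matches the stated goal of enabling the nested behaviour by default while still allowing overrides.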
Sorry, I have seen this one but haven't found time to review it yet.
cc @adriangb and @timsaucer
I'll try to review tomorrow.
I took a look the other day. My thought was that, while it's complex code that is a bit hard for me to fully wrap my head around, it's well tested and isolated, so I highly doubt it would break any existing functionality. And if I understand correctly, the functionality it does touch currently does not work at all, so it will be strictly an improvement.
However, I am concerned with how this is proposed to be wired in (it seems like it isn't connected to anything 🤔 )
@alamb I asked @kosiew to move that out of the PR to keep it a bit smaller. The plan is to wire it up to ListingTable. I know we want to keep ListingTable simple but it's a relatively small change: https://github.com/apache/datafusion/pull/16371#discussion_r2157585906