
Enable schema evolution for nested structs via adapt_column and custom adapter support in ListingTable

Open kosiew opened this issue 8 months ago • 1 comments

Which issue does this PR close?

This is the last of a series of PRs re-implementing #15295 to close #14757 by adding schema-evolution support for listing-based tables with nested structs in DataFusion.

  • Closes #14757.

Rationale for this change

Current schema evolution support in DataFusion does not handle nested struct fields, limiting the flexibility of reading evolving data formats like Parquet and JSON. This PR introduces a robust mechanism to adapt nested structures between file and table schemas, enabling safer and more dynamic schema handling.

What changes are included in this PR?

  • New adapt_column utility in datafusion_common::nested_struct:
    • Recursively adapts struct arrays to match target field types, used across adapters for nested schemas.
  • New nested_schema_adapter module in datafusion_datasource:
    • NestedStructSchemaAdapter and NestedStructSchemaAdapterFactory provide tailored support for struct field evolution.
  • Integrated SchemaAdapterFactory into:
    • ListingTableConfig and ListingTable, allowing custom adapters to be injected at config/build time.
    • Execution and statistics collection paths to support schema adaptation during scan planning and file listing.
  • Added detailed tests:
    • Validate NestedStructSchemaAdapter with evolving schemas and statistics mapping.
    • Verify fallback and error propagation behaviors for incompatible schemas.
    • Ensure coverage of adapter selection and map_batch/map_column_statistics paths.
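
To make the recursive idea behind adapt_column concrete, here is a minimal sketch using only the Rust standard library. It is not the PR's implementation: the real utility works on Arrow struct arrays, while the `Ty`, `Value`, and `adapt_value` names below are hypothetical toy stand-ins that work on single values. The assumed rules are that fields missing from the source become null, fields absent from the target are dropped, and nested structs are adapted recursively.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a data type (hypothetical; not Arrow's `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int,
    Text,
    Struct(Vec<(String, Ty)>),
}

/// Toy stand-in for a single value (the real code works on whole arrays).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
    Text(String),
    Struct(BTreeMap<String, Value>),
}

/// Adapt `value` to the `target` type under the assumed rules:
/// - nulls pass through unchanged,
/// - scalars must already have the target type,
/// - for structs, keep only the target's fields, fill missing ones with
///   null, and recurse into each child.
fn adapt_value(value: &Value, target: &Ty) -> Result<Value, String> {
    match (value, target) {
        (Value::Null, _) => Ok(Value::Null),
        (Value::Int(_), Ty::Int) | (Value::Text(_), Ty::Text) => Ok(value.clone()),
        (Value::Struct(fields), Ty::Struct(target_fields)) => {
            let mut out = BTreeMap::new();
            for (name, ty) in target_fields {
                let adapted = match fields.get(name) {
                    // Field exists in the file data: adapt it recursively.
                    Some(child) => adapt_value(child, ty)?,
                    // Field was added to the table schema later: fill with null.
                    None => Value::Null,
                };
                out.insert(name.clone(), adapted);
            }
            Ok(Value::Struct(out))
        }
        // Any other combination is treated as incompatible in this sketch.
        (other, ty) => Err(format!("cannot adapt {other:?} to {ty:?}")),
    }
}
```

In this toy model an incompatible pairing, such as an integer where the target expects text, surfaces as an `Err` rather than being silently coerced, which mirrors the error-propagation behavior the tests above are described as covering.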

Are these changes tested?

Yes, thoroughly:

  • Unit tests for adapt_column covering deeply nested struct adaptation.
  • Adapter selection logic tests ensuring appropriate fallback to default or nested adapter.
  • Integration tests for ListingTable using JSON with schema drift.
  • Robust testing of error propagation and column statistics correctness.
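
The adapter selection with fallback mentioned above could look roughly like the following sketch. The selection rule used here (pick the nested adapter only when some field contains a struct, otherwise fall back to the default) is an assumption made for illustration, and `FieldType`, `AdapterKind`, `contains_struct`, and `select_adapter` are hypothetical names, not DataFusion APIs.

```rust
/// Toy field type (hypothetical; not Arrow's `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum FieldType {
    Int,
    Text,
    List(Box<FieldType>),
    Struct(Vec<(String, FieldType)>),
}

/// Which adapter the sketch would pick for a schema.
#[derive(Debug, PartialEq)]
enum AdapterKind {
    DefaultAdapter,
    NestedStructAdapter,
}

/// True if the type is a struct or wraps one (here: inside a list).
fn contains_struct(ty: &FieldType) -> bool {
    match ty {
        FieldType::Struct(_) => true,
        FieldType::List(inner) => contains_struct(inner),
        FieldType::Int | FieldType::Text => false,
    }
}

/// Assumed rule: use the nested adapter only when the schema needs it,
/// otherwise fall back to the default adapter.
fn select_adapter(schema: &[(String, FieldType)]) -> AdapterKind {
    if schema.iter().any(|(_, ty)| contains_struct(ty)) {
        AdapterKind::NestedStructAdapter
    } else {
        AdapterKind::DefaultAdapter
    }
}
```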

Are there any user-facing changes?

Yes:

  • Users can now provide a custom SchemaAdapterFactory when constructing a ListingTable using with_schema_adapter_factory.
  • datafusion_common::nested_struct::adapt_column is now available as a utility to help with custom schema adaptation logic.
  • Native support for nested struct schema evolution is enabled by default via NestedStructSchemaAdapterFactory.
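
The injection pattern described in the first bullet can be sketched with a small builder. This is a toy model, not DataFusion's API: the `ToyAdapter`, `ToyAdapterFactory`, and `ToyTableConfig` types and their signatures are hypothetical simplifications, and only the method name with_schema_adapter_factory is taken from the PR description.

```rust
use std::sync::Arc;

/// Hypothetical, simplified adapter: the real trait maps schemas and batches.
trait ToyAdapter {
    fn describe(&self) -> String;
}

/// Hypothetical, simplified factory that produces adapters on demand.
trait ToyAdapterFactory {
    fn create(&self) -> Box<dyn ToyAdapter>;
}

/// Adapter used when the caller does not inject anything.
struct BuiltInAdapter;
impl ToyAdapter for BuiltInAdapter {
    fn describe(&self) -> String {
        "built-in".to_string()
    }
}
struct BuiltInFactory;
impl ToyAdapterFactory for BuiltInFactory {
    fn create(&self) -> Box<dyn ToyAdapter> {
        Box::new(BuiltInAdapter)
    }
}

/// Example of a user-supplied adapter and factory.
struct CustomAdapter;
impl ToyAdapter for CustomAdapter {
    fn describe(&self) -> String {
        "custom".to_string()
    }
}
struct CustomFactory;
impl ToyAdapterFactory for CustomFactory {
    fn create(&self) -> Box<dyn ToyAdapter> {
        Box::new(CustomAdapter)
    }
}

/// Toy table configuration holding an optional injected factory.
struct ToyTableConfig {
    adapter_factory: Option<Arc<dyn ToyAdapterFactory>>,
}

impl ToyTableConfig {
    fn new() -> Self {
        Self { adapter_factory: None }
    }

    /// Builder-style setter, named after the method in the PR description.
    fn with_schema_adapter_factory(mut self, factory: Arc<dyn ToyAdapterFactory>) -> Self {
        self.adapter_factory = Some(factory);
        self
    }

    /// Use the injected factory if present, otherwise the built-in one.
    fn create_adapter(&self) -> Box<dyn ToyAdapter> {
        match &self.adapter_factory {
            Some(factory) => factory.create(),
            None => BuiltInFactory.create(),
        }
    }
}
```

Storing the factory as an `Option` keeps the default path untouched for callers who never inject anything, which is one way to limit the impact of such a change on existing users.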

kosiew, Jun 11 '25 15:06

Sorry, I have seen this one but haven't found time to review it yet.

cc @adriangb and @timsaucer

alamb, Jun 18 '25 21:06

I'll try to review tomorrow.

I took a look the other day. My thought was that, while it's complex code that is a bit hard for me to fully wrap my head around, it's well tested and isolated, so I highly doubt it would break any existing functionality. If I understand correctly, the functionality it does touch currently does not work at all, so it will be strictly an improvement.

adriangb, Jun 18 '25 23:06

However, I am concerned with how this is proposed to be wired in (it seems like it isn't connected to anything 🤔 )

@alamb I asked @kosiew to move that out of the PR to keep it a bit smaller. The plan is to wire it up to ListingTable. I know we want to keep ListingTable simple but it's a relatively small change: https://github.com/apache/datafusion/pull/16371#discussion_r2157585906

adriangb, Jun 26 '25 17:06