[SPARK-48739][SQL] Disable writing collated data to file formats that don't support them in non managed tables
What changes were proposed in this pull request?
Disable writing collated types to data sources that don't support them. Spark managed tables should still work, because their schema is stored in HMS and not in the file itself.
Why are the changes needed?
Right now, when users write a collated type directly to JSON, text, ORC, and similar formats, the collation is lost when the data is read back.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added new unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.
@cloud-fan Please take a look when you find the time.
For internal file source API, I think we can simply update FileFormat#supportDataType in certain formats such as CSV to return false for string with collation. So no new API is needed.
For data source v1, we can add a new API to CreatableRelationProvider, such as supportsStringCollation. Ideally this would not be needed, as we already have CreatableRelationProvider#supportsDataType. But string collation is special: a collated string is still a StringType, so existing v1 data sources may mistakenly support it if they match with case _: StringType => true.
UPDATE: actually, CreatableRelationProvider#supportsDataType is newly added in Spark 4.0 (not released yet). We can change it to reject strings with collation, so that no existing v1 source supports collated strings unless it overrides supportsDataType to explicitly support them.
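To illustrate the pitfall and the proposed default, here is a minimal, self-contained sketch. It does not use the real Spark classes: DataType, StringType, IntegerType, RelationProviderSketch, LegacySource and CollationAwareSource are simplified stand-ins invented for this example, and only the collation names UTF8_BINARY and UTF8_LCASE are taken from Spark. The real CreatableRelationProvider#supportsDataType may differ in detail.

```scala
// Simplified stand-ins for illustration only; NOT the org.apache.spark.sql.types API.
sealed trait DataType
class StringType(val collation: String) extends DataType {
  // Assumption: UTF8_BINARY plays the role of the default (non-collated) string.
  def isDefaultCollation: Boolean = collation == "UTF8_BINARY"
}
object StringType extends StringType("UTF8_BINARY")
case object IntegerType extends DataType

// Hypothetical provider trait whose default rejects collated strings.
trait RelationProviderSketch {
  def supportsDataType(dt: DataType): Boolean = dt match {
    case s: StringType => s.isDefaultCollation
    case _             => true
  }
}

// A source that overrides nothing inherits the safe default.
object LegacySource extends RelationProviderSketch

// A source that really handles collations has to opt in explicitly.
object CollationAwareSource extends RelationProviderSketch {
  override def supportsDataType(dt: DataType): Boolean = dt match {
    case _: StringType => true
    case other         => super.supportsDataType(other)
  }
}

object CollationSupportSketch {
  // The pattern a source might have written before collations existed:
  // a type test on StringType also matches every collated instance.
  def naiveSupports(dt: DataType): Boolean = dt match {
    case _: StringType => true
    case IntegerType   => true
  }

  def main(args: Array[String]): Unit = {
    val collated = new StringType("UTF8_LCASE")
    println(naiveSupports(collated))                         // true: accepted by accident
    println(LegacySource.supportsDataType(collated))         // false: rejected by default
    println(LegacySource.supportsDataType(StringType))       // true: plain strings still work
    println(CollationAwareSource.supportsDataType(collated)) // true: explicit opt-in
  }
}
```

The point of the sketch is that a type test on StringType cannot distinguish collated from non-collated strings, so the safe behavior has to come from the default implementation rather than from each source.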
@cloud-fan
> For internal file source API, I think we can simply update FileFormat#supportDataType in certain formats such as CSV to return false for string with collation. So no new API is needed.
This would mean that we wouldn't be able to create Spark managed tables with collations for those formats. Is that something that we want to do?
> This would mean that we wouldn't be able to create Spark managed tables with collations for those formats. Is that something that we want to do?
To confirm the goal of this PR: do we want a new API for file sources to indicate that a type is supported only when a catalog is present? I think we should be more specific about this, as there are many ways to use a file source:
- read a path with a user-specified schema
- write to a path
- create external table with a path
- create managed table
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!