[SPARK-48739][SQL] Disable writing collated data to file formats that don't support them in non managed tables

Open stefankandic opened this issue 1 year ago • 2 comments

What changes were proposed in this pull request?

Disable writing collated types to data sources that don't support them. Spark managed tables should still work, however, because their schema is stored in the Hive Metastore (HMS) and not in the file itself.

Why are the changes needed?

Right now, when users write a collated type directly to JSON, text, ORC, etc., they will not see that collation when reading the data back.
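
A minimal sketch of why the collation disappears, written in plain Python rather than Spark. A JSON file stores values only, so any type information that lives in the schema is gone once the data is read back without that schema. The `Column` class and the `write_json_lines` / `read_json_lines` helpers are hypothetical stand-ins for illustration, not Spark APIs.

```python
import io
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical column descriptor; not a Spark class."""
    name: str
    collation: str = "UTF8_BINARY"  # Spark's default collation


def write_json_lines(rows, schema, out):
    # Only the values are written; the schema (and its collations) never reaches the file.
    for row in rows:
        out.write(json.dumps({c.name: v for c, v in zip(schema, row)}) + "\n")


def read_json_lines(src):
    rows = [json.loads(line) for line in src if line.strip()]
    # With no schema in the file, every column comes back with the default collation.
    names = sorted({key for row in rows for key in row})
    return rows, [Column(n) for n in names]


written_schema = [Column("name", collation="UNICODE_CI")]
buf = io.StringIO()
write_json_lines([("Alice",), ("alice",)], written_schema, buf)
buf.seek(0)
rows, read_schema = read_json_lines(buf)
```

Here `written_schema` carries `UNICODE_CI`, while `read_schema` falls back to the default, which is the silent loss the description refers to.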

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

stefankandic avatar Jun 27 '24 15:06 stefankandic

@cloud-fan Please take a look when you find the time

stefankandic avatar Jun 27 '24 20:06 stefankandic

For the internal file source API, I think we can simply update FileFormat#supportDataType in certain formats, such as CSV, to return false for strings with a collation. So no new API is needed.

For data source v1, we can add a new API to CreatableRelationProvider, like supportsStringCollation. Ideally this would not be needed, as we already have CreatableRelationProvider#supportsDataType. But string collation is special: a collated string is still a StringType, so existing v1 data sources may mistakenly support it if they do case _: StringType => true.

UPDATE: actually, CreatableRelationProvider#supportsDataType is newly added in Spark 4.0 (not released yet). We can change it to not support strings with a collation, so that no existing v1 source will support them unless it overrides supportsDataType to explicitly do so.
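
The pitfall with a type-pattern match can be sketched outside Spark. The Python below is only an analogy, assuming a hypothetical `StringType` stand-in with a `collation` field; the function names are illustrative and do not exist in Spark. A check on the type alone accepts every collation, while a check against the default string value accepts only the non-collated case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringType:
    """Hypothetical stand-in for Spark's StringType; not the real class."""
    collation: str = "UTF8_BINARY"  # Spark's default collation


DEFAULT_STRING = StringType()


def supports_by_type_match(data_type) -> bool:
    # Analogous to `case _: StringType => true`:
    # any StringType instance passes, collated or not.
    return isinstance(data_type, StringType)


def supports_by_equality(data_type) -> bool:
    # Accepts only the default (non-collated) string type.
    return data_type == DEFAULT_STRING


collated = StringType(collation="UNICODE_CI")
```

With these definitions, `supports_by_type_match(collated)` is true even though the source was never written with collations in mind, whereas `supports_by_equality(collated)` is false. Rejecting collated strings by default, as proposed in the update, avoids relying on every source using the stricter check.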

cloud-fan avatar Jun 28 '24 02:06 cloud-fan

@cloud-fan

> For the internal file source API, I think we can simply update FileFormat#supportDataType in certain formats, such as CSV, to return false for strings with a collation. So no new API is needed.

This would mean that we wouldn't be able to create Spark managed tables with collations for those formats. Is that something that we want to do?

stefankandic avatar Jul 01 '24 09:07 stefankandic

> This would mean that we wouldn't be able to create Spark managed tables with collations for those formats. Is that something that we want to do?

To confirm the goal of this PR: do we want a new API for file sources to indicate that a type is supported only with a catalog? I think we should be more specific about this, as there are many ways to use a file source:

  1. read a path with a user-specified schema
  2. write to a path
  3. create external table with a path
  4. create managed table
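
One way to make the question concrete is to write the rule as a function of the usage. The sketch below encodes only the reading suggested by the PR title (managed tables are the sole exemption); the thread does not settle whether that is the intended rule, and the `Usage` enum and `allows_collated_strings` function are hypothetical names, not Spark APIs.

```python
from enum import Enum


class Usage(Enum):
    """The four ways of using a file source listed above."""
    READ_PATH_WITH_USER_SCHEMA = 1
    WRITE_TO_PATH = 2
    CREATE_EXTERNAL_TABLE = 3
    CREATE_MANAGED_TABLE = 4


def allows_collated_strings(usage: Usage, format_keeps_collation: bool) -> bool:
    # A format that can persist the collation itself needs no restriction.
    if format_keeps_collation:
        return True
    # Assumption taken from the PR title: only managed tables are exempt,
    # because their schema lives in the catalog rather than in the file.
    return usage is Usage.CREATE_MANAGED_TABLE


decisions = {u.name: allows_collated_strings(u, format_keeps_collation=False) for u in Usage}
```

Under this reading, cases 1 to 3 are rejected for formats such as CSV and only case 4 passes; a different answer for any of the first three cases would change the body of the function, which is why the goal needs to be stated per usage.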

cloud-fan avatar Jul 01 '24 15:07 cloud-fan

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Oct 10 '24 00:10 github-actions[bot]