[SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers
What changes were proposed in this pull request?
Languages and localization for collations are supported by the ICU library. The collation naming format is as follows:
<2-letter language code>[_<4-letter script>][_<3-letter country code>][_specifier_specifier...]
The locale specifier consists of the first part of the collation name (language + script + country). Locale specifiers need to be stable across ICU versions; to keep existing ids and names invariant, we introduce a golden file with the locale table, which should cause a CI failure on any silent changes.
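For illustration only, the golden-file idea can be sketched as follows. This is not the code in this PR: the function names, the tab-separated file layout, and the example ids are all hypothetical. The sketch renders a locale-to-id table deterministically and fails if it differs from the stored copy.

```python
from pathlib import Path
from typing import Dict


def render_locale_table(locale_ids: Dict[str, int]) -> str:
    """Render the locale table as stable text, one 'id<TAB>locale' line per entry, ordered by id."""
    rows = sorted(locale_ids.items(), key=lambda item: item[1])
    return "".join(f"{locale_id}\t{locale}\n" for locale, locale_id in rows)


def check_against_golden(locale_ids: Dict[str, int], golden_path: Path,
                         regenerate: bool = False) -> None:
    """Compare the current table with the golden file; raise if they differ.

    With regenerate=True the golden file is rewritten instead, which makes
    any change to the table an explicit, reviewable diff.
    """
    rendered = render_locale_table(locale_ids)
    if regenerate:
        golden_path.write_text(rendered, encoding="utf-8")
        return
    expected = golden_path.read_text(encoding="utf-8")
    if rendered != expected:
        raise AssertionError(
            f"Locale table differs from golden file {golden_path}; "
            "regenerate it only if the change is intentional."
        )
```

If an ICU upgrade silently renumbers or renames a locale, the rendered table no longer matches the golden file and the check fails.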
Currently supported optional specifiers:
- CS/CI: case sensitivity, default is case-sensitive; supported by configuring ICU collation levels
- AS/AI: accent sensitivity, default is accent-sensitive; supported by configuring ICU collation levels
- <unspecified>/LCASE/UCASE: case conversion performed prior to comparisons, default is unspecified; supported by internal implementation relying on ICU locale-aware conversions
Users can give collation specifiers in any order, except for the locale, which is mandatory and must come first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory.
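As a rough sketch of the naming format above (not the CollationFactory implementation), the Python snippet below parses a collation name and normalizes the specifier order. The `Collation` tuple, both function names, the canonical `locale_case_accent[_conversion]` ordering, and the rejection of repeated or conflicting specifiers are assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional

CASE = {"CS", "CI"}            # case sensitivity, default CS
ACCENT = {"AS", "AI"}          # accent sensitivity, default AS
CONVERSION = {"LCASE", "UCASE"}  # optional case conversion, default unspecified


class Collation(NamedTuple):
    locale: str
    case_sensitivity: str
    accent_sensitivity: str
    case_conversion: Optional[str]


def parse_collation_name(name: str) -> Collation:
    """Parse '<lang>[_<script>][_<country>][_specifier...]' into its parts."""
    parts = name.split("_")
    if not re.fullmatch(r"[A-Za-z]{2}", parts[0]):
        raise ValueError(f"missing 2-letter language code: {name!r}")
    locale = [parts[0]]
    i = 1
    # Optional 4-letter script, then optional 3-letter country code.
    if i < len(parts) and re.fullmatch(r"[A-Za-z]{4}", parts[i]):
        locale.append(parts[i])
        i += 1
    if i < len(parts) and re.fullmatch(r"[A-Za-z]{3}", parts[i]):
        locale.append(parts[i])
        i += 1
    case = accent = conversion = None
    # Remaining tokens are specifiers and may appear in any order.
    for spec in parts[i:]:
        if spec in CASE and case is None:
            case = spec
        elif spec in ACCENT and accent is None:
            accent = spec
        elif spec in CONVERSION and conversion is None:
            conversion = spec
        else:
            raise ValueError(f"unknown or repeated specifier {spec!r} in {name!r}")
    return Collation("_".join(locale), case or "CS", accent or "AS", conversion)


def canonical_name(collation: Collation) -> str:
    """Build one fixed spelling so that equivalent names compare equal."""
    parts = [collation.locale, collation.case_sensitivity, collation.accent_sensitivity]
    if collation.case_conversion:
        parts.append(collation.case_conversion)
    return "_".join(parts)
```

Under these assumptions, `sr_Cyrl_SRB_AI_CI` and `sr_Cyrl_SRB_CI_AI` parse to the same value, so one canonical name can stand in for every accepted spelling.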
Why are the changes needed?
To add languages and localization support for collations.
Does this PR introduce any user-facing change?
Yes, it adds new predefined collations.
How was this patch tested?
Added checks to CollationFactorySuite and ICU locale map golden file.
Was this patch authored or co-authored using generative AI tooling?
No.
Please review collation team @dbatomic @stefankandic @uros-db @mihailom-db @stevomitric.
User can use collation specifiers in any order except of locale which is mandatory and must go first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory.
Does that mean that there are multiple collation names that have identical meaning but different names (with the specifiers in different order)? That seems confusing and error prone. Are we at least normalizing them internally somehow?
will we have to do the same for pyspark - as StringType there only supports 4 initial collations?
+1 on @bart-samwel's concern. Is there a good reason to allow the order to be relaxed?
How do we name a trailing-space-insensitive collation?
@mkaravel @dbatomic please review again, thanks.
do we have end-to-end tests for this new feature?
thanks, merging to master!