spark [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers

What changes were proposed in this pull request?

Language and locale support for collations is provided by the ICU library. The collation naming format is as follows:

<2-letter language code>[_<4-letter script>][_<3-letter country code>][_specifier_specifier...]

The locale specifier is the first part of the collation name (language + script + country). Locale specifiers need to be stable across ICU versions; to keep existing ids and names invariant, we introduce a golden file with the locale table, which should cause a CI failure on any silent change.
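The golden-file idea can be illustrated with a minimal sketch. This is not the test added in this PR: the function names, the tab-separated file layout, and the id-to-name mapping passed in are all assumptions made for illustration. The point is only that the table is rendered deterministically and any difference from the checked-in text fails loudly with a diff.

```python
import difflib


def render_locale_table(locales):
    """Render a {collation id: locale name} mapping as sorted, tab-separated lines.

    The layout is hypothetical; the real golden file may look different.
    """
    return "".join(
        f"{locale_id}\t{name}\n" for locale_id, name in sorted(locales.items())
    )


def check_against_golden(locales, golden_text):
    """Raise AssertionError with a unified diff if the table differs from the golden text."""
    current = render_locale_table(locales)
    if current != golden_text:
        diff = "".join(
            difflib.unified_diff(
                golden_text.splitlines(keepends=True),
                current.splitlines(keepends=True),
                fromfile="golden",
                tofile="current",
            )
        )
        raise AssertionError("locale table changed:\n" + diff)
```

In a CI run, the golden text would be read from the checked-in file and the mapping would be rebuilt from the ICU version on the classpath, so an ICU upgrade that renames or reorders locales fails the check instead of silently shifting ids.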

Currently supported optional specifiers:

CS/CI - case sensitivity, default is case-sensitive; supported by configuring ICU collation levels
AS/AI - accent sensitivity, default is accent-sensitive; supported by configuring ICU collation levels
<unspecified>/LCASE/UCASE - case conversion performed prior to comparisons, default is unspecified; supported by internal implementation relying on ICU locale-aware conversions

Users can give collation specifiers in any order, except for the locale, which is mandatory and must come first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory.
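To make the naming format concrete, here is a rough sketch of how such a name could be split into a locale and a set of specifiers, and how names that differ only in specifier order could be reduced to one form. It is not the CollationFactory implementation: the function names, the choice to drop default specifiers, and the fixed output order are assumptions for illustration, and the locale `sr_Cyrl_SRB` is only an example that fits the pattern above.

```python
# Specifier groups from the description; at most one value per group may appear.
SPECIFIER_GROUPS = {
    "case_sensitivity": ("CS", "CI"),
    "accent_sensitivity": ("AS", "AI"),
    "case_conversion": ("LCASE", "UCASE"),
}
# Defaults from the description: case-sensitive, accent-sensitive, no case conversion.
DEFAULTS = {
    "case_sensitivity": "CS",
    "accent_sensitivity": "AS",
    "case_conversion": None,
}


def _is_letters(token, length):
    return len(token) == length and token.isascii() and token.isalpha()


def parse_collation_name(name):
    """Split a name into locale parts and specifier settings (hypothetical helper)."""
    tokens = name.split("_")
    if not _is_letters(tokens[0], 2):
        raise ValueError(f"name must start with a 2-letter language code: {name!r}")
    parsed = {"language": tokens[0], "script": None, "country": None, **DEFAULTS}
    index = 1
    # Optional 4-letter script, then optional 3-letter country; specifiers are
    # 2 or 5 letters long, so token length alone tells the parts apart.
    if index < len(tokens) and _is_letters(tokens[index], 4):
        parsed["script"] = tokens[index]
        index += 1
    if index < len(tokens) and _is_letters(tokens[index], 3):
        parsed["country"] = tokens[index]
        index += 1
    seen = set()
    for token in tokens[index:]:
        group = next((g for g, v in SPECIFIER_GROUPS.items() if token in v), None)
        if group is None:
            raise ValueError(f"unknown specifier {token!r} in {name!r}")
        if group in seen:
            raise ValueError(f"conflicting specifiers for {group} in {name!r}")
        seen.add(group)
        parsed[group] = token
    return parsed


def canonical_name(name):
    """Rebuild the name with specifiers in a fixed order, dropping defaults (assumed form)."""
    parsed = parse_collation_name(name)
    parts = [parsed[key] for key in ("language", "script", "country") if parsed[key]]
    for group in SPECIFIER_GROUPS:
        if parsed[group] != DEFAULTS[group]:
            parts.append(parsed[group])
    return "_".join(parts)
```

With this sketch, `canonical_name("sr_Cyrl_SRB_AI_CI")` and `canonical_name("sr_Cyrl_SRB_CI_AI")` both yield `sr_Cyrl_SRB_CI_AI`, while a name that puts a specifier before the locale or repeats a specifier group raises `ValueError`.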

Why are the changes needed?

To add languages and localization support for collations.

Does this PR introduce any user-facing change?

Yes, it adds new predefined collations.

How was this patch tested?

Added checks to CollationFactorySuite and ICU locale map golden file.

Was this patch authored or co-authored using generative AI tooling?

No.

Apr 23 '24 08:04 nikolamand-db

Collation team, please review: @dbatomic @stefankandic @uros-db @mihailom-db @stevomitric.

Apr 24 '24 15:04 nikolamand-db

User can use collation specifiers in any order except of locale which is mandatory and must go first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory.

Does that mean that there are multiple collation names that have identical meaning but different names (with the specifiers in different order)? That seems confusing and error prone. Are we at least normalizing them internally somehow?

Apr 26 '24 13:04 bart-samwel

Will we have to do the same for PySpark, since StringType there only supports the 4 initial collations?

May 07 '24 16:05 stefankandic

User can use collation specifiers in any order except of locale which is mandatory and must go first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory.

Does that mean that there are multiple collation names that have identical meaning but different names (with the specifiers in different order)? That seems confusing and error prone. Are we at least normalizing them internally somehow?

+1 on @bart-samwel's concern. Is there a good reason to allow the order to be relaxed?

May 09 '24 00:05 mkaravel

How do we name a trailing-space-insensitive collation?

May 09 '24 00:05 mkaravel

@mkaravel @dbatomic please review again, thanks.

May 17 '24 06:05 nikolamand-db

do we have end-to-end tests for this new feature?

May 22 '24 11:05 cloud-fan

thanks, merging to master!

May 28 '24 16:05 cloud-fan