[SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers

Open nikolamand-db opened this issue 1 year ago • 5 comments

What changes were proposed in this pull request?

Languages and localization for collations are supported through the ICU library. Collation names have the following format:

<2-letter language code>[_<4-letter script>][_<3-letter country code>][_specifier_specifier...]

The locale specifier is the first part of the collation name (language + script + country). Locale specifiers need to be stable across ICU versions; to keep existing ids and names invariant, we introduce a golden file with the locale table, which should cause a CI failure on any silent change.
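As an illustration of the naming format above, here is a minimal sketch of how a name could be split into its locale part and its specifiers. The class CollationNameSketch, its parse method, and the example name sr_Cyrl_SRB_CI_AI are hypothetical and are not taken from this PR; the sketch checks only the shape of the name, not whether ICU knows the locale.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spark: splits a collation name into its
// locale prefix and the trailing specifiers, following the format above.
final class CollationNameSketch {
    // language (2 letters), optional script (4 letters), optional country
    // (3 letters), then zero or more "_specifier" tokens.
    private static final Pattern NAME = Pattern.compile(
        "([A-Za-z]{2})(?:_([A-Za-z]{4}))?(?:_([A-Za-z]{3}))?((?:_[A-Za-z]+)*)");

    final String locale;
    final List<String> specifiers;

    private CollationNameSketch(String locale, List<String> specifiers) {
        this.locale = locale;
        this.specifiers = specifiers;
    }

    static CollationNameSketch parse(String name) {
        Matcher m = NAME.matcher(name);
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed collation name: " + name);
        }
        // Rebuild the locale from the parts that were actually present.
        StringBuilder locale = new StringBuilder(m.group(1));
        if (m.group(2) != null) {
            locale.append('_').append(m.group(2));
        }
        if (m.group(3) != null) {
            locale.append('_').append(m.group(3));
        }
        // Group 4 is "" or a run like "_CI_AI"; drop the empty leading token.
        List<String> specifiers = new ArrayList<>();
        for (String token : m.group(4).split("_")) {
            if (!token.isEmpty()) {
                specifiers.add(token);
            }
        }
        return new CollationNameSketch(locale.toString(), specifiers);
    }
}
```

The split is unambiguous only because the specifiers listed in this description are 2 or 5 letters long, so they cannot be mistaken for a 4-letter script or a 3-letter country code. A specifier of length 3 or 4 would need a different approach.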

Currently supported optional specifiers:

  • CS/CI - case sensitivity, default is case-sensitive; supported by configuring ICU collation levels
  • AS/AI - accent sensitivity, default is accent-sensitive; supported by configuring ICU collation levels
  • <unspecified>/LCASE/UCASE - case conversion performed prior to comparisons, default is unspecified; supported by an internal implementation that relies on ICU locale-aware conversions
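The description says only that CS/CI and AS/AI are handled by "configuring ICU collation levels". The sketch below shows the conventional way those four combinations map onto ICU's strength and case-level attributes; it is an assumption about the mapping, not a copy of what this PR does. IcuLevelSketch and forSpecifiers are hypothetical names, and the code deliberately does not call ICU so that it stays self-contained.

```java
// Hypothetical illustration, not part of Spark: the usual ICU settings for
// each case/accent sensitivity combination.
final class IcuLevelSketch {
    enum Strength { PRIMARY, SECONDARY, TERTIARY }

    final Strength strength;  // deepest comparison level that is considered
    final boolean caseLevel;  // whether ICU's extra case level is switched on

    private IcuLevelSketch(Strength strength, boolean caseLevel) {
        this.strength = strength;
        this.caseLevel = caseLevel;
    }

    static IcuLevelSketch forSpecifiers(boolean caseSensitive, boolean accentSensitive) {
        if (accentSensitive) {
            // Accents are secondary differences; case is a tertiary difference.
            Strength s = caseSensitive ? Strength.TERTIARY : Strength.SECONDARY;
            return new IcuLevelSketch(s, false);
        }
        // Primary strength ignores both accents and case; the case level
        // brings case back without bringing accents back.
        return new IcuLevelSketch(Strength.PRIMARY, caseSensitive);
    }
}
```

With ICU4J, these two values would typically be applied through Collator.setStrength and RuleBasedCollator.setCaseLevel.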

Users can give collation specifiers in any order, except for the locale, which is mandatory and must come first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory.
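Accepting specifiers in any order means that several spellings can denote the same collation. One way to reconcile that with a one-to-one id/name mapping is to rewrite every accepted spelling into a single canonical one. The sketch below assumes a canonical order of case, then accent, then case conversion; that order, the class SpecifierOrderSketch, and the example names are hypothetical and do not describe how CollationFactory actually behaves.

```java
import java.util.Locale;

// Hypothetical illustration, not part of Spark: puts specifiers into one
// fixed order and rejects unknown or conflicting ones.
final class SpecifierOrderSketch {
    // Returns the slot of a specifier (0 = case, 1 = accent, 2 = conversion),
    // or -1 if the token is not a known specifier.
    private static int groupOf(String token) {
        switch (token.toUpperCase(Locale.ROOT)) {
            case "CS": case "CI": return 0;
            case "AS": case "AI": return 1;
            case "LCASE": case "UCASE": return 2;
            default: return -1;
        }
    }

    static String normalize(String name) {
        String[] parts = name.split("_");
        // parts[0] is always the language code, even if it looks like a
        // specifier (for example "as"). Keep the locale parts untouched.
        StringBuilder out = new StringBuilder(parts[0]);
        int i = 1;
        while (i < parts.length && groupOf(parts[i]) < 0) {
            out.append('_').append(parts[i]);
            i++;
        }
        // Everything after the locale must be a known specifier, with at
        // most one specifier per slot.
        String[] chosen = new String[3];
        for (; i < parts.length; i++) {
            int group = groupOf(parts[i]);
            if (group < 0) {
                throw new IllegalArgumentException("Unknown specifier: " + parts[i]);
            }
            if (chosen[group] != null) {
                throw new IllegalArgumentException("Conflicting specifier: " + parts[i]);
            }
            chosen[group] = parts[i].toUpperCase(Locale.ROOT);
        }
        for (String specifier : chosen) {
            if (specifier != null) {
                out.append('_').append(specifier);
            }
        }
        return out.toString();
    }
}
```

Under these assumptions, en_USA_AI_CI and en_USA_CI_AI both normalize to en_USA_CI_AI, so only the canonical spelling would need an entry in an id/name table.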

Why are the changes needed?

To add languages and localization support for collations.

Does this PR introduce any user-facing change?

Yes, it adds new predefined collations.

How was this patch tested?

Added checks to CollationFactorySuite and an ICU locale map golden file.

Was this patch authored or co-authored using generative AI tooling?

No.

nikolamand-db avatar Apr 23 '24 08:04 nikolamand-db

Please review collation team @dbatomic @stefankandic @uros-db @mihailom-db @stevomitric.

nikolamand-db avatar Apr 24 '24 15:04 nikolamand-db

User can use collation specifiers in any order except of locale which is mandatory and must go first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory.

Does that mean that there are multiple collation names that have identical meaning but different names (with the specifiers in different order)? That seems confusing and error prone. Are we at least normalizing them internally somehow?

bart-samwel avatar Apr 26 '24 13:04 bart-samwel

will we have to do the same for pyspark - as StringType there only supports 4 initial collations?

stefankandic avatar May 07 '24 16:05 stefankandic

User can use collation specifiers in any order except of locale which is mandatory and must go first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory.

Does that mean that there are multiple collation names that have identical meaning but different names (with the specifiers in different order)? That seems confusing and error prone. Are we at least normalizing them internally somehow?

+1 on @bart-samwel 's concern. Is there a good reason to allow the order to be relaxed?

mkaravel avatar May 09 '24 00:05 mkaravel

How do we name a trailing-space-insensitive collation?

mkaravel avatar May 09 '24 00:05 mkaravel

@mkaravel @dbatomic please review again, thanks.

nikolamand-db avatar May 17 '24 06:05 nikolamand-db

do we have end-to-end tests for this new feature?

cloud-fan avatar May 22 '24 11:05 cloud-fan

thanks, merging to master!

cloud-fan avatar May 28 '24 16:05 cloud-fan