Collations proof of concept

Open dbatomic opened this issue 2 years ago • 0 comments

Rough POC for collations in Spark

High level changes

Collation suite that tests the currently supported features (start with this file).
Global CollatorFactory. This global table stores all available collations in the system. It can provide collation-aware comparators and hashes to UTF8String functions.
UTF8String is extended with a single integer that specifies the collation. We could be even more aggressive and pack this integer into a short, or even a byte. The id identifies a cached comparator that can be fetched from CollatorFactory.
UTF8String respects this id for equality checks and comparisons.
StringType class is extended with a collationId field. We keep using the existing case object StringType to mark the default (UTF-8 binary) collation. By doing this, we should keep binary compatibility with previous versions.
Extending PhysicalStringType with collationId field.
Support for aggregates, which needed changes because they currently rely on pure byte-for-byte comparison for group building.
Support for merge join (hash based joins are TODO).
ICU is used as the collation library. ICU4J only exposes UTF-16 APIs, which means that we need to do a UTF-8 -> UTF-16 conversion on every comparison, which is very suboptimal. An alternative is to call ICU4C through JNI, which does support UTF-8 APIs.
Unit benchmark that covers UTF8String operations with and without collators. At this point, the measured difference between collated and uncollated comparison is ~12x.
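
As a rough illustration of the id-based design described above (a global table of collators, strings tagged with an integer id, and comparison, equality and hashing that respect that id), here is a minimal sketch. It is not the code in this PR: the class and method names (CollationRegistrySketch, CollatedString, register, countGroups) are hypothetical, java.lang.String stands in for UTF8String, and the JDK's java.text.Collator stands in for ICU so the example runs without extra dependencies.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch only; names and layout are illustrative, not the POC's classes.
final class CollationRegistrySketch {
    // Assumption: id 0 is reserved for the default binary collation.
    static final int BINARY_ID = 0;

    private static final List<Collator> COLLATORS = new ArrayList<>();
    private static final Map<String, Integer> IDS = new HashMap<>();

    static {
        COLLATORS.add(null); // slot 0: binary collation needs no collator
    }

    // Registers a collation once and returns its small integer id.
    static synchronized int register(String name, Locale locale, int strength) {
        Integer existing = IDS.get(name);
        if (existing != null) {
            return existing;
        }
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        COLLATORS.add(collator);
        int id = COLLATORS.size() - 1;
        IDS.put(name, id);
        return id;
    }

    static synchronized Collator collator(int id) {
        return COLLATORS.get(id);
    }

    // A string value tagged with a collation id (stand-in for UTF8String).
    static final class CollatedString implements Comparable<CollatedString> {
        final String value;
        final int collationId;

        CollatedString(String value, int collationId) {
            this.value = value;
            this.collationId = collationId;
        }

        @Override
        public int compareTo(CollatedString other) {
            if (collationId != other.collationId) {
                throw new IllegalArgumentException("collation mismatch");
            }
            if (collationId == BINARY_ID) {
                // UTF-16 code unit order here; a stand-in for byte-wise comparison.
                return value.compareTo(other.value);
            }
            return collator(collationId).compare(value, other.value);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof CollatedString)) {
                return false;
            }
            CollatedString other = (CollatedString) o;
            return collationId == other.collationId && compareTo(other) == 0;
        }

        @Override
        public int hashCode() {
            if (collationId == BINARY_ID) {
                return value.hashCode();
            }
            // Hash the collation key so strings equal under the collator share a hash.
            byte[] key = collator(collationId).getCollationKey(value).toByteArray();
            return Arrays.hashCode(key);
        }
    }

    // Hash-based grouping, the way an aggregate would build its groups.
    static int countGroups(String[] values, int collationId) {
        Map<CollatedString, Integer> counts = new HashMap<>();
        for (String s : values) {
            counts.merge(new CollatedString(s, collationId), 1, Integer::sum);
        }
        return counts.size();
    }

    public static void main(String[] args) {
        String[] values = {"Spark", "SPARK", "spark", "Flink"};
        int id = register("en-primary", Locale.ENGLISH, Collator.PRIMARY);
        System.out.println("binary groups:   " + countGroups(values, BINARY_ID));
        System.out.println("collated groups: " + countGroups(values, id));
    }
}
```

The point of the sketch is that equality and hashing must agree: once equality goes through the collator, hashing has to go through a collation key (or something equivalent), otherwise hash-based grouping would split values that the collation treats as equal.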

Supported features at this point:

collate expression -> input string is cast to StringType(collation).
Collation rules are ICU collator based. Caller provides locale and strength (primary, secondary, tertiary, identical). For example, collate(input, 'sr-primary') collates input using the Serbian locale at primary strength, which ignores both case and accents. Secondary strength ignores case but respects accents, and tertiary respects both.
collation expression -> returns collation name of given input.
Support for basic operators (filters, aggregate, joins, views, inline tables etc.).
Support for CREATE TABLE(a STRING COLLATE x) syntax.
Basic parquet support by disabling filter pushdown.
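
To make the strength levels concrete, here is a small sketch of how a name such as 'sr-primary' could be mapped to a collator. The parsing of the `<locale>-<strength>` form and the names CollationNameSketch and fromName are assumptions for illustration, and the JDK's java.text.Collator is used as a stand-in for the ICU collator that the POC actually relies on, so results for locale-specific tailorings may differ from ICU.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical sketch; the name format "<locale>-<strength>" is assumed from the example above.
final class CollationNameSketch {

    static Collator fromName(String name) {
        int dash = name.lastIndexOf('-');
        if (dash <= 0) {
            throw new IllegalArgumentException("expected <locale>-<strength>: " + name);
        }
        Locale locale = Locale.forLanguageTag(name.substring(0, dash));
        int strength;
        switch (name.substring(dash + 1)) {
            case "primary":
                strength = Collator.PRIMARY;
                break;
            case "secondary":
                strength = Collator.SECONDARY;
                break;
            case "tertiary":
                strength = Collator.TERTIARY;
                break;
            case "identical":
                strength = Collator.IDENTICAL;
                break;
            default:
                throw new IllegalArgumentException("unknown strength in: " + name);
        }
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator;
    }

    public static void main(String[] args) {
        String accented = "\u00e1"; // 'a' with acute accent
        for (String name : new String[] {"en-primary", "en-secondary", "en-tertiary"}) {
            Collator c = fromName(name);
            System.out.println(name
                + ": a==A? " + (c.compare("a", "A") == 0)
                + ", a==\u00e1? " + (c.compare("a", accented) == 0));
        }
    }
}
```

With these settings, primary strength treats "a", "A" and the accented form as equal, secondary distinguishes only the accented form, and tertiary distinguishes all three, which matches the behaviour described for the collate expression.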

Proper testing (and creating a real test strategy) is TBD.

Also TBD: proper Parquet and Delta support, different collation levels (column level, table level, database level) and much more extensive testing of other features.

Reviewers of this POC should start with CollationSuite and the newly added tests in UTF8StringSuite to get the gist of the changes in this PR.

Dec 29 '23 12:12 dbatomic