Collations proof of concept
Rough POC for collations in Spark
High level changes
- Collation suite that tests the currently supported features (start with this file).
- Global `CollatorFactory`. This global table stores all available collations in the system. It can provide collation-aware comparators and hashes to `UTF8String`.
- `UTF8String` is extended with a single integer that specifies the collation. We could be even more aggressive and pack this integer into a short, or even a byte. The id is a cached comparator id that can be fetched from `CollatorFactory`.
- `UTF8String` respects this id for equality checks and comparisons.
- The `StringType` class is extended with a `collationId` field. We keep using the existing `case object StringType` to mark the default (UTF-8 binary) collation. By doing this, we should keep binary compatibility with previous versions.
- `PhysicalStringType` is extended with a `collationId` field.
- Support for aggregates, given that they currently rely on pure byte-for-byte comparison for group building.
- Support for merge join (hash-based joins are TODO).
- ICU is used as the collation library. ICU4J only exposes UTF-16 APIs, which means that we need to do a UTF-8 -> UTF-16 conversion on every comparison, which is very suboptimal. An alternative is to call ICU4C through JNI, which does support UTF-8 APIs.
- Unit benchmark that covers `UTF8String` operations with and without collators. At this point the measured difference between collated and uncollated comparison is ~12x.
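The interplay between the global collation table, the per-string collation id, and collation-aware equality and hashing can be sketched roughly as follows. This is a minimal illustration and not the POC's code: every class and method name here (`CollationSketch`, `Str`, `register`, `countGroups`) is hypothetical, and it uses the JDK's `java.text.Collator` as a stand-in for ICU4J so that it runs without extra dependencies (Java 9+ is assumed for `Arrays.compareUnsigned`). Like ICU4J, the JDK collator only accepts UTF-16 strings, so the sketch also shows the UTF-8 -> UTF-16 decode on every comparison.

```java
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; names are illustrative and do not come from the POC.
class CollationSketch {
    static final int BINARY = 0; // id 0 stands for the default UTF-8 binary collation

    // Global table: the list index is the collation id carried by each string.
    private static final List<Collator> TABLE = new ArrayList<>();
    static { TABLE.add(null); } // slot 0 is reserved for the binary collation

    // Registers a collator and returns the id that strings will carry.
    static synchronized int register(Locale locale, int strength) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        TABLE.add(collator);
        return TABLE.size() - 1;
    }

    static synchronized Collator lookup(int id) { return TABLE.get(id); }

    // Stand-in for UTF8String: raw UTF-8 bytes plus a single int collation id.
    static final class Str implements Comparable<Str> {
        final byte[] utf8;
        final int collationId;

        Str(String value, int collationId) {
            this.utf8 = value.getBytes(StandardCharsets.UTF_8);
            this.collationId = collationId;
        }

        // The collator API is UTF-16 only, so each call decodes the bytes first.
        private String decode() { return new String(utf8, StandardCharsets.UTF_8); }

        // Uses this string's collation; mixed-collation handling is out of scope here.
        @Override public int compareTo(Str other) {
            if (collationId == BINARY) return Arrays.compareUnsigned(utf8, other.utf8);
            return lookup(collationId).compare(decode(), other.decode());
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Str)) return false;
            Str other = (Str) o;
            return other.collationId == collationId && compareTo(other) == 0;
        }

        // Strings that are equal under the collation must hash alike,
        // so the hash is taken over the collation key, not the raw bytes.
        @Override public int hashCode() {
            if (collationId == BINARY) return Arrays.hashCode(utf8);
            return Arrays.hashCode(lookup(collationId).getCollationKey(decode()).toByteArray());
        }
    }

    // Group building as an aggregate would do it: one group per distinct key.
    static int countGroups(int collationId, String... values) {
        Set<Str> groups = new HashSet<>();
        for (String value : values) groups.add(new Str(value, collationId));
        return groups.size();
    }

    public static void main(String[] args) {
        int caseInsensitive = register(Locale.ROOT, Collator.SECONDARY);
        // Binary collation keeps case variants in separate groups.
        System.out.println(countGroups(BINARY, "spark", "Spark", "SPARK"));
        // Secondary strength ignores case, so the variants collapse into one group.
        System.out.println(countGroups(caseInsensitive, "spark", "Spark", "SPARK"));
    }
}
```

Hashing the collation key rather than the raw bytes is what lets hash-based group building see case variants as the same group; a byte-for-byte hash would split them.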
Supported features at this point:
- `collate` expression -> the input string is cast to `StringType(collation)`.
- Collation rules are ICU collator based. The caller provides a locale and a strength (primary, secondary, tertiary, identical). E.g. `collate(input, 'sr-primary')` will collate the input with the Serbian locale, ignoring both casing and accents. Secondary will ignore casing but respect accents, and tertiary will respect both.
- `collation` expression -> returns the collation name of the given input.
- Support for basic operators (filters, aggregates, joins, views, inline tables etc.).
- Support for `CREATE TABLE(a STRING COLLATE x)` syntax.
- Basic parquet support by disabling filter pushdown.
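To make the strength levels concrete, here is a small sketch of how a `<locale>-<strength>` name such as `sr-primary` could be turned into a collator. It is an assumption-laden illustration: the parser `fromName` and the helper `same` are hypothetical, the name format is inferred only from the `'sr-primary'` example above, and the JDK's `java.text.Collator` stands in for the ICU collator used by the POC (the two share the same strength vocabulary, but their locale data may differ).

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical sketch; java.text.Collator is used as a stand-in for ICU.
class CollationNameSketch {
    // Parses a name such as "sr-primary" (assumed format: <locale>-<strength>).
    static Collator fromName(String name) {
        int dash = name.lastIndexOf('-');
        if (dash < 0) {
            throw new IllegalArgumentException("expected <locale>-<strength>: " + name);
        }
        Collator collator = Collator.getInstance(Locale.forLanguageTag(name.substring(0, dash)));
        switch (name.substring(dash + 1)) {
            case "primary":   collator.setStrength(Collator.PRIMARY);   break;
            case "secondary": collator.setStrength(Collator.SECONDARY); break;
            case "tertiary":  collator.setStrength(Collator.TERTIARY);  break;
            case "identical": collator.setStrength(Collator.IDENTICAL); break;
            default: throw new IllegalArgumentException("unknown strength in: " + name);
        }
        return collator;
    }

    // True when the two strings are equal under the named collation.
    static boolean same(String name, String a, String b) {
        return fromName(name).compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String accented = "r\u00e9sum\u00e9"; // "resume" with acute accents on both e's
        // Primary: case and accents are both ignored.
        System.out.println(same("en-primary", "resume", "Resume"));
        System.out.println(same("en-primary", "resume", accented));
        // Secondary: case is ignored, accents are respected.
        System.out.println(same("en-secondary", "resume", "Resume"));
        System.out.println(same("en-secondary", "resume", accented));
        // Tertiary: case and accents are both respected.
        System.out.println(same("en-tertiary", "resume", "Resume"));
    }
}
```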
Proper testing (and creating a real test strategy) is TBD.
Also TBD are proper parquet and delta support, different collation levels (column level, table level, database level) and much more extensive testing of other features.
Suggestion for reviewers of this POC: start with CollationSuite and the newly added tests in UTF8StringSuite to get the gist of the changes in this PR.