Support for German Umlaute äöü
Description
Typesense treats German Umlaute "ÄÖÜ" like "AOU", so it does not distinguish between them.

Expected Behavior
Treat them as distinct characters, so ü !== u. If I type "Büch" I'd like to see "Bücher" higher than "Buch".
Metadata
Typesense Version: 0.23.1
OS: Linux
+1 but I'd like to add the Swedish "å".
A potential fix for this is available in 0.26.0.rc62. It takes effect when locale is set to de or sv (or any other locale that has umlauts) in the field definition in the collection schema.
Could you give it a shot and confirm if it works as expected now?
Is it autosetting the locale somehow? Seems to work as expected even without setting the collection locale (empty string).
@osfa
No, it won't work without setting the locale. See here: https://gist.github.com/kishorenc/d1c5fcfe8c0b3dd598b6ecdf4c466c3c
We want Ängelholm to be returned first but it's not. If you set sv locale to the title field, it will work as expected.
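For readers who want to try this, here is a minimal sketch of a collection schema with the locale set on the title field. The collection name `places` and the helper `build_schema` are hypothetical and only serve the illustration. The per-field `locale` key is the part the comments above refer to.

```python
import json


def build_schema(locale: str = "sv") -> dict:
    """Build a collection schema whose title field carries a locale.

    The collection name "places" is a made-up example; the relevant
    part is the "locale" key on the field definition.
    """
    return {
        "name": "places",
        "fields": [
            {"name": "title", "type": "string", "locale": locale},
        ],
    }


schema = build_schema("sv")
print(json.dumps(schema, ensure_ascii=False, indent=2))

# With the official Python client and a running server, the schema
# would be sent roughly like this (not executed here):
#   client.collections.create(schema)
```

Swapping `"sv"` for `"de"` should give the equivalent setup for German text.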
Why does this require setting a locale? With international data it's expected to have special characters from many languages in the collection. I'd argue that TS should generally rank an exact match higher than an "it's kinda the same character just without umlauts" match.
I'd argue that TS should generally rank an exact match higher than an "it's kinda the same character just without umlauts" match.
We can only index one version of the text: either with accents and umlauts or without. Whatever formatting we apply at index time, we have to repeat for the query. It's not possible to run both versions and prioritize one over the other, for performance reasons.
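To illustrate the trade-off described here, the following is a rough sketch using Python's standard `unicodedata` module. It is not Typesense's actual tokenizer; `fold_diacritics` and `index_key` are hypothetical helpers that only show why the same normalization must be applied to both the indexed text and the query.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Strip combining marks so that 'ü' becomes 'u' and 'Ä' becomes 'A'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_key(text: str, keep_diacritics: bool) -> str:
    """Normalize text the same way for indexing and for querying."""
    lowered = unicodedata.normalize("NFC", text.lower())
    return lowered if keep_diacritics else fold_diacritics(lowered)


# With folding, the query "Büch" and the token "Buch" collapse to the same key,
# so the index cannot tell them apart.
print(index_key("Büch", keep_diacritics=False) == index_key("Buch", keep_diacritics=False))

# Without folding, the two stay distinct and only exact umlaut matches agree.
print(index_key("Büch", keep_diacritics=True) == index_key("Buch", keep_diacritics=True))
```

Only one of the two key forms is stored per token, which is why both variants cannot be ranked against each other in a single lookup.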
So your current fix really just does exact match on umlauts and then only returns non-umlaut matches through typo-correction? Is this the same for characters like é è ê even though they don't exist e.g. in the de locale? Wouldn't it be then better to just have a config that basically says "don't normalize to a-z"?
Sorry for the confusion. We can prioritize one over the other when typo_tokens_threshold is set to a larger number so that typo matches are allowed. Otherwise, since the umlaut is matched exactly, only those records are returned. The behavior is the same for other accented characters when a locale is set.
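As a sketch of what such a search might look like, the parameters below set typo_tokens_threshold alongside the query. The value 10, the `title` field, the `places` collection, and the helper `build_search_params` are assumptions for illustration. A suitable threshold depends on the data.

```python
def build_search_params(query: str, typo_tokens_threshold: int = 10) -> dict:
    """Build search parameters that allow typo-corrected matches as well.

    The threshold of 10 is an arbitrary example value, not a recommendation.
    """
    return {
        "q": query,
        "query_by": "title",
        "typo_tokens_threshold": typo_tokens_threshold,
    }


search_params = build_search_params("Büch")
print(search_params)

# With the official Python client and a running server, the search
# would be issued roughly like this (not executed here):
#   client.collections["places"].documents.search(search_params)
```

With a locale set on the field, exact umlaut matches would be expected first, with non-umlaut variants such as "Buch" arriving only as typo-corrected matches.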
No, it won't work without setting the locale. See here: https://gist.github.com/kishorenc/d1c5fcfe8c0b3dd598b6ecdf4c466c3c
We want Ängelholm to be returned first but it's not. If you set sv locale to the title field, it will work as expected.
Ah. In the other build it seemed to strip these characters completely, though? That behavior seems fixed here even without setting a locale, although, as you point out, it doesn't rank them correctly unless the locale is set.
Yes, without a locale we will still strip those symbols, since they don't have a meaning in English, which is the default locale.
Released in v26