
Support for German Umlaute äöü

Open sbleh opened this issue 3 years ago • 1 comments

Description

Typesense treats German Umlaute "ÄÖÜ" like "AOU", so it does not distinguish between them.


Expected Behavior

Treat them as distinct characters, so ü !== u. If I type "Büch" I'd like to see "Bücher" higher than "Buch".

Metadata

Typesense Version: 0.23.1

OS: Linux

sbleh avatar Feb 06 '23 07:02 sbleh

+1 but I'd like to add the Swedish "å".

osfa avatar Feb 09 '24 15:02 osfa

A potential fix for this is in 0.26.0.rc62, when locale is set to de or sv (or any other locales that have umlauts), in the field definition in the collection schema.

Could you give it a shot and confirm if it works as expected now?
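For anyone trying this, here is a minimal sketch of what such a field definition might look like. The collection name `places` and the helper `build_schema` are hypothetical; only the per-field `locale` setting comes from the comment above.

```python
import json


def build_schema(collection_name: str, field_name: str, locale: str) -> dict:
    """Return a collection schema whose string field carries a locale.

    `collection_name` and `field_name` are illustrative placeholders.
    """
    return {
        "name": collection_name,
        "fields": [
            # The locale is set on the field, not on the collection as a whole.
            {"name": field_name, "type": "string", "locale": locale},
        ],
    }


schema = build_schema("places", "title", "sv")
print(json.dumps(schema, indent=2, ensure_ascii=False))
```

The resulting dictionary is what would be sent as the JSON body when creating the collection; swap `"sv"` for `"de"` for German data.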

jasonbosco avatar Mar 06 '24 17:03 jasonbosco

Is it autosetting the locale somehow? Seems to work as expected even without setting the collection locale (empty string).

osfa avatar Mar 06 '24 18:03 osfa

@osfa

No, it won't work without setting the locale. See here: https://gist.github.com/kishorenc/d1c5fcfe8c0b3dd598b6ecdf4c466c3c

We want Ängelholm to be returned first but it's not. If you set sv locale to the title field, it will work as expected.

kishorenc avatar Mar 07 '24 10:03 kishorenc

Why does this require setting a locale? With international data, a collection is expected to contain special characters from many languages. I'd argue that TS should generally rank an exact match higher than an "it's kinda the same character, just without umlauts" match.

Hades32 avatar Mar 07 '24 17:03 Hades32

I'd argue that TS should generally rank an exact match higher than an "it's kinda the same character, just without umlauts" match.

We can only index one version of the text: either with accents and umlauts or without. Whatever normalization we apply at indexing time has to be applied to the query as well. It's not possible to run both versions and prioritize one over the other, for performance reasons.
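The constraint can be illustrated with a toy index. This is only a sketch of the general idea, not Typesense's actual implementation: `fold`, `keep`, and `search` are hypothetical helpers, and the diacritic stripping uses Python's standard `unicodedata` module.

```python
import unicodedata


def fold(text: str) -> str:
    """Strip diacritics and case, so 'Bär' and 'Bar' become the same token."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def keep(text: str) -> str:
    """Keep diacritics, only unify the Unicode form and the case."""
    return unicodedata.normalize("NFC", text).casefold()


def search(query: str, titles: list[str], normalize) -> list[str]:
    """Index titles under one normalized form and look the query up the same way."""
    index: dict[str, list[str]] = {}
    for title in titles:
        index.setdefault(normalize(title), []).append(title)
    # The query must go through the same normalization as the indexed text.
    return index.get(normalize(query), [])


titles = ["Bar", "Bär"]
folded_hits = search("Bär", titles, fold)  # both titles share the token "bar"
exact_hits = search("Bär", titles, keep)   # only the title with the umlaut
print(folded_hits, exact_hits)
```

With folding, the index can no longer tell the two titles apart, so there is nothing left to rank on; without folding, only the exact form is found.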

kishorenc avatar Mar 08 '24 09:03 kishorenc

So your current fix really just does exact match on umlauts and then only returns non-umlaut matches through typo-correction? Is this the same for characters like é è ê even though they don't exist e.g. in the de locale? Wouldn't it be then better to just have a config that basically says "don't normalize to a-z"?

Hades32 avatar Mar 08 '24 12:03 Hades32

Sorry for the confusion. We can prioritize one over the other when typo_tokens_threshold is set to a larger number so that typo matches are allowed. Otherwise, since the umlaut is matched exactly, only those records are returned. The behavior is the same for other accented characters when a locale is set.
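As a rough sketch, the search parameters for such a query might be assembled like this. The field name `title`, the helper `build_search_params`, and the threshold value of 10 are illustrative assumptions; only the `typo_tokens_threshold` parameter itself is taken from the comment above.

```python
from urllib.parse import urlencode


def build_search_params(query: str, field: str, typo_tokens_threshold: int) -> dict:
    """Assemble search parameters that let typo matches follow the exact ones."""
    return {
        "q": query,
        "query_by": field,
        # A larger value means typo-corrected matches (e.g. "Buch" for "Büch")
        # are still looked up when exact matches alone return too few results.
        "typo_tokens_threshold": typo_tokens_threshold,
    }


params = build_search_params("Büch", "title", 10)
query_string = urlencode(params)  # non-ASCII characters are percent-encoded as UTF-8
print(query_string)
```

The right threshold depends on how many results you want before typo matches are considered, so treat the value here as a placeholder.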

kishorenc avatar Mar 17 '24 14:03 kishorenc

No, it won't work without setting the locale. See here: https://gist.github.com/kishorenc/d1c5fcfe8c0b3dd598b6ecdf4c466c3c

We want Ängelholm to be returned first but it's not. If you set sv locale to the title field, it will work as expected.

Ah. In the other build it seemed to strip these characters completely, though? That behavior seems fixed here even without setting the locale, although, as you point out, the results aren't ranked correctly unless the locale is set.

osfa avatar Mar 18 '24 11:03 osfa

Yes, without a locale we will still strip those symbols, since they have no meaning in English, which is the default locale.

kishorenc avatar Mar 18 '24 11:03 kishorenc

Released in v26

jasonbosco avatar Apr 02 '24 21:04 jasonbosco