LinguaCafe German nouns incorrectly converted to lowercase

I just realised that I posted this in the wrong place. Can you have a look at it?

It is die Mannschaft, not die mannschaft.

I'll also convert lemmas to lowercase by default. I never noticed it, but sometimes the lemma has an uppercase letter in it.

Is this the reason German nouns are lowercase by default although they are supposed to be capitalized? German nouns always start with a capital letter: der Fußball instead of fußball! Is there any way you can revert that, @simjanos-dev?

[Screenshot: screen-2025-05-22-22-16-17]

Originally posted by @TheNomad11 in https://github.com/simjanos-dev/LinguaCafe/issues/61#issuecomment-2902452605

Jul 24 '25 14:07 TheNomad11

Hi!

I'll fix this, but I don't know when.

The fix for dictionary search results would be to add a column to the dictionary tables that contains the non-lowercased version, or to use a lowercase function in the search query if it doesn't hurt performance too much.
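A minimal sketch of the "additional column" idea, using an in-memory map instead of a real table. Everything here is hypothetical (`buildDictionaryIndex`, `searchDictionary`, and the entry shape are not LinguaCafe code): the lowercased key stands in for the existing search column, and the entry keeps the original casing for display.

```javascript
// Hypothetical stand-in for a dictionary table: the map key is the
// lowercased search column, each entry keeps the original casing.
function buildDictionaryIndex(entries) {
    const index = new Map();
    for (const entry of entries) {
        const searchKey = entry.word.toLowerCase();
        if (!index.has(searchKey)) {
            index.set(searchKey, []);
        }
        index.get(searchKey).push(entry);
    }
    return index;
}

// Match case-insensitively, but return entries with their original casing.
function searchDictionary(index, term) {
    return index.get(term.toLowerCase()) ?? [];
}

const dictionaryIndex = buildDictionaryIndex([
    { word: 'Essen', definition: 'food, meal' },
    { word: 'essen', definition: 'to eat' },
    { word: 'Fußball', definition: 'football' },
]);

// A lowercase search term still finds the capitalized noun entries.
console.log(searchDictionary(dictionaryIndex, 'essen').map((entry) => entry.word));
console.log(searchDictionary(dictionaryIndex, 'fußball').map((entry) => entry.word));
```

With this layout the search stays case-insensitive while the displayed word keeps whatever casing the dictionary source had.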

For German lemmas with uppercase first letters... I don't know what the solution would be. To fix this I would need a way to differentiate between words that are capitalized because they are the first word of a sentence and words that are capitalized because they are German nouns. spaCy may already do this; I'll have to check.

Jul 28 '25 15:07 simjanos-dev

In German, you should just need to check whether the word is a noun. All nouns are capitalized in German; otherwise I think you can safely assume lowercase, and spaCy definitely has part-of-speech tags for German. The main issue is overlapping forms such as essen and Essen, where the first is the infinitive "to eat" and the latter is "food" or "meal". Most verbs can be turned into a noun this way in German.
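A rough sketch of that rule, assuming the tokenizer hands back a lemma plus a Universal POS tag such as 'NOUN', 'PROPN' or 'VERB'. The token shape and the function name `normalizeGermanLemma` are made up for illustration, and the result is only as good as the tagger's decision on pairs like essen/Essen.

```javascript
// Hypothetical token shape: { lemma: string, pos: string }.
function normalizeGermanLemma(token) {
    const lemma = token.lemma;
    if (lemma === '') {
        return lemma;
    }
    if (token.pos === 'NOUN') {
        // Capitalize only the first letter; leave the rest untouched
        // so forms like "E-Mail" keep their inner capital.
        return lemma.charAt(0).toUpperCase() + lemma.slice(1);
    }
    if (token.pos === 'PROPN') {
        // Proper nouns: keep whatever casing the tokenizer returned.
        return lemma;
    }
    // Everything else is assumed to be lowercase, which also undoes
    // sentence-initial capitalization.
    return lemma.toLowerCase();
}

// Same letters, different part of speech, different result.
console.log(normalizeGermanLemma({ lemma: 'essen', pos: 'NOUN' }));
console.log(normalizeGermanLemma({ lemma: 'Essen', pos: 'VERB' }));
```

The sentence-initial case falls out of the same check: a verb at the start of a sentence is still tagged as a verb, so it ends up lowercase.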

The fix for dictionary search results would be to add a column to the dictionary tables that contains the non-lowercased version, or to use a lowercase function in the search query if it doesn't hurt performance too much.

Would it not make more sense to use a boolean to indicate upper/lowercase instead of including the entire word in a separate column?
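To make the trade-off concrete, here is a small sketch of the boolean variant. The names `toStoredForm` and `toDisplayForm` and the row shape are hypothetical. It assumes the flag means "only the first letter is uppercase", so a word with any other capital letter cannot be reconstructed from the lowercase form plus one bit.

```javascript
// Hypothetical row shape: { word: lowercase string, capitalized: boolean }.
function toStoredForm(word) {
    const lower = word.toLowerCase();
    const firstUpper = lower.charAt(0).toUpperCase() + lower.slice(1);
    // True only when the original differs from lowercase by its first letter alone.
    const capitalized = word !== lower && word === firstUpper;
    return { word: lower, capitalized };
}

function toDisplayForm(row) {
    if (!row.capitalized) {
        return row.word;
    }
    return row.word.charAt(0).toUpperCase() + row.word.slice(1);
}

function roundTrips(word) {
    return toDisplayForm(toStoredForm(word)) === word;
}

console.log(roundTrips('Mannschaft')); // true
console.log(roundTrips('essen'));      // true
console.log(roundTrips('E-Mail'));     // false: the inner capital is lost
```

So a boolean is enough for ordinary nouns, while a full original-case column also covers abbreviations and hyphenated forms.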

Aug 11 '25 22:08 cblanken

DeepWiki has this answer. I don't know if it is useful:

Answer

You're asking about German nouns being displayed in lowercase in LinguaCafe, when they should be capitalized according to German grammar rules.

Current Behavior

German nouns in LinguaCafe are currently being converted to lowercase during text processing and dictionary operations. This happens in three main places:

1. Dictionary Import Processing

When importing dictionaries like Wiktionary, words are explicitly converted to lowercase. [1]

2. Text Processing

During tokenization and lemma generation, German nouns get gender articles prepended (der/die/das) but the noun itself remains lowercase. [2]

3. Search Term Trimming

When searching for words, the system trims the gender articles but keeps the noun lowercase. [3]

Why This Happens

The lowercase conversion is intentional for consistency in dictionary lookups and word matching. The system needs to match words regardless of their position in a sentence (which affects capitalization in German).

Possible Solutions

Unfortunately, there's no built-in setting to change this behavior currently. To preserve German noun capitalization, you would need to modify the codebase:

- Remove lowercase conversion in dictionary imports
- Adjust the tokenizer to preserve original capitalization for German nouns (identified by POS tag 'NOUN')
- Update search logic to handle case-insensitive matching while preserving display capitalization

This would require changes across multiple files in both the PHP backend and Python tokenizer service.
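As an illustration of the tokenizer-side change, the sketch below combines the article lookup with noun capitalization. It is written in JavaScript for brevity rather than in the project's PHP, and the token shape (`lemma`, `pos`, `gender`) and the function `buildDisplayLemma` are assumptions, not the actual LinguaCafe data model.

```javascript
// Gender values follow the 'Fem' / 'Masc' / 'Neut' strings seen in the cited PHP code.
const ARTICLE_BY_GENDER = new Map([
    ['Fem', 'die'],
    ['Masc', 'der'],
    ['Neut', 'das'],
]);

// Hypothetical token shape: { lemma: string, pos: string, gender: string[] }.
function buildDisplayLemma(token) {
    if (token.lemma === '') {
        return '';
    }
    if (token.pos !== 'NOUN') {
        return token.lemma.toLowerCase();
    }
    // Nouns keep a capital first letter instead of being lowercased.
    const noun = token.lemma.charAt(0).toUpperCase() + token.lemma.slice(1);
    const article = ARTICLE_BY_GENDER.get(token.gender[0]);
    return article ? `${article} ${noun}` : noun;
}

console.log(buildDisplayLemma({ lemma: 'mannschaft', pos: 'NOUN', gender: ['Fem'] }));
// die Mannschaft
console.log(buildDisplayLemma({ lemma: 'fußball', pos: 'NOUN', gender: ['Masc'] }));
// der Fußball
```

Nouns without a gender tag simply come back capitalized with no article.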

Notes

This is a known limitation of the current text processing pipeline. German is one of the supported languages with gender tagging, but the capitalization convention for nouns is not currently preserved. [4] If you'd like this feature added, consider creating a feature request on the GitHub repository as mentioned in the contributing guidelines. [5]

Citations

[1] File: app/Services/DictionaryImportService.php (L658-658)

            $word = mb_strtolower($word, 'UTF-8');

[2] File: app/Services/TextBlockService.php (L243-258)

            if ($this->language == 'german') { 
                // nouns' lemma needs der/die/das before them
                if ($this->tokenizedWords[$wordIndex]->pos == 'NOUN' && $this->tokenizedWords[$wordIndex]->l !== '') {
                    if (count($this->tokenizedWords[$wordIndex]->g) && $this->tokenizedWords[$wordIndex]->g[0] =='Fem') {
                        $word->lemma = 'die ' . $word->lemma;
                    }

                    if (count($this->tokenizedWords[$wordIndex]->g) && $this->tokenizedWords[$wordIndex]->g[0] == 'Masc') {
                        $word->lemma = 'der ' . $word->lemma;
                    }

                    if (count($this->tokenizedWords[$wordIndex]->g) && $this->tokenizedWords[$wordIndex]->g[0] == 'Neut') {
                        $word->lemma = 'das ' . $word->lemma;
                    }
                    
                }

[3] File: resources/js/components/Text/TextBlockGroup.vue (L1971-1982)

                // german
                if (this.$props.language == 'german' && searchTerm.substring(0, 4) == 'die ') {
                    trimmedSearchTerm = searchTerm.slice(4);
                }

                if (this.$props.language == 'german' && searchTerm.substring(0, 4) == 'der ') {
                    trimmedSearchTerm = searchTerm.slice(4);
                }

                if (this.$props.language == 'german' && searchTerm.substring(0, 4) == 'das ') {
                    trimmedSearchTerm = searchTerm.slice(4);
                }
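For comparison, the three checks above could be folded into one pattern. This is only a sketch with a hypothetical helper name (`trimGermanArticle`). It matches the same lowercase articles followed by a space and leaves the rest of the term, including its capitalization, untouched.

```javascript
// Matches a leading "der ", "die " or "das " (lowercase, followed by a space).
const GERMAN_ARTICLE_PREFIX = /^(?:der|die|das) /;

function trimGermanArticle(searchTerm) {
    return searchTerm.replace(GERMAN_ARTICLE_PREFIX, '');
}

console.log(trimGermanArticle('die Mannschaft')); // Mannschaft
console.log(trimGermanArticle('dieser'));         // dieser
```

Because only the article is removed, a capitalized lemma such as "der Fußball" would reach the dictionary search as "Fußball".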

[4] File: manual/Setup.md (L217-217)

| <img src='images/flags/german.png' width='25'>    | German    | &check; | &check;          | &check;             | wiktionary, dict cc   |

[5] File: CONTRIBUTING.md (L7-11)

#### Feature requests
In general I'm happy to add most feature requests to the list, but I would also like to keep the scope of LinguaCafe similar to when it's started. Currently I would like to avoid adding a few things:
- Most large AI tools.
- Any built-in video features.
- Niche features that only a small percent of the users would use and difficult to maintain.

Nov 23 '25 21:11 TheNomad11