German nouns incorrectly converted to lowercase
I just realised that I posted this in the wrong place. Can you have a look at it?
It is die Mannschaft, not die mannschaft.
I'll also convert lemmas to lowercase by default. I never noticed it, but sometimes the lemma has an uppercase letter in it.
Is this the reason German nouns are lowercase by default although they are supposed to be capitalized? German nouns always start with a capital letter: der Fußball, not fußball! Is there any way you can revert that, @simjanos-dev?
Originally posted by @TheNomad11 in https://github.com/simjanos-dev/LinguaCafe/issues/61#issuecomment-2902452605
Hi!
I'll fix this, but I don't know when.
The fix for dictionary search results would be to add a column to the dictionary tables that contains the non-lowercased version, or to use a lowercase function in the search query if it doesn't hurt performance too much.
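A minimal sketch of the extra-column option, using Python's built-in `sqlite3` purely for illustration. The table and column names (`dictionary_entries`, `word`, `word_lower`) and the helper functions are hypothetical, not LinguaCafe's actual schema. The lowercased copy is computed in application code and indexed, so the search query itself does not need to call a lowercase function.

```python
import sqlite3

# Hypothetical schema: not LinguaCafe's real dictionary tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dictionary_entries ("
    " word TEXT NOT NULL,"        # original casing, used for display
    " word_lower TEXT NOT NULL,"  # lowercased copy, used for matching
    " definition TEXT NOT NULL)"
)
conn.execute("CREATE INDEX idx_word_lower ON dictionary_entries (word_lower)")


def add_entry(word, definition):
    """Store the word as imported, plus a lowercased copy for lookups."""
    conn.execute(
        "INSERT INTO dictionary_entries VALUES (?, ?, ?)",
        (word, word.lower(), definition),
    )


def search(term):
    """Match case-insensitively, but return the original casing."""
    rows = conn.execute(
        "SELECT word, definition FROM dictionary_entries"
        " WHERE word_lower = ? ORDER BY word",
        (term.lower(),),
    )
    return rows.fetchall()


add_entry("Fußball", "football")
add_entry("Essen", "food, meal")
add_entry("essen", "to eat")

print(search("fußball"))
print(search("ESSEN"))
```

Searching for `ESSEN` here returns both the noun and the verb entry, each with its own casing, so the display layer can show "Essen" and "essen" separately.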
For German lemmas with an uppercase first letter, I don't know what the solution would be. To fix this, I would need a way to differentiate between words that are capitalized because they start a sentence and words that are capitalized because they are German nouns. spaCy may already do this; I'll have to check.
In German, you should only need to check whether the word is a noun. All nouns are capitalized in German; otherwise I think you can safely assume lowercase, and spaCy definitely has part-of-speech tags for German. The main issue is overlapping forms such as essen and Essen, where the first is the infinitive "to eat" and the second is "food" or "meal". Most verbs can be turned into a noun this way in German.
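The noun check could look roughly like the sketch below. It assumes the tokenizer already provides a lemma and a Universal POS tag for each token (in spaCy these are `token.lemma_` and `token.pos_`); the function name `display_lemma` is made up for illustration. Casing is decided by the tag alone, so a sentence-initial verb is lowercased and a noun is capitalized regardless of how the lemma arrived.

```python
def display_lemma(lemma: str, pos: str) -> str:
    """Restore German casing for a lemma from its part-of-speech tag.

    Hypothetical helper: assumes `pos` is a Universal POS tag such as
    "NOUN", "PROPN" or "VERB", as produced by the tagger.
    """
    if not lemma:
        return lemma
    if pos == "NOUN":
        # Common nouns are always capitalized in German.
        return lemma[0].upper() + lemma[1:]
    if pos == "PROPN":
        # Proper nouns may have unusual casing, so leave them untouched.
        return lemma
    return lemma.lower()


# (surface form, lemma, POS) triples as a tagger might emit them.
tokens = [
    ("Essen", "essen", "VERB"),   # sentence-initial verb
    ("Essen", "essen", "NOUN"),   # the noun "food, meal"
    ("Mannschaft", "mannschaft", "NOUN"),
]
for surface, lemma, pos in tokens:
    print(surface, pos, "->", display_lemma(lemma, pos))
```

This only works as well as the tagger does on essen/Essen pairs, and it does not handle other capitalized words such as the formal pronoun Sie.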
The fix for dictionary search results would be to add a column to the dictionary tables that contains the non-lowercased version, or to use a lowercase function in the search query if it doesn't hurt performance too much.
Would it not make more sense to use a boolean to indicate upper/lowercase instead of including the entire word in a separate column?
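To make the trade-off concrete, here is a small sketch of the boolean approach. The function names are hypothetical. It stores the lowercased word plus one flag for "first letter was uppercase" and rebuilds the display form from those two values.

```python
def to_stored(word):
    """Split a word into (lowercased form, first-letter-uppercase flag)."""
    return word.lower(), word[:1].isupper()


def to_display(lowered, capitalized):
    """Rebuild the display form from the stored pair."""
    if capitalized:
        return lowered[:1].upper() + lowered[1:]
    return lowered


def round_trips(word):
    """True if the boolean flag is enough to recover the original word."""
    return to_display(*to_stored(word)) == word


for word in ["Fußball", "essen", "Mannschaft", "GmbH", "USA"]:
    print(word, round_trips(word))
```

A single flag is enough for ordinary nouns and verbs, but it cannot recover entries with capitals beyond the first letter, such as GmbH or USA. Whether that matters depends on how many such entries the imported dictionaries contain.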
DeepWiki has this answer; I don't know if it is useful:
Answer
You're asking about German nouns being displayed in lowercase in LinguaCafe, when they should be capitalized according to German grammar rules.
Current Behavior
German nouns in LinguaCafe are currently being converted to lowercase during text processing and dictionary operations. This happens in three places:
1. Dictionary Import Processing
When importing dictionaries like Wiktionary, words are explicitly converted to lowercase: 1
2. Text Processing
During tokenization and lemma generation, German nouns get gender articles prepended (der/die/das) but the noun itself remains lowercase: 2
3. Search Term Trimming
When searching for words, the system trims the gender articles but keeps the noun lowercase: 3
Why This Happens
The lowercase conversion is intentional for consistency in dictionary lookups and word matching. The system needs to match words regardless of their position in a sentence (which affects capitalization in German).
Possible Solutions
Unfortunately, there's no built-in setting to change this behavior currently. To preserve German noun capitalization, you would need to modify the codebase:
- Remove lowercase conversion in dictionary imports
- Adjust the tokenizer to preserve original capitalization for German nouns (identified by POS tag 'NOUN')
- Update search logic to handle case-insensitive matching while preserving display capitalization
This would require changes across multiple files in both the PHP backend and Python tokenizer service.
Notes
This is a known limitation of the current text processing pipeline. German is one of the supported languages with gender tagging, but the capitalization convention for nouns is not currently preserved. 4 If you'd like this feature added, consider creating a feature request on the GitHub repository as mentioned in the contributing guidelines. 5
Citations
File: app/Services/DictionaryImportService.php (L658-658)
$word = mb_strtolower($word, 'UTF-8');
File: app/Services/TextBlockService.php (L243-258)
if ($this->language == 'german') {
// nouns' lemma needs der/die/das before them
if ($this->tokenizedWords[$wordIndex]->pos == 'NOUN' && $this->tokenizedWords[$wordIndex]->l !== '') {
if (count($this->tokenizedWords[$wordIndex]->g) && $this->tokenizedWords[$wordIndex]->g[0] =='Fem') {
$word->lemma = 'die ' . $word->lemma;
}
if (count($this->tokenizedWords[$wordIndex]->g) && $this->tokenizedWords[$wordIndex]->g[0] == 'Masc') {
$word->lemma = 'der ' . $word->lemma;
}
if (count($this->tokenizedWords[$wordIndex]->g) && $this->tokenizedWords[$wordIndex]->g[0] == 'Neut') {
$word->lemma = 'das ' . $word->lemma;
}
}
File: resources/js/components/Text/TextBlockGroup.vue (L1971-1982)
// german
if (this.$props.language == 'german' && searchTerm.substring(0, 4) == 'die ') {
trimmedSearchTerm = searchTerm.slice(4);
}
if (this.$props.language == 'german' && searchTerm.substring(0, 4) == 'der ') {
trimmedSearchTerm = searchTerm.slice(4);
}
if (this.$props.language == 'german' && searchTerm.substring(0, 4) == 'das ') {
trimmedSearchTerm = searchTerm.slice(4);
}
File: manual/Setup.md (L217-217)
| <img src='images/flags/german.png' width='25'> | German | ✓ | ✓ | ✓ | wiktionary, dict cc |
File: CONTRIBUTING.md (L7-11)
#### Feature requests
In general I'm happy to add most feature requests to the list, but I would also like to keep the scope of LinguaCafe similar to when it's started. Currently I would like to avoid adding a few things:
- Most large AI tools.
- Any built-in video features.
- Niche features that only a small percent of the users would use and difficult to maintain.