
Issue with German Umlauts using "PHP Rapid Automatic Keyword Extraction"

Open menturion opened this issue 1 year ago • 2 comments

Hi, many thanks for this amazing script!

I tested your "PHP Rapid Automatic Keyword Extraction" example (shown here: https://github.com/yooper/php-text-analysis/wiki/PHP-Rapid-Automatic-Keyword-Extraction) and noticed that there are issues with special characters such as German umlauts.

I tested it with the German stop word list ("stop-words_german_1_de.txt").

It listed [verst rkte] => 8 as a keyword/score (n-gram = 2), which should be [verstärkte] => 8. It seems to treat every word that contains a German umlaut as multiple words, replacing each umlaut with a space " ": in the example above, "verst" and "rkte" instead of "verstärkte".

Is there any way to fix this? I tried converting the input text to UTF-8, but that had no effect on the issue.

menturion, Dec 08 '24 16:12

I would need your help with fixing that issue.

yooper, Dec 08 '24 19:12

It can be fixed by changing the regex in the Lambda filter in ...

            $lambdaFunc = function($word){
                return  preg_replace('/[\x00-\x1F\x80-\xFF]/u', ' ', $word);
            };

from ...

'/[\x00-\x1F\x80-\xFF]/u'

to e.g. ...

'/[\x00-\x1F\x7F]/u'

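because it preserves language-specific characters of European languages, such as German umlauts (ä, ö, ü), Spanish characters (á, é, í, ñ), French characters (é, è, ê, ç), etc. With the /u flag, the range \x80-\xFF in the original pattern matches the code points U+0080 to U+00FF, which is exactly where these characters sit, so each of them was replaced by a space.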
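
For anyone who wants to check this outside the library, here is a minimal standalone sketch that compares the two patterns on the word from the report. It does not call php-text-analysis; the variable names are only for illustration, and the umlaut is written as a `\u{...}` escape (PHP 7+) so the result does not depend on the encoding of the source file.

```php
<?php
// Original filter pattern: with the /u flag, \x80-\xFF covers the code points
// U+0080..U+00FF, which includes ä (U+00E4), ö (U+00F6) and ü (U+00FC).
$original = '/[\x00-\x1F\x80-\xFF]/u';

// Adjusted pattern: only ASCII control characters and DEL (U+007F) are replaced.
// DEL is written as \x7F; PCRE reads at most two hex digits after \x, so a
// three-digit form like \x07F would mean \x07 followed by a literal "F".
$adjusted = '/[\x00-\x1F\x7F]/u';

// "verstärkte", with the umlaut given as an explicit code point escape.
$word = "verst\u{00E4}rkte";

$before = preg_replace($original, ' ', $word);
$after  = preg_replace($adjusted, ' ', $word);

var_dump($before); // the umlaut is replaced by a space: "verst rkte"
var_dump($after);  // the word is left unchanged
```

The adjusted pattern still replaces tabs, newlines and other control characters with a space, so that part of the original filter's behaviour stays the same.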

menturion, Dec 09 '24 06:12