Some bugs in de, es and fr
Hi!
I am using the latest NeMo release (1.1.0) and found the following bugs.
Bugs
German (de):
- text: Here is brettspielversand.de.
  norm_text: Here is b r e t t s p i e l v e r s a n d punkt de.
  expected output: Here is brettspielversand punkt de.
- text: Sinnesbereichen.in allen Sinnen.
  norm_text: S i n n e s b e r e i c h e n punkt in allen Sinnen.
  expected output: Sinnesbereichen punkt in allen Sinnen.
- text: Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.
  norm_text: Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.
  expected output: Hier zoome ich auf die Läsion. Wir befinden uns also auf der Zwei-D-Mammographie. (not sure)
For German normalization, I use the following code:
from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(
    input_case="cased",
    lang="de",
    deterministic=True,
)
norm_text = normalizer.normalize(text, punct_post_process=True)
Spanish (es):
- text: El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico.
  norm_text: El texto de quincuagésimo primero Qin en este libro ahora está disponible en forma de libro electrónico.
  expected output: El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico. (not sure)
For Spanish normalization, I use the following code:
from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(
    input_case="cased",
    lang="es",
    deterministic=True,
)
norm_text = normalizer.normalize(text, punct_post_process=True)
French (fr):
- text: Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.
  norm_text: Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.
  expected output: Les Tech Clippings seront diffusés en exclusivité sur la chaîne YouTube DIGITIMES tous les vendredis à 20 heures. (not sure)
For French normalization, I use the following code:
from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(
    input_case="cased",
    lang="fr",
    deterministic=True,
)
norm_text = normalizer.normalize(text, punct_post_process=True)
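
In case it helps with triage, here is the same setup as a single script over all of the sentences above. It is just the three snippets combined into one loop; on 1.1.0 the printed results are the norm_text values listed above.

from nemo_text_processing.text_normalization.normalize import Normalizer

# Sentences copied verbatim from the examples above, grouped by language.
examples = {
    "de": [
        "Here is brettspielversand.de.",
        "Sinnesbereichen.in allen Sinnen.",
        "Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.",
    ],
    "es": [
        "El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico.",
    ],
    "fr": [
        "Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.",
    ],
}

for lang, sentences in examples.items():
    # One Normalizer per language, with the same settings as above.
    normalizer = Normalizer(input_case="cased", lang=lang, deterministic=True)
    for text in sentences:
        print(f"[{lang}] {normalizer.normalize(text, punct_post_process=True)}")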
@ekmb
German:
1. Will address.
2. The model expects canonical punctuation, which in this case requires whitespace following a sentence-final period. In its absence, the string is likely to be transduced as a URL (hence the spacing between individual characters, see above). Since the input string contains non-standard punctuation, the output represents expected behavior. A pre-processing sketch illustrating this follows below.
3. This is a known issue. Will address.
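
For illustration, a naive pre-processing pass that restores the missing whitespace before normalization might look like the sketch below. This is only an assumption about how one could work around the non-standard punctuation, not part of NeMo, and it also splits genuine domains such as brettspielversand.de, so it is only safe for input known to contain no URLs.

import re

def restore_sentence_spacing(text: str) -> str:
    # Insert a space after a period that is directly followed by a letter,
    # so "Sinnesbereichen.in" becomes "Sinnesbereichen. in" and is no longer
    # treated as a URL. Caution: real domains are split the same way.
    return re.sub(r"\.(?=[A-Za-zÄÖÜäöüß])", ". ", text)

print(restore_sentence_spacing("Sinnesbereichen.in allen Sinnen."))
# -> Sinnesbereichen. in allen Sinnen.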
Spanish:
- This was addressed with PR #224, which didn't make it to the current release.
French:
- The MEASURE semiotic class is not implemented for French TN (it is present in ITN). Will address.
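
Until MEASURE support for French TN lands, one possible interim workaround (a suggestion of mine, not part of NeMo) is to pre-expand the hour pattern before calling the normalizer, for example:

import re

def expand_french_hours(text: str) -> str:
    # Rewrite "20h" / "20 h" as "20 heures" (and "1h" as "1 heure") so the
    # deterministic fr normalizer does not pass the token through unchanged.
    def repl(match):
        hour = int(match.group(1))
        return match.group(1) + (" heure" if hour == 1 else " heures")
    return re.sub(r"\b(\d{1,2})\s?h\b", repl, text)

print(expand_french_hours("tous les vendredis à 20h."))
# -> tous les vendredis à 20 heures.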
A fix for German (1.) and (2.) has been implemented.
A fix for German (3.) has just been implemented.
All bugs in this issue have been addressed.