Some bugs in de, es and fr
Hi!
I am using the latest NeMo release (1.1.0) and found the following bugs.
Bugs
German (de):
- text: Here is brettspielversand.de.
  norm_text: Here is b r e t t s p i e l v e r s a n d punkt de.
  expected output: Here is brettspielversand punkt de.
- text: Sinnesbereichen.in allen Sinnen.
  norm_text: S i n n e s b e r e i c h e n punkt in allen Sinnen.
  expected output: Sinnesbereichen punkt in allen Sinnen.
- text: Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.
  norm_text: Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.
  expected output: Hier zoome ich auf die Läsion. Wir befinden uns also auf der Zwei-D-Mammographie. (not sure)
For German normalization, I use the following code:
from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(
    input_case="cased",
    lang="de",
    deterministic=True,
)
norm_text = normalizer.normalize(text, punct_post_process=True)
Spanish (es):
- text: El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico.
  norm_text: El texto de quincuagésimo primero Qin en este libro ahora está disponible en forma de libro electrónico.
  expected output: El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico. (not sure)
For Spanish normalization, I use the following code:
from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(
    input_case="cased",
    lang="es",
    deterministic=True,
)
norm_text = normalizer.normalize(text, punct_post_process=True)
French (fr):
- text: Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.
  norm_text: Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.
  expected output: Les Tech Clippings seront diffusés en exclusivité sur la chaîne YouTube DIGITIMES tous les vendredis à 20 heures. (not sure)
For French normalization, I use the following code:
from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(
    input_case="cased",
    lang="fr",
    deterministic=True,
)
norm_text = normalizer.normalize(text, punct_post_process=True)
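
In case it helps with triage, here is the same setup as a single script over all of the sentences above. It is just the three snippets combined into one loop; on 1.1.0 the printed results are the norm_text values listed above.

from nemo_text_processing.text_normalization.normalize import Normalizer

# Sentences copied verbatim from the examples above, grouped by language.
examples = {
    "de": [
        "Here is brettspielversand.de.",
        "Sinnesbereichen.in allen Sinnen.",
        "Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.",
    ],
    "es": [
        "El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico.",
    ],
    "fr": [
        "Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.",
    ],
}

for lang, sentences in examples.items():
    # One Normalizer per language, with the same settings as above.
    normalizer = Normalizer(input_case="cased", lang=lang, deterministic=True)
    for text in sentences:
        print(f"[{lang}] {normalizer.normalize(text, punct_post_process=True)}")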
@ekmb
German:
1. Will address.
2. The model expects canonical punctuation, which in this case requires whitespace following a sentence-final period. In its absence, the string is likely to be transduced as a URL (hence the spacing between individual characters, see above). Since the input string contains non-standard punctuation, the output represents expected behavior. A pre-processing sketch illustrating this follows below.
3. This is a known issue. Will address.
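
For illustration, a naive pre-processing pass that restores the missing whitespace before normalization might look like the sketch below. This is only an assumption about how one could work around the non-standard punctuation, not part of NeMo, and it also splits genuine domains such as brettspielversand.de, so it is only safe for input known to contain no URLs.

import re

def restore_sentence_spacing(text: str) -> str:
    # Insert a space after a period that is directly followed by a letter,
    # so "Sinnesbereichen.in" becomes "Sinnesbereichen. in" and is no longer
    # treated as a URL. Caution: real domains are split the same way.
    return re.sub(r"\.(?=[A-Za-zÄÖÜäöüß])", ". ", text)

print(restore_sentence_spacing("Sinnesbereichen.in allen Sinnen."))
# -> Sinnesbereichen. in allen Sinnen.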
Spanish:
- This was addressed with PR #224, which didn't make it to the current release.
French:
- The MEASURE semiotic class is not implemented for French TN (it is present in ITN). Will address.
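
Until MEASURE support for French TN lands, one possible interim workaround (a suggestion of mine, not part of NeMo) is to pre-expand the hour pattern before calling the normalizer, for example:

import re

def expand_french_hours(text: str) -> str:
    # Rewrite "20h" / "20 h" as "20 heures" (and "1h" as "1 heure") so the
    # deterministic fr normalizer does not pass the token through unchanged.
    def repl(match):
        hour = int(match.group(1))
        return match.group(1) + (" heure" if hour == 1 else " heures")
    return re.sub(r"\b(\d{1,2})\s?h\b", repl, text)

print(expand_french_hours("tous les vendredis à 20h."))
# -> tous les vendredis à 20 heures.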
A fix for German (1.) and (2.) has been implemented.
A fix for German (3.) has just been implemented.
All bugs in this issue have been addressed.