
text normalization documentation should mention the 500 word limit

Open f4hy opened this issue 3 years ago

Is your feature request related to a problem? Please describe.

normalize() has a 500 word limit per here: https://github.com/NVIDIA/NeMo/blob/main/nemo_text_processing/text_normalization/normalize.py#L255

As far as I can see, this limit is not mentioned in the documentation pages.

Describe the solution you'd like

The documentation should make this requirement clear up front, so that when developing a pipeline that uses NeMo text normalization one is not surprised by crashes when a large input comes through. Instead, developers could split the input into sentences beforehand, knowing that this is a requirement of the method.
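
A minimal sketch of such a workaround, assuming the `Normalizer` class and constructor arguments from the module linked above; the regex-based splitter and the `normalize_long_text` helper are hypothetical illustrations, not part of the NeMo API:

```python
import re

from nemo_text_processing.text_normalization.normalize import Normalizer

# Build the English normalizer once; constructing the WFST grammars is expensive.
normalizer = Normalizer(input_case="cased", lang="en")


def normalize_long_text(text: str) -> str:
    # Naive split on sentence-final punctuation followed by whitespace.
    # A real pipeline should use a proper sentence segmenter, since this
    # regex splits incorrectly after abbreviations such as "Dr." or "e.g.".
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Normalize each sentence separately to stay under the per-call word limit.
    return " ".join(normalizer.normalize(s, verbose=False) for s in sentences if s)
```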

Bonus: explain why there is a 500 word limit.

Describe alternatives you've considered

It seems like a text normalizer model would need to know where sentence boundaries are in order to operate, so the normalizer could split the input into sentences itself and process them one by one, removing this limitation.

f4hy avatar Jul 22 '22 17:07 f4hy

Thanks @f4hy! We will fix this and update the doc.

yzhang123 avatar Aug 10 '22 16:08 yzhang123

https://github.com/NVIDIA/NeMo/pull/4721

yzhang123 avatar Aug 10 '22 20:08 yzhang123

@f4hy the WFST currently cannot handle longer inputs; the composition will fail. We would like to keep the doc simple for the user, so we added a warning about the existing limit.

The reason we prefer the user to split the text into sentences is that splitting is non-trivial if we do not know what text format the user uses. Often the user has more insight into how the sentences are formed and where they end. You are right that we could leverage our WFST graph to detect sentence boundaries, but if the input is too long we will have the same problem as above, since composition will fail. If you look at common tools like Scipy or Moses, they provide simple regex rules such as splitting on punctuation, but these fail even for simple cases, especially when there are many semiotic tokens: e.g. "2.5" can mean "two point five" or "two. five" depending on context.
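
As a quick standalone illustration of that point (a hypothetical snippet, not NeMo code), a naive punctuation-based split mangles semiotic tokens:

```python
import re

text = "The package weighs 2.5 kg. Delivery costs $3.50 per km."
# Splitting blindly on every period cuts "2.5" and "$3.50" in half.
print(re.split(r"\.", text))
# ['The package weighs 2', '5 kg', ' Delivery costs $3', '50 per km', '']
```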

yzhang123 avatar Aug 10 '22 20:08 yzhang123