Better sentence chunking algorithm to fix edge cases ("etc.", etc.)

Open lionel-rowe opened this issue 2 months ago • 0 comments

Fixes https://github.com/remsky/Kokoro-FastAPI/issues/308.

In the end I didn't use ICU directly, since ICU support in Python can be a bit finicky as a dependency and can't be added with a simple pip install.

Instead, I re-implemented the Unicode sentence segmentation algorithm as a pure Python module, unicode-segment. It's a fully compliant implementation (passes all the tests in the TR29 test suite) ~~but kinda slow~~ and, after some optimization, decently performant for inputs of a reasonable size (typically sub-ms per 1k chars, YMMV depending on hardware etc).

It's also a deterministic algorithm, which means there are still certain corner cases it will get wrong:

UAX #29’s sentence boundary rules are a lot smarter than just treating every full stop as the end of a sentence. But they’re not perfect. In the string `"Dr. John works at I.B.M., doesn't he?", asked Alice. "Yes," replied Charlie.`, the regex `\b{sb}.+?\b{sb}` finds 3 matches: `"Dr. `, `John works at I.B.M., doesn't he?", asked Alice. `, and `"Yes," replied Charlie.`. A full stop ends a sentence if it is followed by a capital letter. The question mark does not trigger a sentence break because of the comma that follows, even with the quote in between.
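
To make the behaviour in that example concrete, here is a toy sketch of the handful of rules involved. It is not the `unicode-segment` implementation and not a compliant UAX #29 segmenter: the function name `toy_sentences`, the ASCII-only character sets, and the helper `_lowercase_follows` are all made up for illustration, and most of the real rules (numbers, paragraph separators, extenders, and so on) are left out.

```python
import re

# A terminator, then optional closing punctuation, then optional spaces.
# ASCII-only stand-ins for the UAX #29 character classes (an assumption).
_CANDIDATE = re.compile(r"""([.?!])["')\]]*[ ]*""")
_NO_BREAK_BEFORE = ",;:.?!"  # continuation punctuation or another terminator


def _lowercase_follows(text: str, pos: int) -> bool:
    """True if the next letter at or after pos is lowercase."""
    for ch in text[pos:]:
        if ch.isalpha():
            return ch.islower()
        if ch in ".?!\n":
            return False
    return False


def toy_sentences(text: str) -> list[str]:
    sentences = []
    start = 0
    for m in _CANDIDATE.finditer(text):
        end = m.end()
        if end == len(text):
            break  # the trailing sentence is flushed after the loop
        nxt = text[end]
        # "he?", asked: the comma after the closing quote suppresses the break.
        if nxt in _NO_BREAK_BEFORE:
            continue
        if m.group(1) == ".":
            term = m.start(1)
            glued = end == term + 1  # nothing between the full stop and nxt
            # "I.B.M.": letter, full stop, uppercase letter with no space.
            if glued and term > 0 and text[term - 1].isalpha() and nxt.isupper():
                continue
            # A full stop followed by a lowercase word does not end a sentence.
            if _lowercase_follows(text, end):
                continue
        sentences.append(text[start:end])
        start = end
    if start < len(text):
        sentences.append(text[start:])
    return sentences


example = '"Dr. John works at I.B.M., doesn\'t he?", asked Alice. "Yes," replied Charlie.'
for sentence in toy_sentences(example):
    print(repr(sentence))
```

On the example string this yields the same three chunks as above, including the unwanted break after `"Dr. `: nothing in these deterministic rules can tell an abbreviation followed by a capitalized name apart from a real sentence end.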

Without using a neural-based approach for sentence segmenting or applying some hacky, ad-hoc solution, there's not much that can be done about these.

Nov 14 '25 15:11 lionel-rowe