Better sentence chunking algorithm to fix edge cases ("etc.", etc.)
Fixes https://github.com/remsky/Kokoro-FastAPI/issues/308.
In the end I didn't use ICU directly, since ICU support in Python can be a bit finicky as a dependency and can't be added with a simple `pip install`.
Instead, I re-implemented the Unicode sentence segmentation algorithm as a pure Python module, `unicode-segment`. It's a fully compliant implementation (passes all the tests in the TR29 test suite) ~~but kinda slow~~ and, after some optimization, decently performant for inputs of a reasonable size (typically sub-ms per 1k chars, YMMV depending on hardware etc.).
It's also a deterministic, rule-based algorithm, which means there are still certain corner cases it will get wrong:
UAX #29’s sentence boundary rules are a lot smarter than just treating every full stop as the end of a sentence. But they’re not perfect. In the string
`"Dr. John works at I.B.M., doesn't he?", asked Alice. "Yes," replied Charlie.`, the regex `\b{sb}.+?\b{sb}` finds 3 matches: `"Dr.`, `John works at I.B.M., doesn't he?", asked Alice.`, and `"Yes," replied Charlie.`. A full stop ends a sentence if it is followed by a capital letter. The question mark does not trigger a sentence break because of the comma that follows, even with the quote in between.
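To make the behaviour above concrete, here is a toy sketch of a few of the UAX #29 sentence rules (roughly SB6, SB7, SB8, SB8a and SB11). It is ASCII-only, it is not the `unicode-segment` implementation or its API, and the names `toy_sentences`, `TERMINATORS`, `CLOSERS`, `SPACES` and `CONTINUERS` are made up for illustration. The character sets are rough stand-ins for the real Unicode property classes.

```python
# Rough ASCII stand-ins for the UAX #29 classes (assumptions, not the real tables).
TERMINATORS = ".?!"   # SATerm; "." alone plays the role of ATerm
CLOSERS = "\"')]"     # Close
SPACES = " \t"        # Sp
CONTINUERS = ",:-"    # SContinue


def toy_sentences(text: str) -> list[str]:
    """Split text with a simplified subset of the UAX #29 sentence rules."""
    sentences, start, i, n = [], 0, 0, len(text)
    while i < n:
        ch = text[i]
        if ch not in TERMINATORS:
            i += 1
            continue
        j = i + 1
        if ch == "." and j < n:
            # SB6: no break between a full stop and a digit ("3.14").
            # SB7: no break in letter + "." + uppercase letter ("I.B.M.").
            after_letter = i > 0 and text[i - 1].isalpha()
            if text[j].isdigit() or (after_letter and text[j].isupper()):
                i = j
                continue
        # Skip Close* and then Sp* after the terminator.
        while j < n and text[j] in CLOSERS:
            j += 1
        while j < n and text[j] in SPACES:
            j += 1
        if j >= n:
            break  # the terminator ends the text; the tail is flushed below
        # SB8a: a continuation mark or another terminator suppresses the break.
        if text[j] in CONTINUERS or text[j] in TERMINATORS:
            i = j
            continue
        # SB8: a full stop followed (past non-letters) by lowercase is no break.
        if ch == ".":
            k = j
            while k < n and not text[k].isalpha() and text[k] not in TERMINATORS:
                k += 1
            if k < n and text[k].islower():
                i = j
                continue
        # SB11: otherwise break after the terminator, closers and spaces.
        sentences.append(text[start:j])
        start = i = j
    if start < n:
        sentences.append(text[start:])
    return sentences


example = '"Dr. John works at I.B.M., doesn\'t he?", asked Alice. "Yes," replied Charlie.'
for sentence in toy_sentences(example):
    print(repr(sentence))
```

On the example string this produces the same three segments as described above, including the unwanted break after `"Dr.`, because nothing in the rules knows that "Dr." is a title rather than the end of a sentence.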
Short of using a neural approach to sentence segmentation or applying some hacky, ad-hoc solution, there's not much that can be done about these.
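For reference, the kind of ad-hoc workaround meant here could look like the sketch below: a post-processing pass that glues a segment back onto the next one when it ends in a known abbreviation. This is not part of this PR; `merge_abbreviations` and the `ABBREVIATIONS` set are hypothetical and only there to show the trade-off.

```python
# Hypothetical, hand-maintained list; any real list would need to be much longer.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof."}


def merge_abbreviations(segments: list[str]) -> list[str]:
    """Merge a segment into its predecessor if the predecessor ends in an abbreviation."""
    merged: list[str] = []
    for segment in segments:
        words = merged[-1].split() if merged else []
        # Compare the last word, minus leading quotes/brackets, case-insensitively.
        last_word = words[-1].lstrip("\"'([").lower() if words else ""
        if last_word in ABBREVIATIONS:
            merged[-1] += segment
        else:
            merged.append(segment)
    return merged


segments = ['"Dr. ', 'John works at I.B.M., doesn\'t he?", asked Alice. ', '"Yes," replied Charlie.']
print(merge_abbreviations(segments))
```

It fixes the "Dr." case but breaks as soon as an abbreviation really does end a sentence ("Ask the Dr. He knows."), which is why I'd rather not go down that road.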