python-ucto
This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This b...
Hyphens are a source of further problems in certain types of documents: e.g. tokens at the end of a paragraph that end with a hyphen are not valid tokens,...
I wonder if there is a straightforward way to take the sentences and their tokens from the tokenizer and use them to build a new FoLiA document. It is not clear to...
This may sound silly, but would it be possible to create a method that allows retrieving sentences from the tokenizer without whitespace between punctuation marks (i.e. untokenized)? E.g. maybe providing...
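To make the request concrete, here is a minimal sketch of the idea, assuming each token records whether whitespace followed it in the original input and whether it ends a sentence. The `Tok` class and `untokenized_sentences` function are hypothetical stand-ins written for illustration, not part of python-ucto; the sketch does not import the library at all.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Tok:
    """Hypothetical stand-in for a tokenizer token (not the python-ucto class)."""
    text: str
    nospace: bool = False        # True if no whitespace followed this token in the input
    endofsentence: bool = False  # True if this token closes a sentence


def untokenized_sentences(tokens: Iterable[Tok]) -> List[str]:
    """Rebuild sentences, inserting a space only where the input had one."""
    sentences: List[str] = []
    parts: List[str] = []
    for tok in tokens:
        parts.append(tok.text)
        if tok.endofsentence:
            # Sentence is complete: join the pieces and start a new one.
            sentences.append("".join(parts))
            parts = []
        elif not tok.nospace:
            parts.append(" ")
    if parts:
        # Trailing tokens without an end-of-sentence marker form a last sentence.
        sentences.append("".join(parts).rstrip())
    return sentences


tokens = [
    Tok("Hello", nospace=True), Tok(","), Tok("world", nospace=True),
    Tok("!", endofsentence=True),
    Tok("Is"), Tok("it"), Tok("fine", nospace=True), Tok("?", endofsentence=True),
]
result = untokenized_sentences(tokens)
print(result)
```

With real tokenizer output, the two flags would have to come from whatever the token objects actually expose; the mapping from library tokens to `Tok` is left open here.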
What is the best way to supply a list of known abbreviations to python-ucto and ucto in LaMachine?