python-ucto

This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This b...

Results 5 python-ucto issues

Hyphens are a source of further problems in certain types of documents: e.g. tokens at the end of a paragraph that end with a hyphen are not valid tokens,...

question

I wonder if there is a straightforward way to take the sentences and their tokens from the tokenizer and add them to a new FoLiA document. It is not clear to...

question

currently not accessible from Python.

enhancement

May sound silly, but would it be possible to create a method that would allow retrieving sentences from the tokenizer without whitespace between punctuation marks (i.e. untokenized)? E.g. maybe providing...

enhancement
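
As a rough illustration of what the request above is asking for, here is a minimal sketch in plain Python that rebuilds untokenized sentences from a token stream. It does not call python-ucto. The `(text, nospace, end_of_sentence)` tuple layout and the `detokenize` helper are hypothetical, and it is an assumption that the binding's per-token information (such as a "no space follows" flag and an end-of-sentence flag) can be mapped onto these tuples.

```python
from typing import Iterable, List, Tuple

# Hypothetical token layout (an assumption, not python-ucto's own type):
#   text            - the token string
#   nospace         - True if no whitespace followed the token in the input
#   end_of_sentence - True if the token closes a sentence
Token = Tuple[str, bool, bool]


def detokenize(tokens: Iterable[Token]) -> List[str]:
    """Rebuild sentences, inserting a space only where the input had one."""
    sentences: List[str] = []
    current: List[str] = []
    for text, nospace, end_of_sentence in tokens:
        current.append(text)
        if end_of_sentence:
            # Close the sentence; whitespace after it is not part of it.
            sentences.append("".join(current))
            current = []
        elif not nospace:
            current.append(" ")
    if current:
        # Trailing tokens without an end-of-sentence marker form a last sentence.
        sentences.append("".join(current).rstrip())
    return sentences


example = [
    ("Hello", True, False),
    (",", False, False),
    ("world", True, False),
    ("!", False, True),
    ("Bye", True, False),
    (".", False, True),
]
print(detokenize(example))
```

With real tokenizer output, each tuple would be filled from whatever flags the binding exposes for a token; that mapping is left open here.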

What is the best way to supply a list of known abbreviations to python-ucto and ucto in LaMachine?

question