Develop a tokenizer for Premodern Slavic

Open pirolen opened this issue 2 years ago • 0 comments

Hi, I would be happy to contribute data and insights that would help develop a tokenizer for Medieval/Premodern Slavic. I am currently using tokconfig-rus on this data, and there is room for improvement: for example, the resulting sentences are either very short or very long. Please see below for some examples.

Some of the data characteristics:

the character set of this data is nonstandard, including the punctuation
sentence delimiters are typically nonstandard or nonexistent (· or ∙ often appear between words but are typically not true sentence delimiters)
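
To make the second point concrete, here is a minimal Python sketch of the behaviour I have in mind. It is not a tokconfig rule set, just an illustration. The names `WORD_DOTS`, `SENTENCE_END`, `tokenize` and `split_sentences` are hypothetical, and the set of "strong" delimiters is an assumption that would have to be tuned per manuscript.

```python
import re

# Assumption: these characters separate words and never end a sentence.
WORD_DOTS = "\u00b7\u2219"  # · (middle dot) and ∙ (bullet operator)

# Hypothetical placeholder for strong delimiters; adjust to the data.
SENTENCE_END = ".!?"

# A token is either a single word dot or a run of non-space, non-dot characters.
_TOKEN = re.compile(rf"[{WORD_DOTS}]|[^\s{WORD_DOTS}]+")


def tokenize(text: str) -> list[str]:
    """Split on whitespace and detach word dots as separate tokens."""
    return _TOKEN.findall(text)


def split_sentences(tokens: list[str], enders: str = SENTENCE_END) -> list[list[str]]:
    """Close a sentence only after a token ending in a strong delimiter."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok[-1] in enders:
            sentences.append(current)
            current = []
    if current:  # keep trailing material that has no final delimiter
        sentences.append(current)
    return sentences
```

With this, `split_sentences(tokenize(text))` keeps · and ∙ as ordinary tokens inside a sentence instead of cutting the text at each one. It does nothing for passages with no delimiters at all, which would still need a different strategy.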

There is no real gold standard of orthography for this period, and I suspect there is no strong gold-labeled data either. I looked into the Stanza and UDPipe sentence splitters, but they performed suboptimally on this data.

Would you be interested in creating a Premodern Slavic config? Or would you suggest another approach?

Jul 17 '23 13:07 pirolen