pirolen issues

Results: 9 issues of pirolen

Validation of ucto output fails due to space character in FoLiA output from Piereling

After converting a document from docx to FoLiA using Piereling (@proycon: I did not find a command line option for such a conversion), the FoLiA document contains (hidden/small) space characters,...

Error upon submitting correction annotations

I got errors on two files upon submitting correction annotations, and those files no longer open; an nginx gateway timeout is signalled. I am attaching the docserver logs here...

bug

Develop a tokenizer for Premodern Slavic

Hi, I would be happy to contribute data and insights that would help develop a tokenizer for Medieval/Premodern Slavic. Currently I am using tokconfig-rus on this data, and there'd be...

Loading the confusables file

I wonder if this is the right way to load the confusables file:

```
m = build_variant_model(alphabet_file, weightsconfig=ws1)
m.read_confusablelist(confusables_file)
```

It would be brilliant to have an example of how...

question

Question: Splitting runons

Hi, I wonder if there is a way to have analiticcl generate variants that involve a whitespace: i.e. in case of runon errors, suggesting the split form. Suppose that 'holygrail'...

question
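To make the run-on question concrete, here is a minimal sketch of the behaviour being asked for, written in plain Python. It does not use analiticcl and says nothing about what analiticcl supports; the function name `split_runon` and the toy lexicon are assumptions made up for illustration. It simply proposes every two-part split of a word where both parts are found in a lexicon.

```python
def split_runon(word, lexicon):
    """Return all two-part splits of `word` whose parts are both in `lexicon`.

    Hypothetical helper for illustration only; not part of analiticcl.
    """
    candidates = []
    # Try every internal split position, left to right.
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            candidates.append(f"{left} {right}")
    return candidates


# Toy lexicon (assumed data, not taken from any real word list).
lexicon = {"holy", "grail", "knight"}
suggestions = split_runon("holygrail", lexicon)
print(suggestions)
```

A real variant model would also have to handle splits combined with spelling errors, which this sketch deliberately ignores.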

Accessing hyphenated tokens at the end of a paragraph

Hyphens are a source of some more problems in certain types of documents: e.g. tokens at the end of a paragraph that end with a hyphen are not valid tokens,...

question

Adding the tokenizer contents to a FoLiA doc

I wonder if there is a straightforward way to take the sentences and their tokens from the tokenizer and use them to build a new FoLiA doc. It is not clear to...

question

Question: possible to retrieve untokenized sentences?

May sound silly, but would it be possible to create a method that would allow retrieving sentences from the tokenizer without whitespace between punctuation marks (i.e. untokenized)? E.g. maybe providing...

enhancement
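As a rough illustration of what such a method could return, the sketch below rebuilds the original spacing of a sentence from tokens that carry a "no space follows" flag. The `Token` class, its `nospace` field, and the `detokenize` function are assumptions invented for this example; they are not the python-ucto API, and whether the tokenizer exposes an equivalent flag is not confirmed by the issue text.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical token record; not the python-ucto Token class."""
    text: str
    nospace: bool = False  # True when no space followed this token in the input


def detokenize(tokens):
    """Join tokens, inserting a space only where the input had one."""
    parts = []
    for tok in tokens:
        parts.append(tok.text)
        if not tok.nospace:
            parts.append(" ")
    # Drop the trailing space added after the final token.
    return "".join(parts).rstrip()


# Assumed example data: "Hello, world!" as a tokenizer might have split it.
tokens = [Token("Hello", nospace=True), Token(","), Token("world", nospace=True), Token("!")]
sentence = detokenize(tokens)
print(sentence)
```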

Question: Abbreviations list

What is the best way to supply a list of known abbreviations to python-ucto and ucto in LaMachine?

question