Kay-Michael Würzner issues

Results: 11 issues of Kay-Michael Würzner

Add continuous integration

... based on the test data. Maybe even with the new GitHub Actions.

enhancement

Split regions if they are "split" by a separator

![image](https://user-images.githubusercontent.com/26707219/72509648-41258480-3848-11ea-9a1b-0c485ba9c01b.png)

enhancement

question

Remove text regions which contain no lines...

(... if line detection took place)

enhancement

Block segmentation almost always produces empty pages

I am running the following workflow on https://digital.slub-dresden.de/werkansicht/dlf/87237/1/ (with https://digital.slub-dresden.de/data/kitodo/adrefudio_20253082Z_1907/adrefudio_20253082Z_1907_mets.xml):

1. Cropping (`ocrd-anybaseocr-crop`)
2. Binarization (`ocrd-anybaseocr-binarize`)
3. Segmentation (`ocrd-anybaseocr-block-segmentation`)

For most pages, the block segmentation finds only a few and very...

Question: Is "n-best" tagging possible with CRFSuite?

The [Wapiti](https://wapiti.limsi.fr/) CRF toolkit has a neat feature called *N-best Viterbi output* which returns the *n*-best label sequences for an input sequence. Is there a similar functionality in `crfsuite`? Thanks...
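For readers unfamiliar with the term, the following is a minimal sketch of what n-best Viterbi decoding computes for a linear-chain model. It is a generic illustration, not the API of `crfsuite` or Wapiti: the function name `n_best_viterbi` and the dictionary-based score layout are assumptions made for this example, and the scores are treated as additive (for instance log-potentials).

```python
import heapq


def n_best_viterbi(emissions, transitions, n):
    """Return up to n (score, path) pairs, best first.

    emissions:   list of dicts, emissions[t][label] = score at position t
    transitions: dict, transitions[(prev_label, cur_label)] = score
    Both are hypothetical inputs for this sketch; scores are summed along a path.
    """
    if not emissions:
        return []
    labels = list(emissions[0])
    # For every label, keep up to n partial hypotheses that end in that label.
    beams = {y: [(emissions[0][y], (y,))] for y in labels}
    for scores in emissions[1:]:
        new_beams = {}
        for cur in labels:
            # Extend every surviving hypothesis by the current label.
            candidates = (
                (score + transitions[(prev, cur)] + scores[cur], path + (cur,))
                for prev in labels
                for score, path in beams[prev]
            )
            new_beams[cur] = heapq.nlargest(n, candidates)
        beams = new_beams
    # Merge the per-label lists and keep the overall n best.
    final = (hyp for hyps in beams.values() for hyp in hyps)
    return heapq.nlargest(n, final)
```

Keeping the n best hypotheses per label at every position is enough to recover the exact n best complete sequences; with `n=1` the procedure reduces to ordinary Viterbi decoding.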

[Do not merge] Implement a poor man's solution for extracting

gender information from German Wiktionary. Not very smart but I do not know any Haskell. For my purposes, it works and may serve as a starting point for fixing https://github.com/LuminosoInsight/wikiparsec/issues/4

Feature request: Add gender to the information extracted from the German Wiktionary dump

Each article title for nouns has information on the gender of the corresponding noun. It would be very helpful to have it extracted as well.

Feature request: Add IPA and hyphenation to the information extracted from the German Wiktionary dump

Many thanks for your wonderful tool! It would be a great addition to have the hyphenation patterns and the IPA representation in the set of extracted information.

Licensing of the repo/models/data

Many thanks for your great efforts! I'd like to train a Tesseract model from your data via https://github.com/tesseract-ocr/tesstrain and contribute it to https://github.com/tesseract-ocr/tessdata_contrib. However, I am not sure whether this...

Add a parameter for selection of text level (PAGE XML)

Currently, `dinglehopper` extracts text from PAGE XML files on the region level (https://github.com/qurator-spk/dinglehopper/blob/master/qurator/dinglehopper/ocr_files.py#L50). It would be wonderful if you could add a level-of-operation parameter to allow for extraction from line...

enhancement
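The level-of-operation idea in the PAGE XML request above could look roughly like the sketch below. It is not `dinglehopper`'s code: the function `extract_text` and its `level` parameter are hypothetical names chosen for this example. It relies only on the PAGE XML convention that `TextRegion`, `TextLine`, and `Word` elements can each carry their own `TextEquiv/Unicode` child.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a level name to the PAGE XML element that holds the text.
LEVEL_TAGS = {"region": "TextRegion", "line": "TextLine", "word": "Word"}


def extract_text(page_xml, level="region"):
    """Return the Unicode strings found at the requested hierarchy level."""
    root = ET.fromstring(page_xml)
    # Read the namespace from the root tag so any PAGE schema version works.
    ns = root.tag[1:root.tag.index("}")] if root.tag.startswith("{") else ""

    def q(name):
        return f"{{{ns}}}{name}" if ns else name

    texts = []
    for elem in root.iter(q(LEVEL_TAGS[level])):
        # Only the element's own TextEquiv is used, not those of its children.
        unicode_elem = elem.find(f"{q('TextEquiv')}/{q('Unicode')}")
        if unicode_elem is not None and unicode_elem.text:
            texts.append(unicode_elem.text)
    return texts
```

Calling `extract_text(xml_string, level="line")` would then return one string per `TextLine`, while the default returns one string per `TextRegion`.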