Robert Sachunsky
Good plan IMO. You have to get every MP on board though (or we need to support both ways).
> 1. It does not solve the problem with conflicting configurations.

I don't see that with _system dependencies_ (yet). We did have (and will have) conflicting requirements for Python packages,...
> * `mets:transformFile` is probably the most METS-compliant mechanism

I'm not so sure about that. It comes with an obligatory `@TRANSFORMTYPE` restricted to either `decompression` or `decryption`. We could ignore...
Agreed!

> * `page_element_unicode0`
> * `page_element_conf0`

Maybe these could go as member functions `get_Unicode0` and `get_conf0` into `GlyphType`, `WordType`, `TextLineType` and `TextRegionType`.

> * `page_get_reading_order`

I use this a...
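To make the member-function idea concrete, here is a minimal sketch of what `get_Unicode0` and `get_conf0` could look like. It does not use the real generated PAGE classes: `TextEquivType` and `WordType` below are simplified stand-ins, `FirstTextEquivMixin` is a hypothetical name, and the fallbacks (empty string, `None`) are assumptions rather than agreed behaviour.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TextEquivType:
    """Simplified stand-in for a PAGE TextEquiv (one text result)."""
    Unicode: str = ""
    conf: Optional[float] = None


class FirstTextEquivMixin:
    """Hypothetical mixin carrying the two proposed accessors."""

    def get_Unicode0(self) -> str:
        # Text of the first TextEquiv; empty string if there is none (assumed fallback).
        return self.TextEquiv[0].Unicode if self.TextEquiv else ""

    def get_conf0(self) -> Optional[float]:
        # Confidence of the first TextEquiv; None if there is none (assumed fallback).
        return self.TextEquiv[0].conf if self.TextEquiv else None


@dataclass
class WordType(FirstTextEquivMixin):
    """Simplified stand-in; the same mixin would serve glyphs, lines and regions."""
    id: str
    TextEquiv: List[TextEquivType] = field(default_factory=list)


word = WordType("w1", [TextEquivType("foo", 0.9), TextEquivType("f00", 0.4)])
first_text = word.get_Unicode0()
first_conf = word.get_conf0()
```

Keeping the accessors in one shared place would mean that `GlyphType`, `WordType`, `TextLineType` and `TextRegionType` all behave the same way when no text result is present.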
Okay, after consulting with @wrznr I now believe that, on the contrary, derived images **must** indeed keep their DPI metadata (whether or not these are to be trusted, or where...
Although we now have `shrink_polygons` (#162) as an alternative solution (on all hierarchy levels), `GetImage` may still be useful in some circumstances:
- if the hull polygon still overlaps neighbours...
So how about the following parameters for an opt-in (each having the segment images annotated as derived images):
- ocrd-tesserocr-segment and ocrd-tesserocr-recognize: array parameter `add_alternativeimages` with values `region`, `line`, `word`...
> 2\. modified by `GetImage(RIL.SYMBOL, 0, None)`:

Unfortunately, this **only** works with `None` as 3rd arg, which is equivalent to `GetBinaryImage(RIL.SYMBOL)`. One can pass the raw image there, but Tesseract...
Thanks @jbarth-ubhd for the detailed report! Well, this is an extreme case to begin with: a huge image (65 MP), images with lots of fine strokes. Tesseract itself has no...
BTW, running on the binarized image (as in your workflow), it takes even longer (77h), because the wolf binarization cannot cope with the black border (which it inverts), so even...