ocrd_segment

ocrd-segment-repair: handle case where points is empty

Open stefanCCS opened this issue 3 years ago • 6 comments

Version 0.1.20, ocrd/core 2.33.0

I have a PAGE file that does not have any real content, like this:

    <pc:Page imageFilename="OCR-D-IMG/0038_IMAGE000918_00001.tif" imageWidth="1420" imageHeight="2313" orientation="0.">
        <pc:AlternativeImage filename="OCR-D-BIN/OCR-D-BIN_0038_IMAGE000918_00001.IMG-BIN.png" comments=",binarized"/>
        <pc:TextRegion id="TR-1" orientation="0.">
            <pc:Coords points=""/>
        </pc:TextRegion>
    </pc:Page>

If I call ocrd-segment-extract-lines, I get an exception like this:

09:19:19.733 DEBUG ocrd.workspace.image_from_page - page 'P_0038_IMAGE000918_00001' has  orientation=0 skew=0.00
09:19:19.733 DEBUG ocrd.workspace.image_from_page - Using AlternativeImage 1 {'', 'binarized'} for page 'P_0038_IMAGE000918_00001'
09:19:19.734 DEBUG ocrd.workspace.download_file - download_file <OcrdFile fileGrp=OCR-D-BIN ID=OCR-D-BIN_0038_IMAGE000918_00001.IMG-BIN, mimetype=image/png, url=OCR-D-BIN/OCR-D-BIN_0038_IMAGE000918_00001.IMG-BIN.png, local_filename=OCR-D-BIN/OCR-D-BIN_0038_IMAGE000918_00001.IMG-BIN.png]/>  [_recursion_count=0]
09:19:19.735 DEBUG PIL.PngImagePlugin - STREAM b'IHDR' 16 13
09:19:19.735 DEBUG PIL.PngImagePlugin - STREAM b'IDAT' 41 65536
Traceback (most recent call last):
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/bin/ocrd-segment-extract-lines", line 8, in <module>
    sys.exit(ocrd_segment_extract_lines())
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd_segment/cli.py", line 65, in ocrd_segment_extract_lines
    return ocrd_cli_wrap_processor(ExtractLines, *args, **kwargs)
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd/decorators/__init__.py", line 88, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd/processor/helpers.py", line 88, in run_processor
    processor.process()
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd_segment/extract_lines.py", line 171, in process
    transparency=self.parameter['transparency'])
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd/workspace.py", line 829, in image_from_segment
    fill=fill, transparency=transparency)
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd/workspace.py", line 1012, in _crop
    segment_polygon = coordinates_of_segment(segment, parent_image, parent_coords)
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd_utils/image.py", line 136, in coordinates_of_segment
    polygon = np.array(polygon_from_points(segment.get_Coords().points))
  File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd_utils/image.py", line 148, in polygon_from_points
    polygon.append([float(x_y[0]), float(x_y[1])])
ValueError: could not convert string to float: 

My expectation would be that this PAGE file is simply ignored. Please clarify.
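
The failure can be reproduced without OCR-D. Below is a minimal sketch, assuming the points string is split on spaces and then on commas, as the last traceback frame suggests. `parse_points` is a hypothetical stand-in written for illustration, not the actual `ocrd_utils` code.

```python
def parse_points(points):
    """Hypothetical stand-in for the parsing step seen in the traceback."""
    polygon = []
    for pair in points.split(' '):
        x_y = pair.split(',')
        # For points="" the only "pair" is '', so float('') raises ValueError.
        polygon.append([float(x_y[0]), float(x_y[1])])
    return polygon


# A well-formed points string parses into a list of [x, y] pairs.
print(parse_points("0,0 10,0 10,5"))

# An empty points string fails in the same way as in the traceback.
try:
    parse_points("")
except ValueError as err:
    print("ValueError:", err)
```

The exact wording of the error message depends on the Python version, so it is not reproduced here.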

stefanCCS, Jun 08 '22 09:06

The problem is that you have a text region with empty Coords. This is not allowed in the PAGE-XML schema, so you should get

    Value '' is not facet-valid with respect to pattern '([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)' for type 'PointsType'.

as the error message.
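
As a rough illustration of what that pattern accepts, here is a minimal Python sketch. The regular expression is copied from the message above; `is_valid_points` is a hypothetical helper name, and `re.fullmatch` is used on the assumption that schema patterns must match the whole value.

```python
import re

# Pattern copied verbatim from the schema error message.
POINTS_PATTERN = re.compile(r'([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)')


def is_valid_points(points):
    """Return True if the whole string matches the PointsType pattern."""
    return POINTS_PATTERN.fullmatch(points) is not None


# The pattern requires at least two "x,y" pairs separated by single spaces.
for candidate in ["", "0,0", "0,0 10,0", "0,0 10,0 10,5 0,5"]:
    print(repr(candidate), is_valid_points(candidate))
```

Note that the pattern rejects not only the empty string but also a single point, since at least one space-separated pair must precede the final one.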

How was the empty PAGE generated? If it's by an OCR-D processor, we need to fix it.

kba, Jun 08 '22 11:06

I made this error myself ;-) I know that I need to correct something in my code. Still, although it occurs only once, I cannot go on with all the other regions. It would be nice if extract-lines were a bit more robust.

stefanCCS, Jun 08 '22 12:06

We've discussed whether OCR-D processors should be robust to invalid or unconventional PAGE in the past. IIRC the general consensus was that it would overstretch both the coding effort (much more boilerplate and things one can do wrong or forget to do) and the performance.

So the idea is to use ocrd-segment-repair selectively if you know you have problems in your input (or in some processor's output). Not sure if your particular case (empty @points) is already covered, though.
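
To find out whether an input has this kind of problem before running a processor, a standalone check along the following lines could help. This is only a sketch using the Python standard library, not part of ocrd-segment-repair; `find_invalid_coords`, the sample document and its namespace URI are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Pattern taken from the PointsType error message quoted earlier in the thread.
POINTS_PATTERN = re.compile(r'([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)')


def find_invalid_coords(page_xml):
    """Return the ids of elements whose direct Coords child has invalid points."""
    root = ET.fromstring(page_xml)
    invalid = []
    for element in root.iter():
        for child in element:
            # Compare local names only, so the namespace version does not matter.
            if child.tag.rsplit('}', 1)[-1] != 'Coords':
                continue
            if POINTS_PATTERN.fullmatch(child.get('points', '')) is None:
                invalid.append(element.get('id'))
    return invalid


# Hypothetical sample: one region with empty points, one with valid points.
SAMPLE = """<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="page.tif" imageWidth="1420" imageHeight="2313">
    <pc:TextRegion id="TR-1"><pc:Coords points=""/></pc:TextRegion>
    <pc:TextRegion id="TR-2"><pc:Coords points="0,0 10,0 10,5"/></pc:TextRegion>
  </pc:Page>
</pc:PcGts>"""

print(find_invalid_coords(SAMPLE))
```

The check only reports the offending segment ids; what to do with them (delete the region, fix the coordinates upstream) is left to the caller.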

bertsky, Jun 08 '22 12:06

Understood, of course. In this particular case I have already fixed the root cause. Therefore, no need to do something like ocrd-segment-repair. I will close this issue now.

stefanCCS, Jun 08 '22 12:06

> Therefore, no need to do something like ocrd-segment-repair.

Too bad – I was quite curious how it would handle that case, you know :-)

bertsky, Jun 08 '22 14:06

> I was quite curious how it would handle that case, you know :-)

You guessed it: it wouldn't work!

I created https://github.com/OCR-D/core/issues/877 for the core side, but we also have to handle that case differently in the repair code here. So let's keep this open, and I'll rename the issue.

bertsky, Jun 08 '22 14:06