ocrd-segment-repair: handle case where points is empty
Version 0.1.20, ocrd/core 2.33.0
I have a PAGE file that has no real content, like this:
<pc:Page imageFilename="OCR-D-IMG/0038_IMAGE000918_00001.tif" imageWidth="1420" imageHeight="2313" orientation="0.">
<pc:AlternativeImage filename="OCR-D-BIN/OCR-D-BIN_0038_IMAGE000918_00001.IMG-BIN.png" comments=",binarized"/>
<pc:TextRegion id="TR-1" orientation="0.">
<pc:Coords points=""/>
</pc:TextRegion>
</pc:Page>
If I call ocrd-segment-extract-lines, I get an exception like this:
09:19:19.733 DEBUG ocrd.workspace.image_from_page - page 'P_0038_IMAGE000918_00001' has orientation=0 skew=0.00
09:19:19.733 DEBUG ocrd.workspace.image_from_page - Using AlternativeImage 1 {'', 'binarized'} for page 'P_0038_IMAGE000918_00001'
09:19:19.734 DEBUG ocrd.workspace.download_file - download_file <OcrdFile fileGrp=OCR-D-BIN ID=OCR-D-BIN_0038_IMAGE000918_00001.IMG-BIN, mimetype=image/png, url=OCR-D-BIN/OCR-D-BIN_0038_IMAGE000918_00001.IMG-BIN.png, local_filename=OCR-D-BIN/OCR-D-BIN_0038_IMAGE000918_00001.IMG-BIN.png]/> [_recursion_count=0]
09:19:19.735 DEBUG PIL.PngImagePlugin - STREAM b'IHDR' 16 13
09:19:19.735 DEBUG PIL.PngImagePlugin - STREAM b'IDAT' 41 65536
Traceback (most recent call last):
File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/bin/ocrd-segment-extract-lines", line 8, in <module>
sys.exit(ocrd_segment_extract_lines())
File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd_segment/cli.py", line 65, in ocrd_segment_extract_lines
return ocrd_cli_wrap_processor(ExtractLines, *args, **kwargs)
File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd/decorators/__init__.py", line 88, in ocrd_cli_wrap_processor
run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd/processor/helpers.py", line 88, in run_processor
processor.process()
File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd_segment/extract_lines.py", line 171, in process
transparency=self.parameter['transparency'])
File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd/workspace.py", line 829, in image_from_segment
fill=fill, transparency=transparency)
File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd/workspace.py", line 1012, in _crop
segment_polygon = coordinates_of_segment(segment, parent_image, parent_coords)
File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd_utils/image.py", line 136, in coordinates_of_segment
polygon = np.array(polygon_from_points(segment.get_Coords().points))
File "/home/ocrdadmin/ocrd_all/venv/sub-venv/headless-tf1/lib/python3.6/site-packages/ocrd_utils/image.py", line 148, in polygon_from_points
polygon.append([float(x_y[0]), float(x_y[1])])
ValueError: could not convert string to float:
My expectation would be that this PAGE file is simply ignored. Please clarify.
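For reference, the ValueError can be reproduced outside of OCR-D. The sketch below is not the ocrd_utils code; `parse_points` is a hypothetical stand-in that assumes the parser splits the string on spaces and commas, as the last traceback frame suggests. `parse_points_or_none` is likewise a hypothetical wrapper showing what "ignore the segment" could look like on the caller's side.

```python
def parse_points(points):
    """Hypothetical strict parser for a 'x1,y1 x2,y2 ...' points string."""
    polygon = []
    for pair in points.split(" "):
        x_y = pair.split(",")
        # For points == "", pair is "" and float("") raises ValueError.
        polygon.append([float(x_y[0]), float(x_y[1])])
    return polygon


def parse_points_or_none(points):
    """Return the parsed polygon, or None if the string cannot be parsed."""
    try:
        return parse_points(points)
    except (ValueError, IndexError):
        return None


if __name__ == "__main__":
    print(parse_points("0,0 10,0 10,10"))
    print(parse_points_or_none(""))
```

An empty string still yields one (empty) pair after splitting, which is why the failure shows up at the float conversion rather than as an empty polygon.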
The problem is that you have a text region with empty Coords. This is not allowed in the PAGE-XML schema, so validating the file should give you
Value '' is not facet-valid with respect to pattern '([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)' for type 'PointsType'.
as the error message.
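To make the constraint concrete, here is a minimal sketch that checks a points string against the pattern quoted in that error message. The function name `is_valid_points` is made up for illustration and is not part of ocrd or ocrd_utils; only the regular expression is taken from the message above.

```python
import re

# Pattern copied from the schema error message for type 'PointsType'.
POINTS_PATTERN = re.compile(r"([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)")


def is_valid_points(points):
    """Return True if the whole string matches the PointsType pattern."""
    return POINTS_PATTERN.fullmatch(points) is not None


if __name__ == "__main__":
    for candidate in ["", "1,2", "0,0 10,0 10,10"]:
        print(repr(candidate), is_valid_points(candidate))
```

Note that the pattern requires at least two points, so a single pair such as "1,2" is rejected as well as the empty string.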
How was the empty PAGE generated? If it's by an OCR-D processor, we need to fix it.
I made this error myself ;-) I know that I need to correct something in my code. Still, although it only occurs once, I cannot go on with all the other regions. It would just be nice if extract-lines were a bit more robust.
We've discussed whether OCR-D processors should be robust to invalid or unconventional PAGE in the past. IIRC the general consensus was that it would overstretch both the coding effort (much more boilerplate and things one can do wrong or forget to do) and the performance.
So the idea is to selectively use ocrd-segment-repair if you know you have problems in your input (or after some processor's output). Not sure if your particular case (missing @points) is already covered though.
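If you only want to find out whether a given PAGE file is affected before running a processor on it, a small standalone check may be enough. The following is a rough sketch using only the Python standard library, not an OCR-D API: the function name `segments_with_empty_coords` and the sample document are made up for illustration, and elements are matched by local name so that the check does not depend on a particular PAGE namespace version.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def segments_with_empty_coords(page_xml):
    """Return the @id of every element whose Coords child has empty @points."""
    root = ET.fromstring(page_xml)
    bad_ids = []
    for parent in root.iter():
        for child in parent:
            if local_name(child.tag) != "Coords":
                continue
            if not child.get("points", "").strip():
                bad_ids.append(parent.get("id"))
    return bad_ids


# Made-up sample; the namespace URI is an assumption and is not checked.
SAMPLE = """<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="example.tif" imageWidth="1420" imageHeight="2313">
    <pc:TextRegion id="TR-1"><pc:Coords points=""/></pc:TextRegion>
    <pc:TextRegion id="TR-2"><pc:Coords points="0,0 10,0 10,10"/></pc:TextRegion>
  </pc:Page>
</pc:PcGts>"""

if __name__ == "__main__":
    print(segments_with_empty_coords(SAMPLE))
```

This only reports the offending segments. Whether to drop them or to fix the coordinates is a separate decision.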
Understood, of course. In this special case I have already fixed the root cause. Therefore, no need to do something like ocrd-segment-repair.
I will close this issue here, now.
> Therefore, no need to do something like ocrd-segment-repair.
Too bad – I was quite curious how it would handle that case, you know :-)
> I was quite curious how it would handle that case, you know :-)
You guessed it: it wouldn't work!
I created https://github.com/OCR-D/core/issues/877 for the core side, but we also have to handle that case differently in the repair code here. So let's keep this open, and I'll rename the issue.