core save_image_file should set DPI for derived images

Currently, any information on image resolution provided in the original image (and made available via OcrdExif in Workspace.image_from_page) is ignored when saving derived images in the workspace (via Workspace.save_image_file). Due to PIL.Image format internals, the PNG then contains a setting of 72 DPI however. This might create problems for processors that look at the derived image files alone.

But this is hard to fix in core: the image passed to save_image_file could come from anywhere (and usually does not have an info['dpi']; even simple PIL.Image operations omit that in the result).

Realistically though, it will have been created some way from the source image file under the same pageId, and since rescaling is currently not permitted in the spec, one could assume the same DPI for all derived images.

Nov 11 '19 12:11 bertsky

and usually does not have an info['dpi']

Why can't we check whether such information is available? Copy it, if that's the case and delete the default value if not? We should definitely not write default values which we know to be wrong (I added the label bug).

Nov 11 '19 12:11 wrznr

and usually does not have an info['dpi']

Why can't we check whether such information is available?

We could, but as I said: usually it is not. Or should I say: almost always. Any non-mutating PIL.Image operation (even those that wouldn't affect the meta-data) omits info from the result. We cannot change that. And we cannot force processors to give us much more than what PIL.Image does.

We should definitely not write default values which we know to be wrong (I added the label bug).

Okay, but that's probably a bug in Pillow's save for the PNG format.

Nov 11 '19 12:11 bertsky

We should definitely not write default values which we know to be wrong (I added the label bug).

Okay, but that's probably a bug in Pillow's save for the PNG format.

Oh, just found that this appears to have been fixed in Pillow 6.2.1. Since we cannot likely get any better, I will close for now.

Nov 11 '19 13:11 bertsky

https://github.com/python-pillow/Pillow/blob/master/CHANGES.rst#621-2019-10-21

Should we update to 6.2.1 then? Do you have a sample for me to test? Thanks!

Nov 11 '19 13:11 kba

Sorry, I was imprecise: It could have been fixed in an earlier version already.

I saw the following with 5.4.1 (leading up to this issue):

python -c "import PIL.Image; PIL.Image.open('repo/data/assets/scribo-test/data/OCR-D-IMG/OCR-D-IMG-orig_tiff.tif').save('test.png')"
identify -verbose test.png | grep Resolution:
 Resolution: 72x72

However, this does not happen any more with 6.2.1 (where correctly no DPI is saved).

Nov 11 '19 13:11 bertsky

@wrznr just convinced me that we should indeed take action to ensure core and modules comply with the spec (which requires PPI information to be kept for derived images, cf. OCR-D/spec#137).

Jan 07 '20 14:01 bertsky

Not so sure about the time frame for this though. Since it involves patching all modules, I guess the final workshop is out of the question. Setting 3.0 to take out the pressure.

Jan 07 '20 14:01 bertsky

Since it involves patching all modules

If we just assume the derived image passed to save_image_file must come from the (cropped/deskewed/dewarped/binarized/denoised/...) original without rescaling somehow, then we'll only need to patch core (finding the original image with the same pageId and looking at its OcrdExif). Of course we could still allow processors to specify a DPI themselves (in case they did rescale, or just to speed up).

Sep 21 '20 14:09 bertsky