pdf_renderer: use pypdfium2 rather than deprecated pypdfium

Open mara004 opened this issue 4 years ago • 10 comments

Hello,

I'm a former maintainer of pypdfium and now co-author of pypdfium2. I noticed that this project uses pypdfium to rasterise PDFs, but pypdfium is now deprecated and has been succeeded by pypdfium2. We have applied several modernisations, such as platform-specific wheel builds, automatic pdfium init/deinit calls, and a small, pythonic support model API to facilitate rendering PDFs. pypdfium2 will be updated on a regular basis, while no further releases are planned for pypdfium.

This patch modifies utils/pdf_renderer.py to use pypdfium2, with the new support model API. If you wish to keep using the raw PDFium API, this is still possible, too.

https://github.com/pypdfium2-team/pypdfium2 https://pypi.org/project/pypdfium2/

mara004 avatar Dec 03 '21 15:12 mara004

CLA assistant check
All CLA requirements met.

ghost avatar Dec 03 '21 15:12 ghost

I just updated this PR to include the newer preprocessor/pdf_renderer.py, but you should really change your API to de-duplicate the code and load the document only once. It doesn't make sense at all to re-load the document in a separate method just to get page count. You may also want to take a look at pypdfium2's documentation; it provides a multi-page renderer with concurrency that may be more suitable for your use case.
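
A minimal sketch of the load-once idea described above. It is not the code from this PR. `LoadedDocument` and `FakeBackend` are hypothetical names, and the backend is assumed to support `len()` and integer indexing; the real pdf_renderer.py API may look different.

```python
from typing import Any, Callable, List


class LoadedDocument:
    """Open a document once and reuse the same handle for every query."""

    def __init__(self, path: str, opener: Callable[[str], Any]):
        # `opener` stands in for the real PDF backend (an assumption of this
        # sketch). It is called exactly once, here.
        self._doc = opener(path)

    @property
    def page_count(self) -> int:
        # Page count comes from the already-open handle, so no second load.
        return len(self._doc)

    def get_page(self, index: int) -> Any:
        if not 0 <= index < self.page_count:
            raise IndexError(f"page index {index} out of range")
        return self._doc[index]

    def get_all_pages(self) -> List[Any]:
        return [self.get_page(i) for i in range(self.page_count)]


class FakeBackend:
    """Hypothetical in-memory backend, used only to demonstrate the pattern."""

    open_calls = 0  # counts how often a document was opened

    def __init__(self, path: str):
        FakeBackend.open_calls += 1
        self._pages = [f"{path}#page{i}" for i in range(3)]

    def __len__(self) -> int:
        return len(self._pages)

    def __getitem__(self, index: int) -> str:
        return self._pages[index]
```

With this shape, callers that need both the page count and the page contents share one open document instead of each method re-opening the file.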

mara004 avatar Sep 08 '22 13:09 mara004

Pipfile and requirements.txt still need to be updated properly, but I'm not familiar with this form of dependency pinning. Maybe a project member can finalise this?

mara004 avatar Sep 08 '22 15:09 mara004

@cschenio @buddhawang

mara004 avatar Sep 23 '22 10:09 mara004

@cschenio can you take a look? thanks!

buddhawang avatar Sep 27 '22 03:09 buddhawang

@mara004 thank you for revisiting this, let's see if I can de-dup the PDF loading logic.

cschenio avatar Sep 28 '22 03:09 cschenio

Thanks for the response! I'll need to update this PR again. It's quite some time ago that I initially submitted this, and a few things seem outdated now.

mara004 avatar Sep 28 '22 10:09 mara004

I force-pushed a commit that, I hope, nicely restructures rendering. I ran the test suite, which seems to work. Note that I had to replace the expected result because pypdfium2 uses RGB rather than RGBA where possible.

However, it looks like preprocess_multi_page_bundle() is currently not covered by tests, and I'm not sure how to invoke that function. Could you please check it still works as expected?
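
For the RGB versus RGBA change mentioned above, one alternative to replacing the expected result is to normalise an old RGBA expectation to RGB before comparing. The sketch below is an illustration only, not what this PR does. `rgba_to_rgb` is a hypothetical helper, and it assumes tightly packed 8-bit pixels with a fully opaque alpha channel.

```python
def rgba_to_rgb(pixels: bytes) -> bytes:
    """Drop the alpha byte from tightly packed 8-bit RGBA pixel data.

    Refuses non-opaque pixels, because discarding a meaningful alpha
    channel would silently change the image.
    """
    if len(pixels) % 4 != 0:
        raise ValueError("RGBA data length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(pixels), 4):
        r, g, b, a = pixels[i:i + 4]
        if a != 255:
            raise ValueError(f"pixel {i // 4} is not fully opaque")
        out += bytes((r, g, b))
    return bytes(out)
```

A test could then compare `rgba_to_rgb(old_expected)` against the new RGB output byte for byte, where `old_expected` is whatever raw pixel buffer the existing fixture holds.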

mara004 avatar Sep 29 '22 12:09 mara004

I think this is ready for review again.

mara004 avatar Sep 29 '22 13:09 mara004

> I think this is ready for review again.

Good to know. I will take a look at it later.

cschenio avatar Oct 03 '22 03:10 cschenio

FYI, I am still planning to release a new major version that will change the rendering API a bit. This will take some time. I plan to update the patch set when pypdfium2 v4 is released.

mara004 avatar Oct 22 '22 13:10 mara004

Coming back to this: I think the rewrite will still take quite some time, so you could also review and merge this before v4 is released. We can then update your code in a follow-up PR (which will be much smaller than this one).

mara004 avatar Nov 08 '22 17:11 mara004