
pdf2json Performance over large PDF

Open barneydunning opened this issue 9 years ago • 9 comments

Hi All,

I have a PDF file that contains about 500 pages (3.6 MB) - I can't post it because it contains sensitive data. When I load it through pdf2json, it takes about 10 minutes to fire the dataReady callback... is this expected?

I am running the Node application on a MacBook Pro (i7, 16GB)... and seriously expected it to be faster.

The PDF contents are of a timetable nature... and all I want to extract are the text strings and their x/y locations, grouped by page.
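For context, my loading code is basically the standard pdf2json pattern; a simplified version is below (the file name is a placeholder, and the output layout differs between pdf2json versions - older releases nest the pages under formImage):

```js
const PDFParser = require("pdf2json");

const pdfParser = new PDFParser();

pdfParser.on("pdfParser_dataError", (err) => console.error(err.parserError));
pdfParser.on("pdfParser_dataReady", (data) => {
  // Older pdf2json releases nest pages under formImage; newer ones do not.
  const pages = data.Pages || data.formImage.Pages;

  // Keep only what we need: text and its x/y position, grouped by page.
  const perPage = pages.map((page) =>
    page.Texts.map((t) => ({
      x: t.x,
      y: t.y,
      text: decodeURIComponent(t.R.map((r) => r.T).join("")),
    }))
  );

  console.log(`parsed ${perPage.length} pages`);
});

pdfParser.loadPDF("timetable.pdf"); // placeholder file name
```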

Does anyone else have performance issues with pdf2json... or does anyone have suggestions for other Node modules to use for this purpose?

Looking forward to some help... and I'm happy to answer any questions.

Ta.

barneydunning avatar Jun 22 '16 17:06 barneydunning

The biggest PDF files in the unit tests are under 8 pages; it has never been tested with a 'large' file. If performance is an issue, I'd recommend splitting the file into smaller ones before parsing, since smaller PDFs are well tested and perform well.
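For example, a page-range split could be scripted with a library such as pdf-lib (just one option; command-line tools like pdftk or qpdf can do the same job). A rough sketch, with the chunk size and file names as placeholders:

```js
// Sketch: split a large PDF into smaller chunks before handing them to pdf2json.
const fs = require("fs/promises");
const { PDFDocument } = require("pdf-lib");

async function splitPdf(inputPath, pagesPerChunk = 100) {
  const srcDoc = await PDFDocument.load(await fs.readFile(inputPath));
  const total = srcDoc.getPageCount();

  for (let start = 0; start < total; start += pagesPerChunk) {
    const end = Math.min(start + pagesPerChunk, total);
    const chunkDoc = await PDFDocument.create();
    const indices = Array.from({ length: end - start }, (_, i) => start + i);
    const pages = await chunkDoc.copyPages(srcDoc, indices);
    pages.forEach((p) => chunkDoc.addPage(p));
    await fs.writeFile(`chunk-${start + 1}-${end}.pdf`, await chunkDoc.save());
  }
}

splitPdf("large-timetable.pdf").catch(console.error); // placeholder file name
```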

modesty avatar Jul 02 '16 23:07 modesty

Hi there... thanks for the reply.

With so many downloads I am surprised no one else has hit this issue. The PDF files that we need to import are outside our control, so we cannot reduce their size. They can be anything from one page to 1,500 pages.

Are there any input options that cut down the amount of work this plugin does when preparing the data? The only information we require is the textual data along with its x and y coordinates.

Looking forward to your response.

Many thanks, Barney

barneydunning avatar Jul 04 '16 08:07 barneydunning

One option is to update the stream implementation from per-file to per-page, so processing starts to flow as soon as a single page's data is ready. That would improve responsiveness, but it won't reduce the total processing time for large PDFs.
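To illustrate the idea, here is a purely hypothetical sketch of how a per-page event could be consumed; the event name is imagined and is not part of the current pdf2json API:

```js
// Hypothetical sketch only: pdf2json currently fires dataReady once for the whole file.
const PDFParser = require("pdf2json");

const pdfParser = new PDFParser();

// Imagined event name -- not part of the published API.
pdfParser.on("pdfParser_pageReady", (page, pageIndex) => {
  // Downstream work (e.g. writing one page's text somewhere) could start here,
  // long before the final dataReady for a 1,500-page file.
  console.log(`page ${pageIndex}: ${page.Texts.length} text runs`);
});

pdfParser.on("pdfParser_dataError", (err) => console.error(err.parserError));
pdfParser.loadPDF("large-timetable.pdf"); // placeholder file name
```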

modesty avatar Jul 05 '16 18:07 modesty

Yep, that's a shame. I take it there is no way of speeding up the process by limiting what it ends up outputting? For example, asking it to do only specific types of work when loading the PDF document.

What would be the cause of the slowness... is it string manipulation or something similar in the inner workings of the module?

barneydunning avatar Jul 07 '16 15:07 barneydunning

We could use child processes to parse pages in parallel. That would not only improve responsiveness but also reduce the total time for such large files. I would love to contribute and create a PR for it if you agree.
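A rough sketch of what I have in mind, assuming the large PDF has already been split into chunk files (file names below are placeholders):

```js
// parent.js -- fan the chunk files out to worker processes.
const { fork } = require("child_process");

const chunks = ["chunk-1-100.pdf", "chunk-101-200.pdf", "chunk-201-300.pdf"];
const results = [];
let done = 0;

for (const file of chunks) {
  const worker = fork("./parse-worker.js", [file]);
  worker.on("message", (pages) => {
    results.push({ file, pages });
    if (++done === chunks.length) console.log("all chunks parsed:", results.length);
  });
}

// parse-worker.js -- parses one chunk with pdf2json and sends the pages back.
const PDFParser = require("pdf2json");

const pdfParser = new PDFParser();
pdfParser.on("pdfParser_dataReady", (data) => {
  process.send(data.Pages || data.formImage.Pages); // layout differs by version
  process.exit(0);
});
pdfParser.on("pdfParser_dataError", (err) => {
  console.error(err.parserError);
  process.exit(1);
});
pdfParser.loadPDF(process.argv[2]);
```

Each worker pays pdf2json's startup cost, but on a multi-core machine the chunks parse concurrently, so the wall-clock time for a file with over a thousand pages should drop considerably.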

kishorsharma avatar Jul 29 '16 20:07 kishorsharma

I don't seem to have this issue. I have tried parsing an 11 MB PDF, and the dataReady callback fires in under a minute.

I am running the Node application on my MacBook Pro (i5, 8GB).

Here's the PDF that I tested: https://drive.google.com/file/d/0BzR-ZOIycHumX3hsbTVWbFMyQlU/view?usp=sharing

AshishGogna avatar Aug 02 '16 14:08 AshishGogna

Sorry for the delay... damn holidays huh?! Well I am back now, so here goes...

Although the PDFs I am using are only ~4 MB, each page (and there are ~1,300 of them) has a grid of tabular data (about 8x8)... and some "cells" can contain up to six text items stacked vertically. So it might not be about the size of the PDF, but rather its contents and their structure.

kishorsharma - if you could look into speeding this up using child processes, then I would be happy to test your code. Any advance on 10 minutes would be a big bonus!

Please let me know your thoughts.

barneydunning avatar Aug 10 '16 13:08 barneydunning

Any update?

wanghaisheng avatar Dec 23 '16 04:12 wanghaisheng

Any update on this? How do you split the PDF, or get the raw text content per page?

therepo90 avatar Nov 28 '24 17:11 therepo90