
Functions that can be multi-threaded - Enhancement to documentation

Open sandzone opened this issue 2 years ago • 5 comments

With reference to #91

Is extract_tables the only function with this issue?

I am using multiprocessing with extract_words and haven't run into this issue so far. I wonder whether that's just luck, or whether extract_words doesn't depend on the document-wide `._tokens` state that @jsvine mentioned in #91.

It would be very helpful if this aspect were mentioned in the documentation.
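(For reference, here is a minimal sketch of the multiprocessing pattern described above. The per-page helper below is a stand-in: in a real script each worker process would open the PDF itself with pdfplumber.open and call extract_words on its assigned page, so no handle or parser state is shared across processes. The path "test.pdf" and all function names are illustrative, not part of pdfplumber's API.)

```python
from concurrent.futures import ProcessPoolExecutor

PDF_PATH = "test.pdf"  # illustrative path


def extract_words_from_page(page_number):
    # Stand-in for the real per-page work. In practice each worker
    # would open its own handle, e.g.:
    #   with pdfplumber.open(PDF_PATH) as pdf:
    #       return pdf.pages[page_number].extract_words()
    # Opening the file per worker avoids sharing state across processes.
    return f"words-from-page-{page_number}"


def extract_all(num_pages):
    # One task per page; each task runs in a separate worker process
    with ProcessPoolExecutor() as executor:
        return list(executor.map(extract_words_from_page, range(num_pages)))


if __name__ == "__main__":
    print(extract_all(3))
```

Because each process opens the file independently, this trades some open/parse overhead per worker for complete isolation of the document state.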

sandzone avatar Sep 22 '23 05:09 sandzone

Interesting. My best guess is "just luck," since they use the same underlying PDF-parsing process.

jsvine avatar Sep 22 '23 19:09 jsvine

I was able to use multi-threading with no problem :) You need to use ThreadPoolExecutor instead of the lower-level threading.Thread.

Pk13055 avatar Nov 11 '23 13:11 Pk13055

Thanks for the note, @Pk13055! Are you able to share some code that demonstrates your approach?

jsvine avatar Nov 13 '23 22:11 jsvine

Here's a small example I put together. It may not run off the bat, but it should give the general idea:

from asyncio import gather, run, to_thread

import pdfplumber


async def process_page(page):
    # extract_tables() is blocking, so hand it off to a worker thread
    # (asyncio.to_thread needs Python 3.9+); awaiting it directly would
    # process the pages one after another with no concurrency
    processed = await to_thread(page.extract_tables)
    # do other stuff with page
    return processed


async def main():
    # keep the PDF open until all pages have been processed
    with pdfplumber.open("test.pdf") as pdf:
        results = await gather(*(process_page(page) for page in pdf.pages))
        # do something with results


if __name__ == "__main__":
    run(main())

I found this approach to be much faster than using a ThreadPoolExecutor, but here's an example anyway:

from concurrent.futures import ThreadPoolExecutor, as_completed

import pdfplumber


def process_page(page):
    # plain function: submitting an async def to an executor would just
    # return an un-awaited coroutine object instead of the tables
    processed = page.extract_tables()
    # do other stuff with page
    return processed


def main():
    # keep the PDF open until every future has been consumed, since the
    # worker threads read from the underlying file stream
    with pdfplumber.open("test.pdf") as pdf:
        with ThreadPoolExecutor() as executor:
            futures = [executor.submit(process_page, page) for page in pdf.pages]
            for res in as_completed(futures):
                processed = res.result()
                # do something with processed


if __name__ == "__main__":
    main()

Pk13055 avatar Nov 16 '23 12:11 Pk13055

Thanks! @sandzone: Does @Pk13055's approach work for you?

jsvine avatar Nov 17 '23 21:11 jsvine