
Table picker for PDF

Open sambitdash opened this issue 8 years ago • 4 comments

Natural tabular objects in a PDF document should ideally be picked up for extraction.

The intent of the project is API development, hence it will be headless for the most part. Unlike a reader, there may not be a WYSIWYG picker available. A heuristic table picker should scan the document for table-like structures and dump them in tabular HTML/CSS format or as extracted image objects. In case document tagging is enabled, the table picker can use the tagged text.

sambitdash avatar Jul 12 '17 11:07 sambitdash

I have written some lines of code to extract tabular data. Currently it is keyword-based: keywords determine which text layouts to include. I also managed to make a short IJulia notebook where you can interactively select text in a Plotly chart. @sambitdash Would you be interested in including that code in your package? Otherwise I might release my own package, but I feel that this functionality would fit nicely into PDFIO.

hhaensel avatar May 09 '22 07:05 hhaensel

@hhaensel thank you for your interest. I want to understand how complex the cases are that this software can handle. If you submit a PR, I can review it and let you know whether it is useful for this SDK.

sambitdash avatar May 09 '22 16:05 sambitdash

Sounds perfect, I'll submit a PR tomorrow. The code extracts a vector of TextLayouts as a function of page(s) and keywords, then scans for common elements in rows and columns as a function of their layout box. The layout boxes can be scaled in order to reduce the probability of overlapping areas. Optionally a Plotly graph displays the elements and their recognised arrangement with a color code.

Looking forward to your feedback.

hhaensel avatar May 09 '22 20:05 hhaensel
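Editor's illustration, not the code from the comment above: a minimal sketch of how grouping by layout box could work, assuming each text element comes with a bounding box. The `Box` type and the `scale`, `cluster` and `to_table` helpers are hypothetical names invented for this example; nothing here calls PDFIO or reflects its API, and Python is used only to keep the sketch self-contained. Elements whose scaled boxes overlap on the y axis are treated as one row, and those overlapping on the x axis as one column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical text element: its text plus a bounding box (x0, y0) to (x1, y1)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def scale(box, factor):
    """Scale a box about its center; factor < 1 shrinks it to reduce accidental overlaps."""
    cx, cy = (box.x0 + box.x1) / 2, (box.y0 + box.y1) / 2
    hw = (box.x1 - box.x0) / 2 * factor
    hh = (box.y1 - box.y0) / 2 * factor
    return Box(box.text, cx - hw, cy - hh, cx + hw, cy + hh)


def cluster(intervals):
    """Label each (lo, hi) interval; intervals that overlap, directly or via
    a chain of overlaps, share a label. Labels increase with lo."""
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    labels = [0] * len(intervals)
    label, end = -1, float("-inf")
    for i in order:
        lo, hi = intervals[i]
        if lo > end:      # gap before this interval: start a new group
            label += 1
            end = hi
        else:             # overlaps the current group: extend it
            end = max(end, hi)
        labels[i] = label
    return labels


def to_table(boxes, factor=0.8):
    """Arrange boxes into a list of rows of cell strings."""
    scaled = [scale(b, factor) for b in boxes]
    col_of = cluster([(b.x0, b.x1) for b in scaled])
    # Assumption: y grows upward (as in PDF user space), so negate y
    # to make the topmost row come first.
    row_of = cluster([(-b.y1, -b.y0) for b in scaled])
    n_rows = max(row_of, default=-1) + 1
    n_cols = max(col_of, default=-1) + 1
    table = [[""] * n_cols for _ in range(n_rows)]
    for b, r, c in zip(boxes, row_of, col_of):
        # Elements landing in the same cell are joined in input order.
        table[r][c] = (table[r][c] + " " + b.text).strip()
    return table


boxes = [
    Box("Item", 50, 700, 90, 712),
    Box("Price", 200, 700, 240, 712),
    Box("Apple", 50, 680, 95, 692),
    Box("1.20", 205, 680, 235, 692),
]
print(to_table(boxes))  # [['Item', 'Price'], ['Apple', '1.20']]
```

The scaling factor matters when neighbouring boxes overlap slightly: two rows whose boxes overlap by one unit merge into a single row at `factor=1.0` but separate at `factor=0.8`. A real implementation would also need to handle cells that span several columns, which this sketch merges into one wide column.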

Sorry, currently in overload, will take some more time ...

hhaensel avatar May 18 '22 19:05 hhaensel