Fakabbir Amin
Can you share the steps or the error you encountered during the process?
@pietermarsman At present, the HTML output is the best representation of the PDF. I think what @hason mentioned is extracting only the text from the PDF. In that case,...
Sure, let me grab some patience and time.
@jstockwin Yes, I will probably devote some time to this and some other issues too.
Currently, a fork of python-pdfbox is available that works smoothly: pip install python-pdfbox-v2
@mara004 As far as I remember, #29 was not merged or working when I discovered the breaking changes due to pdfbox v3. If #29 is working now, it's great and...
The issue is mainly due to a conflicting dependency under Python 3.9. Try running in a fresh Python 3.10 environment; that should fix it.
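To confirm which interpreter is actually running before retrying, a quick check like the one below may help. This is only a sketch: the helper name python_is_at_least is made up for illustration, and the 3.10 threshold is taken from the suggestion above, not from any documented requirement of the package.

```python
import sys


def python_is_at_least(major, minor, version_info=None):
    """Return True if the given (or current) interpreter is at least major.minor.

    Hypothetical helper for illustration; version_info defaults to the
    running interpreter's sys.version_info.
    """
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) >= (major, minor)


if __name__ == "__main__":
    if python_is_at_least(3, 10):
        print("Python %d.%d detected, no change needed." % tuple(sys.version_info[:2]))
    else:
        print("Python %d.%d detected, try a fresh 3.10 environment." % tuple(sys.version_info[:2]))
```

Running this inside the environment where the install fails shows whether the old interpreter is still being picked up.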
Hi, the pdf.xml files have to be generated via Tesseract. https://github.com/fakabbir/OCR/blob/master/src/OCRScript.py#L23
Hi, OCR refers to the extraction of text. In order to convert the text to key-value pairs, you would require rules which may not be exactly the way it's in this...
It's more of a way to revert to the base environment, which is generally used by default.
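As a rough illustration of what such rules could look like, here is a minimal sketch that turns already-extracted OCR text into key-value pairs. It is not taken from the repository: the rule ("key: value" on one line), the names KEY_VALUE_RULE and extract_key_values, and the sample text are all assumptions, and real documents will likely need their own rules.

```python
import re

# Assumed rule: a key that starts with a letter, then a colon, then the
# value on the same line. This is only an example, not a general solution.
KEY_VALUE_RULE = re.compile(r"^\s*([A-Za-z][A-Za-z0-9 ]*?)\s*:\s*(.+?)\s*$")


def extract_key_values(ocr_text):
    """Apply the rule line by line and collect the matches in a dict."""
    pairs = {}
    for line in ocr_text.splitlines():
        match = KEY_VALUE_RULE.match(line)
        if match:
            pairs[match.group(1)] = match.group(2)
    return pairs


# Made-up sample standing in for OCR output.
sample = "Invoice No: 1234\nTotal : 56.00\nThank you"

if __name__ == "__main__":
    print(extract_key_values(sample))
```

Lines that do not fit the rule (like the last line of the sample) are simply skipped, which is why the rules have to be adapted to each document layout.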