Simon Platten

14 comments by Simon Platten

I think what I am asking can be achieved by getting tesseract to use the "hocr" option, which makes it output HTML that includes box coordinates for each...

OK, I've modified tesseract.js by inserting command.push("hocr"); at line 70. This results in the output being HTML with box coordinates for every text item. Is there another way of doing without...
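For anyone wanting to pull the coordinates back out of that HTML, here is a rough sketch. It assumes tesseract-style hOCR, where each word is a span with class "ocrx_word" and a title attribute beginning "bbox x0 y0 x1 y1". The function name parseHocrWords is made up for illustration and is not part of node-tesseract, and a proper HTML parser would be more robust than a regex.

```javascript
// Sketch only: parseHocrWords is a hypothetical helper, not a node-tesseract API.
// Assumes each word looks like:
//   <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
function parseHocrWords(hocr) {
  const wordPattern =
    /<span[^>]*class=['"]ocrx_word['"][^>]*title=['"]bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>([\s\S]*?)<\/span>/g;
  const words = [];
  let match;
  while ((match = wordPattern.exec(hocr)) !== null) {
    const [, x0, y0, x1, y1, inner] = match;
    words.push({
      // Drop any inline markup (e.g. <strong>) wrapped around the word text.
      text: inner.replace(/<[^>]+>/g, "").trim(),
      x0: Number(x0),
      y0: Number(y0),
      x1: Number(x1),
      y1: Number(y1),
    });
  }
  return words;
}
```

The bbox values are pixel coordinates in the source image, with (x0, y0) the top-left corner and (x1, y1) the bottom-right corner of each word.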

After searching around, it seems the built-in, supported way to do this is to add a 'format' option to the options array, specifying 'hocr' as the value. [edit]...unfortunately it...

Does anyone maintain this module anymore?

@reecefenwick, thank you. I searched around today and, from what I was able to find, node-tesseract seems to be the best module for Node.js. I will modify the...

Can you post a bit more of the source code, at least the rest of the parameters? There isn't much to go on, but it looks like path isn't a string...

To translate the page units to any coordinate system, do this: take the width and height returned in the parsed PDF object, then work out scale factors according to the...
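As a minimal sketch of that idea: the code below assumes the scale factor on each axis is simply the target size divided by the page size in page units, and that both systems share the same origin and axis direction. The names makePageScaler, pageWidth, pageHeight, targetWidth and targetHeight are illustrative only and do not come from the module.

```javascript
// Sketch only: makePageScaler is a hypothetical helper, not part of any PDF module.
// pageWidth/pageHeight: page size in page units, as reported by the parsed PDF object.
// targetWidth/targetHeight: size of the same page in the coordinate system you want.
function makePageScaler(pageWidth, pageHeight, targetWidth, targetHeight) {
  const scaleX = targetWidth / pageWidth;
  const scaleY = targetHeight / pageHeight;
  // Returns a function that converts one point from page units to target units.
  return (x, y) => ({ x: x * scaleX, y: y * scaleY });
}

// Usage: a page 50 x 100 units in size mapped onto a 500 x 1000 pixel canvas.
const toPixels = makePageScaler(50, 100, 500, 1000);
const point = toPixels(10, 20); // { x: 100, y: 200 }
```

If the target system has a flipped Y axis or a different origin, an offset or a sign change would be needed on top of the scaling.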

I have the same question, and what units are they in? Looking at pdffont.js in the /lib folder, on line 337 the comment describes sw as the font space width....

Is it possible to get the bounding rectangle without rendering the document? I am using the module to scan a PDF and extract text at specific locations which I would...

Relative to what? As far as I know, it's relative to the anchor position, so just add that on?