pdf2htmlEX

no spaces?

Open • RNCTX opened this issue 8 years ago • 1 comment

As mentioned in another thread, I'm using this tool along with some homemade scripts to generate a fixed-size EPUB3, and everything works great except that the output has no spaces (at all).

When I extract the text with poppler (pdftotext) or Adobe Acrobat Pro 10, the text layer of the PDF has minimal spacing errors, so the spaces are present in the source. Passing --tounicode 1 didn't make any difference.
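For anyone wanting to reproduce the check on the source PDF, here is a minimal sketch that measures how much whitespace the text layer contains. It assumes poppler's `pdftotext` is on the PATH; the function names `extract_text` and `space_ratio` are made up for this example and are not part of any tool mentioned here.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Return the PDF text layer using poppler's pdftotext.

    Passing "-" as the output file makes pdftotext write to stdout.
    Assumes pdftotext is installed and on the PATH.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def space_ratio(text: str) -> float:
    """Fraction of characters that are whitespace (0.0 for empty input)."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


if __name__ == "__main__":
    import sys

    # Usage: python check_spaces.py asdf.pdf
    print(f"whitespace ratio: {space_ratio(extract_text(sys.argv[1])):.3f}")
```

A ratio near zero would point at the PDF text layer itself; a ratio that looks normal for prose suggests the spaces are being lost later, during conversion.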

my options...

/usr/local/bin/pdf2htmlEX --embed-css 0 --embed-font 0 --embed-image 0 --embed-javascript 0 --embed-outline 0 --split-pages 1 --bg-format svg --hdpi $dpi --vdpi $dpi --fit-width $hdpi --fit-height $vdpi --page-filename mybook%04d.page --css-filename mybook.css mybook.pdf

(fit-width and fit-height come from user input; hdpi and vdpi come from ImageMagick math on the original file)
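To make the invocation easier to read, here is a rough sketch that builds the same argument list in Python. It uses only the flags from the command above. The function name `build_command` and its parameter names are hypothetical, and the mapping of values to flags (one computed DPI for both `--hdpi` and `--vdpi`, user-supplied fit dimensions) is an assumption, since the shell variable names in the post don't map one-to-one onto the description.

```python
def build_command(pdf, dpi, fit_width, fit_height, stem="mybook"):
    """Build the pdf2htmlEX argument list used in this issue.

    Assumption: `dpi` is the value computed from the source file, and
    `fit_width` / `fit_height` are the user-supplied page dimensions.
    """
    return [
        "/usr/local/bin/pdf2htmlEX",
        # Keep every asset in its own file rather than inlining it.
        "--embed-css", "0",
        "--embed-font", "0",
        "--embed-image", "0",
        "--embed-javascript", "0",
        "--embed-outline", "0",
        # One output file per page, SVG backgrounds.
        "--split-pages", "1",
        "--bg-format", "svg",
        "--hdpi", str(dpi),
        "--vdpi", str(dpi),
        "--fit-width", str(fit_width),
        "--fit-height", str(fit_height),
        "--page-filename", f"{stem}%04d.page",
        "--css-filename", f"{stem}.css",
        pdf,
    ]
```

The list can be handed to `subprocess.run(cmd, check=True)` as is, which avoids shell quoting problems with the `%04d` pattern.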

example...

asdf.pdf

This file was extracted as PNGs, OCR'd with Tesseract v4, and then reassembled to ensure a clean PDF to work with.

copy/paste from page one of my pdf2htmlEX output...

“takeitallback,andsureenoughthat'sgoingtocomebutitwilltaketime.Firstofallletusaskarathersimplequestion.Howcanwebesure,howcanwetell,whetheranyutteranceistobeclassedas aperformativeornot?Surely,wefeel,weoughttobeabletodothat.Andweshouldobviouslyverymuchliketobeabletosaythatthereisagrammaticalcriterionforthis,somegrammaticalmeansofdecidingwhetheranutteranceisperformative.AlltheexamplesIhavegivenhithertodoin facthavethesamegrammaticalform;theyallofthembegin withtheverbinthefirstpersonsingularpresentindicativeactive-notjustanykindof verbofcourse,butstilltheyallareinfactofthatform.Furthermore,withthese verbsthatIhaveusedthereisatypicalasymmetrybetweentheuseofthispersonandtenseoftheverbandtheuseof thesameverbinotherpersonsandothertenses,andthisasym-metryisratheranimportantclue.Forexample,whenwesay'Ipromisethat...',thecaseisverydifferentfromwhenwesay'Hepromisesthat...',orinthepasttense'Ipromisedthat...'.Forwhenwesay'Ipromisethat;..'wedoperform anactofpromising-wegiveapromise.Whatwedonotdoistoreportonsomebody'sperforminganactofpromising-inparticular,we.donotreportonsome-body'suseof theexpression'Ipromise'.Weactuallydouseitanddothepromising.ButifIsay'Hepromises',orinthepasttense'Ipromised',Ipreciselydoreportonanactofpromising,thatistosayanactofusingthisformula'Ipromise'-Ireportonapresentactofpromisingbyhim,oronapastactofmyown.Thereisthusacleardifferencebetweenourfirstpersonsingularpresentindicativeactive,andotherpersonsandtenses.ThisisbroughtoutbythetypicalincidentoflittleWilliewhoseunclesayshe'llgivehimhalf-a-crown>ifhepromisesnevertosmoketillhe's55.LittleWillie'sanxiousparentwillsay'Ofcoursehepromise”

RNCTX, Feb 10 '17 01:02

update to this...

Assuming pdf2htmlEX is using PDF.js for rendering...

https://github.com/tesseract-ocr/tesseract/issues/712

What I can confirm is that the text comes through as one word per line, which ends up as no spaces at all in the final result. But I think in my case it is probably related to the PDF input I'm using from Tesseract.
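A toy illustration of the effect described above, not a claim about how pdf2htmlEX or Tesseract work internally: if each word arrives as its own fragment with no space character, plain concatenation reproduces the run-together output, while joining with an explicit separator restores it. The function names are hypothetical.

```python
def join_naive(fragments):
    """Concatenate fragments as-is; word boundaries are lost."""
    return "".join(f.strip() for f in fragments)


def join_with_spaces(fragments):
    """Join non-empty fragments with a single space between them."""
    return " ".join(f.strip() for f in fragments if f.strip())


words = ["take", "it", "all", "back,"]
print(join_naive(words))        # takeitallback,
print(join_with_spaces(words))  # take it all back,
```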

RNCTX, Feb 10 '17 23:02