no spaces?
As mentioned in another thread, I'm using this tool along with some homemade scripts to generate a fixed-size EPUB3, and everything works great except that the output has no spaces (at all).
Checked with poppler extraction (pdftotext) and Adobe Acrobat Pro 10, the text layer of the PDF has minimal spacing errors. Setting --tounicode 1 didn't make any difference.
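A rough way to script the text-layer check above, as a sketch only: it assumes poppler's pdftotext is on the PATH and reuses the mybook.pdf name from the command in this post. count_spaces is a helper name made up for illustration.

```shell
#!/bin/sh
# count_spaces (hypothetical helper): read text on stdin and print
# the number of space characters it contains.
count_spaces() {
    # Delete everything except spaces, count what is left, and strip
    # the padding some wc implementations add around the number.
    tr -cd ' ' | wc -c | tr -d '[:space:]'
}

# Extract page 1 of the PDF to stdout ("-") and count its spaces.
# Skipped quietly when pdftotext or the file is not available.
if command -v pdftotext >/dev/null 2>&1 && [ -f mybook.pdf ]; then
    pdftotext -f 1 -l 1 mybook.pdf - | count_spaces
fi
```

A count of zero on a page of running text would suggest the problem is already in the PDF's text layer. A healthy count would suggest it appears later, during conversion.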
my options...
/usr/local/bin/pdf2htmlEX --embed-css 0 --embed-font 0 --embed-image 0 --embed-javascript 0 --embed-outline 0 --split-pages 1 --bg-format svg --hdpi $dpi --vdpi $dpi --fit-width $hdpi --fit-height $vdpi --page-filename mybook%04d.page --css-filename mybook.css mybook.pdf
(--fit-width and --fit-height come from user input; hdpi and vdpi come from ImageMagick math on the original file.)
example...
This file was extracted as PNGs, OCR'd with Tesseract v4, and then reassembled to ensure a clean PDF to work with.
copy/paste from page one of my pdf2htmlEX output...
“takeitallback,andsureenoughthat'sgoingtocomebutitwilltaketime.Firstofallletusaskarathersimplequestion.Howcanwebesure,howcanwetell,whetheranyutteranceistobeclassedas aperformativeornot?Surely,wefeel,weoughttobeabletodothat.Andweshouldobviouslyverymuchliketobeabletosaythatthereisagrammaticalcriterionforthis,somegrammaticalmeansofdecidingwhetheranutteranceisperformative.AlltheexamplesIhavegivenhithertodoin facthavethesamegrammaticalform;theyallofthembegin withtheverbinthefirstpersonsingularpresentindicativeactive-notjustanykindof verbofcourse,butstilltheyallareinfactofthatform.Furthermore,withthese verbsthatIhaveusedthereisatypicalasymmetrybetweentheuseofthispersonandtenseoftheverbandtheuseof thesameverbinotherpersonsandothertenses,andthisasym-metryisratheranimportantclue.Forexample,whenwesay'Ipromisethat...',thecaseisverydifferentfromwhenwesay'Hepromisesthat...',orinthepasttense'Ipromisedthat...'.Forwhenwesay'Ipromisethat;..'wedoperform anactofpromising-wegiveapromise.Whatwedonotdoistoreportonsomebody'sperforminganactofpromising-inparticular,we.donotreportonsome-body'suseof theexpression'Ipromise'.Weactuallydouseitanddothepromising.ButifIsay'Hepromises',orinthepasttense'Ipromised',Ipreciselydoreportonanactofpromising,thatistosayanactofusingthisformula'Ipromise'-Ireportonapresentactofpromisingbyhim,oronapastactofmyown.Thereisthusacleardifferencebetweenourfirstpersonsingularpresentindicativeactive,andotherpersonsandtenses.ThisisbroughtoutbythetypicalincidentoflittleWilliewhoseunclesayshe'llgivehimhalf-a-crown>ifhepromisesnevertosmoketillhe's55.LittleWillie'sanxiousparentwillsay'Ofcoursehepromise”
update to this...
Assuming pdf2htmlEX is using PDF.js for rendering...
https://github.com/tesseract-ocr/tesseract/issues/712
What I can confirm is one word per line, which ends up as no spaces at all in the final result. But I think in my case it is probably related to the PDF input I'm using, which comes from Tesseract.
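To check the "one word per line" symptom on the input PDF itself, something like the following could be used. It is a sketch under assumptions: pdftotext is installed, mybook.pdf is the Tesseract-generated file, and single_word_lines is a hypothetical helper, not part of any tool.

```shell
#!/bin/sh
# single_word_lines (hypothetical helper): read text on stdin and print
# "<single>/<total>", where <total> is the number of non-empty lines and
# <single> is how many of those hold exactly one word.
single_word_lines() {
    awk 'NF > 0 { total++ }
         NF == 1 { single++ }
         END { printf "%d/%d\n", single, total }'
}

# Run the check on page 1 of the PDF when the tool and file are present.
if command -v pdftotext >/dev/null 2>&1 && [ -f mybook.pdf ]; then
    pdftotext -f 1 -l 1 mybook.pdf - | single_word_lines
fi
```

If nearly every non-empty line holds a single word, the text layer is probably being read one word at a time, which would fit the missing spaces in the converted pages.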