Improve --space-as-offset: determine spaces by unicode

Open duanyao opened this issue 11 years ago • 2 comments

Fix #445. Now --space-as-offset works on "unicode space", i.e. after decoding the text, instead of on ASCII SPACE before decoding. This change should also increase the opportunities for converting spaces to offsets. However, for PDFs with bad unicode support this may still drop chars, though I haven't found an example yet.

duanyao avatar Nov 15 '14 05:11 duanyao
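To make the difference concrete, here is a minimal sketch of the two ways of deciding whether a glyph is a space. It is not the code from the patch: `Glyph`, `is_unicode_space`, `is_space_by_code` and `is_space_by_unicode` are hypothetical names, and treating every Unicode space separator (category Zs) as a space is an assumption; the actual change may only test for U+0020 after decoding.

```cpp
#include <cstdint>

// Hypothetical glyph record: the raw character code taken from the PDF
// string, plus the Unicode value it decodes to (if the font provides one).
struct Glyph {
    std::uint32_t char_code;  // code before decoding
    bool has_unicode;         // false if no usable Unicode mapping exists
    char32_t unicode;         // decoded value, valid only if has_unicode
    double advance;           // horizontal advance, usable as an offset
};

// Assumption: "unicode space" means any code point in category Zs.
bool is_unicode_space(char32_t u) {
    return u == 0x0020 || u == 0x00A0 || u == 0x1680 ||
           (u >= 0x2000 && u <= 0x200A) ||
           u == 0x202F || u == 0x205F || u == 0x3000;
}

// Old behaviour as described above: compare the undecoded code with ASCII SPACE.
bool is_space_by_code(const Glyph& g) {
    return g.char_code == 0x20;
}

// New behaviour as described above: look at the decoded Unicode value.
bool is_space_by_unicode(const Glyph& g) {
    return g.has_unicode && is_unicode_space(g.unicode);
}
```

With a custom font encoding, a glyph whose code is 3 but which decodes to U+0020 is only caught by the second check, and a glyph whose code happens to be 0x20 but which decodes to a letter is no longer dropped by mistake.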

--space-as-offset is not guaranteed to work if either the ToUnicode mapping or the font encoding is corrupted. In fact, I had a few test cases before in which the font encoding was OK but ToUnicode was missing or corrupted. In my experience, ToUnicode mappings have more problems than font encodings, especially in old PDF files.

It seems that old PDF generators/converters were not able to handle this well; after all, ToUnicode has nothing to do with printing, and it is indeed optional in the standard.

I'm not sure this is a good solution. Or possibly we could take the --to-unicode parameter into account, i.e. whether we trust the mapping at all.

coolwanglu avatar Nov 15 '14 05:11 coolwanglu

If ToUnicode is missing, can we just ignore --space-as-offset 1 for that font automatically? We could also add --space-as-offset 2 to force it on even if ToUnicode is missing. However, it seems impossible to detect whether ToUnicode or the font encoding is corrupted.

duanyao avatar Nov 15 '14 06:11 duanyao
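The per-font policy proposed in the last comment could look roughly like the sketch below. It is only an illustration of the proposal: the value 2 for --space-as-offset is a suggestion from this thread, not an existing option value, and `SpaceAsOffset`, `FontInfo` and `should_convert_spaces` are hypothetical names. As noted above, a present but corrupted ToUnicode cannot be detected, so the sketch only distinguishes "present" from "missing".

```cpp
// Hypothetical mapping of the option values discussed in this thread.
enum class SpaceAsOffset {
    Off = 0,    // never convert spaces to offsets
    Auto = 1,   // convert only for fonts that have a ToUnicode mapping
    Force = 2   // proposed: convert even if ToUnicode is missing
};

// Hypothetical per-font summary; only the presence of ToUnicode is known.
struct FontInfo {
    bool has_to_unicode;
};

// Decide, for one font, whether spaces may be replaced by offsets.
bool should_convert_spaces(SpaceAsOffset mode, const FontInfo& font) {
    switch (mode) {
        case SpaceAsOffset::Off:
            return false;
        case SpaceAsOffset::Auto:
            return font.has_to_unicode;
        case SpaceAsOffset::Force:
            return true;
    }
    return false;  // unknown value: behave as if the option were off
}
```

Making the decision per font rather than per document keeps the conversion enabled for well-formed fonts in a PDF that also contains fonts without ToUnicode.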