pdfparser

'ti' ligature not parsed and/or displayed correctly

Open GreyWyvern opened this issue 2 years ago • 0 comments

In fonts such as Calibri, the letter pair 't' and 'i' is encoded as a single 'ti' ligature glyph when a document is converted to PDF. However, I don't believe Unicode actually has a code point for a 'ti' ligature, and since PdfParser tries to convert all extracted text to UTF-8, the ligature shows up as a missing code point.
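As a quick sanity check of the claim above, here is a minimal Python sketch (not PdfParser code) that asks the Unicode character database for a ligature by name. The helper `find_ligature` is a hypothetical name introduced for illustration. It relies on the standard ligatures such as 'fi' being named `LATIN SMALL LIGATURE FI`, and assumes a 'ti' ligature, if it existed, would follow the same naming pattern.

```python
import unicodedata

def find_ligature(suffix):
    """Return the character named 'LATIN SMALL LIGATURE <suffix>', or None."""
    try:
        return unicodedata.lookup(f"LATIN SMALL LIGATURE {suffix}")
    except KeyError:
        # No character with that name in the Unicode database.
        return None

fi = find_ligature("FI")
ti = find_ligature("TI")

# The 'fi' ligature exists and folds back to plain letters under NFKC.
fi_folded = unicodedata.normalize("NFKC", fi) if fi else None

print("fi ligature:", f"U+{ord(fi):04X}" if fi else None, "folds to", fi_folded)
print("ti ligature:", ti)
```

If the lookup for 'TI' comes back empty, the font has to expose the glyph through some non-standard code, which would be consistent with what the extracted text shows.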

Example PDF: What in tarnation.pdf

Considering that copy-pasting the text straight from the PDF also produces an unknown glyph in place of the 'ti' ligature, I'm not sure this issue can be fixed within PdfParser. But the fact that the PDF displays the glyph properly suggests that it might be. It's possible this is another Identity-H encoding issue.

The bytes encoding the ligature are (I believe): f480869f
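Taking the reported bytes at face value, a short Python sketch (again, not PdfParser code) can decode them as UTF-8 and show which Unicode range they land in. The mapping table `PUA_LIGATURES` and the helper `replace_pua_ligatures` are hypothetical names for a possible post-processing workaround. Private-use code points have no standard meaning, so the assumption that this one stands for "ti" comes only from this report and may not hold for other fonts or PDFs.

```python
import unicodedata

# Bytes quoted in the report, interpreted as UTF-8.
raw = bytes.fromhex("f480869f")
ch = raw.decode("utf-8")

code_point = ord(ch)
category = unicodedata.category(ch)  # "Co" means private use

print(f"U+{code_point:06X}, category {category}")

# Assumption: in the example PDF this private-use character stands for "ti".
PUA_LIGATURES = {ch: "ti"}

def replace_pua_ligatures(text, table=PUA_LIGATURES):
    """Replace known private-use ligature characters with plain letters."""
    return text.translate(str.maketrans(table))

extracted = "What in tarna" + ch + "on"
print(replace_pua_ligatures(extracted))
```

The bytes decode to a code point in Supplementary Private Use Area-B, which would explain why neither copy-paste nor a UTF-8 conversion yields a recognisable character. A lookup table like the one sketched here could only ever be a per-font workaround applied to the extracted text.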

GreyWyvern, Sep 29 '23 16:09