Wrong decoding of custom encoded glyphs

Open tillhainbach opened this issue 6 years ago • 1 comments

Issue description: After text extraction, some characters came out wrong. In my case, all digits were replaced with special characters.

Detailed description: On further inspection, I noticed that the PDF uses some kind of custom encoding. A "Differences" list is provided. However, the PDF uses suffixed glyph names (e.g. 'one.oldstyle'). The encodingdb.name2unicode() function has no branch or test case to catch this and infer that the glyph is most probably 'one'.
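To make the failure mode concrete, here is a minimal sketch. `GLYPH_TABLE`, `lookup` and `try_lookup` are hypothetical stand-ins for illustration only, not pdfminer's actual table or code; the sketch only assumes that names are resolved by an exact-match dictionary lookup.

```python
# Hypothetical two-entry stand-in for a glyph-name-to-Unicode table.
GLYPH_TABLE = {"one": "1", "two": "2"}


def lookup(name):
    """Exact-match lookup with no handling of suffixed glyph names."""
    return GLYPH_TABLE[name]


def try_lookup(name):
    """Return the mapped character, or None when the name is not in the table."""
    try:
        return lookup(name)
    except KeyError:
        return None


print(try_lookup("one"))           # plain name: found in the table
print(try_lookup("one.oldstyle"))  # suffixed variant: no exact match
```

With an exact-match table, the suffixed name simply has no entry, so whatever fallback the caller uses ends up producing the wrong character.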

Fix: see below

Jan 08 '20 13:01 tillhainbach

Fix (a little hacky, but anyway): in encodingdb.py I added:

`STRIP_PARTAFTERDOT = re.compile(r'[a-z]*')`

In name2unicode(name), I added:

```python
m = STRIP_PARTAFTERDOT.search(name)[0]
if m:
    return glyphname2unicode[m]
```
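One caveat with the pattern above: `[a-z]*` only matches a leading run of lowercase letters, so a name such as 'A.swash' gives an empty match and falls through, and a base name missing from the table would raise a KeyError. A slightly more general sketch is to drop everything from the first period onward, which, as far as I understand it, is also what the Adobe Glyph List specification describes for suffixed names. `GLYPH_TABLE`, `base_glyph_name` and `name_to_unicode` below are hypothetical names for illustration, not part of pdfminer.

```python
from typing import Optional

# Hypothetical stand-in for the real glyph table (assumption for illustration).
GLYPH_TABLE = {"one": "1", "A": "A", "Aacute": "\u00c1"}


def base_glyph_name(name: str) -> str:
    """Drop everything from the first period onward, e.g. 'one.oldstyle' -> 'one'."""
    return name.split(".", 1)[0]


def name_to_unicode(name: str) -> Optional[str]:
    """Try the full name first, then the base name; return None if neither is known."""
    if name in GLYPH_TABLE:
        return GLYPH_TABLE[name]
    return GLYPH_TABLE.get(base_glyph_name(name))


print(name_to_unicode("one.oldstyle"))
print(name_to_unicode("A.swash"))
print(name_to_unicode("unknown.alt"))
```

Trying the exact name before the stripped one keeps existing lookups unchanged, and returning None instead of raising leaves the fallback decision to the caller.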

Jan 08 '20 13:01 tillhainbach