pdfminer
Wrong decoding of custom-encoded glyphs
Issue description: After text extraction, some characters came out wrong. In my case, all digits were replaced with special characters.
Detailed description: On further inspection, I noticed that the PDF uses a custom encoding. A "Differences" list is provided, but it uses non-standard glyph names (e.g. 'one.oldstyle'). The encodingdb.name2unicode() function has no case that catches this and infers that the glyph is most probably 'one'.
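Fix: a little hacky, but anyway. In encodingdb.py I added:
```
# Matches everything before the first '.', e.g. 'one' in 'one.oldstyle'.
STRIP_PARTAFTERDOT = re.compile(r'[^.]+')
```
and in name2unicode(name) I added:
```
m = STRIP_PARTAFTERDOT.match(name)
# Only use the base name if it is a known glyph; otherwise fall through.
if m and m.group(0) in glyphname2unicode:
    return glyphname2unicode[m.group(0)]
```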
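
For anyone who wants to try the idea outside pdfminer, here is a minimal standalone sketch of the same suffix-stripping fallback. It is not pdfminer's code: `GLYPHS` is a tiny hypothetical stand-in for the real `glyphname2unicode` table, and `lookup_glyph` and `BASE_NAME` are names made up for this illustration.

```python
import re

# Hypothetical stand-in for pdfminer's glyphname2unicode table (a few entries only).
GLYPHS = {"one": "1", "two": "2", "A": "A"}

# Everything before the first period is treated as the base glyph name.
BASE_NAME = re.compile(r"[^.]+")


def lookup_glyph(name, table=GLYPHS):
    """Return the character for a glyph name, ignoring a suffix such as '.oldstyle'."""
    # Exact match first, so ordinary names behave as before.
    if name in table:
        return table[name]
    # Fallback: retry with the part before the first '.'.
    m = BASE_NAME.match(name)
    if m and m.group(0) in table:
        return table[m.group(0)]
    # Unknown name: signal it the same way a plain dict lookup would.
    raise KeyError(name)
```

With this table, `lookup_glyph("one.oldstyle")` returns `"1"`, while a name whose base part is unknown still raises `KeyError` instead of silently mapping to the wrong character.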