
Some Chinese fonts produce mojibake

Open puppet-king opened this issue 3 years ago • 0 comments

  • PHP Version:
  • PDFParser Version: v2.3.0

Description:

When \Smalot\PdfParser\Parser::parseFile reads the file with file_get_contents, a character is wrongly escaped into ">\b". It should be ">".

As a result, $this->tableSizes['from'] is computed incorrectly (it becomes 3 characters), which causes the mojibake.
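To illustrate why a wrong size garbles the text, here is a minimal sketch. It is not pdfparser's actual code: `splitCodes` and `$codeSize` are hypothetical names, and it assumes the decoder cuts the raw string into fixed-size character codes before looking each one up in the ToUnicode table.

```php
<?php
// Illustration only, not pdfparser's implementation.
// $codeSize is a hypothetical stand-in for $this->tableSizes['from'].

/**
 * Split raw bytes into fixed-size codes and return each code as a hex string.
 */
function splitCodes(string $raw, int $codeSize): array
{
    return array_map('bin2hex', str_split($raw, $codeSize));
}

// Three 2-byte character codes: 0x0001, 0x0002, 0x0003.
$raw = "\x00\x01\x00\x02\x00\x03";

// With the right size, each chunk is one valid code.
echo implode(' ', splitCodes($raw, 2)), PHP_EOL;

// With a size of 3, the chunk boundaries drift and no chunk matches a real code.
echo implode(' ', splitCodes($raw, 3)), PHP_EOL;
```

Under that assumption, every lookup after the first misaligned chunk hits the wrong table entry (or none), which would be consistent with the garbled output reported here.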

For now, my only workaround is to patch this file directly:
vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php

PDF input

FontType

Array
(
    [Name] => AAAANE+SimSun
    [Type] => Type0
    [Encoding] => Identity-H
    [BaseFont] => AAAANE+SimSun
    [DescendantFonts] => Array
        (
            [0] => Array
                (
                    [Name] => AAAANE+SimSun
                    [Type] => CIDFontType2
                    [Encoding] => Ansi
                    [BaseFont] => AAAANE+SimSun
                    [Subtype] => CIDFontType2
                )

        )

    [Subtype] => Type0
    [ToUnicode] => Array
        (
            [Filter] => FlateDecode
            [Length] => 236
        )

)

Expected output & actual output

Expected output: 溪雨观酸菜鱼(临平路店)

Actual output: (N�&@ >\bfFY `�S:k�)Tj

Code

        $parser = new \Smalot\PdfParser\Parser();
        $pdf = $parser->parseFile($filepath);
        $text = $pdf->getText();

puppet-king, Feb 14 '23 05:02