pdfparser
Some Chinese fonts produce mojibake
- PHP Version:
- PDFParser Version: v2.3.0
Description:
In \Smalot\PdfParser\Parser::parseFile, the content read with file_get_contents has a character wrongly escaped as ">\b"; it should be ">".

As a result, $this->tableSizes['from'] goes wrong: it becomes 3 characters, which causes the mojibake.
For now, my only workaround is to modify this file:
vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php
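
To illustrate the kind of misalignment described above, here is a minimal sketch. It assumes the font uses two-byte codes and that the \b escape reaches the decoder unresolved, which is my reading of the symptom rather than a confirmed diagnosis. unescapeLiteral and toCodes are hypothetical helpers written for this example, not pdfparser functions, and the sample string is made up.

```php
<?php
// Illustrative sketch only: these helpers are hypothetical, not part of pdfparser.

/**
 * Resolve the single-character escapes allowed in a PDF literal string.
 * Octal escapes and line continuations are left out to keep this short.
 */
function unescapeLiteral(string $raw): string
{
    $map = [
        'n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\f",
        '(' => '(', ')' => ')', '\\' => '\\',
    ];
    $out = '';
    $len = strlen($raw);
    for ($i = 0; $i < $len; $i++) {
        if ($raw[$i] === '\\' && $i + 1 < $len && isset($map[$raw[$i + 1]])) {
            $out .= $map[$raw[++$i]]; // consume the escape as a single byte
        } else {
            $out .= $raw[$i];
        }
    }
    return $out;
}

/** Split a byte string into fixed-width codes and return each one as hex. */
function toCodes(string $bytes, int $width): array
{
    return array_map('bin2hex', str_split($bytes, $width));
}

// Made-up sample: three two-byte codes, the middle one written as ">" + "\b".
// Single quotes keep the backslash and the "b" as two literal characters.
$raw = 'AB>\bCD';

// Escape resolved first: 6 bytes, three clean two-byte codes.
echo implode(' ', toCodes(unescapeLiteral($raw), 2)), PHP_EOL; // 4142 3e08 4344

// Escape left in place: 7 bytes, every code after ">" is shifted by one byte.
echo implode(' ', toCodes($raw, 2)), PHP_EOL; // 4142 3e5c 6243 44
```

In the second case the code that should be 3e08 takes up three characters (">", "\", "b"), so all the following two-byte lookups in the ToUnicode table land on the wrong bytes.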

PDF input
FontType
Array
(
    [Name] => AAAANE+SimSun
    [Type] => Type0
    [Encoding] => Identity-H
    [BaseFont] => AAAANE+SimSun
    [DescendantFonts] => Array
        (
            [0] => Array
                (
                    [Name] => AAAANE+SimSun
                    [Type] => CIDFontType2
                    [Encoding] => Ansi
                    [BaseFont] => AAAANE+SimSun
                    [Subtype] => CIDFontType2
                )
        )
    [Subtype] => Type0
    [ToUnicode] => Array
        (
            [Filter] => FlateDecode
            [Length] => 236
        )
)
Expected output & actual output
Expected output: 溪雨观酸菜鱼(临平路店)
Actual output: (N�&@ >\bfFY `�S:k�)Tj
Code
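// $filepath is the path to the affected PDF.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($filepath);
// Extract the text of all pages; this is where the mojibake shows up.
$text = $pdf->getText();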