pdfparser
Metadata content garbled for some PDFs
- PHP Version: 7.4.33
- PDFParser Version: v2.10.0
Description:
For some PDFs (e.g., the one attached), the metadata is garbled. This seems to happen with PDFs that are encrypted, but I don't know enough about the PDF standard to know whether encryption also applies to metadata.
PDF input
Expected output & actual output
The output from mutool is what I expect; for example, the Title is "SYSTEMATICS OF THE AMERICAN KATYDIDS (ORTHOPTERA: TETTIGONIIDAE). COMMUNICATION 2":
mutool info TZ_316_4_Gorochov.pdf
TZ_316_4_Gorochov.pdf:
PDF-1.6
Info object (68 0 R):
<</CreationDate(D:20121225141316+04'00')/Author(A.V. Gorochov)/Creator(PScript5.dll Version 5.2.2)/Producer(Acrobat Distiller 9.5.2 \(Windows\))/ModDate(D:20121225161815+04'00')/Title(SYSTEMATICS OF THE AMERICAN KATYDIDS \(ORTHOPTERA: TETTIGONIIDAE\). COMMUNICATION 2)>>
Encryption object (70 0 R):
<</Length 128/Filter/Standard/O<0EBA1908E5CD53B188213637794EA65838027C93E38494B55544F4375B294C90>/P -1036/R 3/U<8049AC430DA9683FBBC0F5C6392E856600000000000000000000000000000000>/V 2>>
Pages: 22
...
What I get from PdfParser is the following:
*** Metadata ***
Array
(
[CreationDate] => CŠtW“Ò˙Mð,¯š Wgá3agí
ÂQ©wèAuthor] => F…Iœ§E
[Creator] => Wþ%Õ’^J…Vt¾øt?[ºzqbäÿ#i
[Producer] => FÎÞ†^_é[²l÷Â}>ì:dyøí%
a¤»fi²å
[ModDate] => CŠtW“Ò˙Mð.¯Œ Tgá3agí
¨wèu³pô.@‘Ïˇ{@[òÜ¡ÐèU^éÛ3x=؈"¬OÔLŽOˆFêfl½‚,‹'f H‚6
[Pages] => 22
)
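On the question of whether encryption covers metadata: as far as I understand the PDF specification, the Standard security handler encrypts the strings in ordinary objects, which includes the Info dictionary, while the encryption dictionary itself stays in clear text. With /Filter/Standard, /V 2 and /Length 128 as shown by mutool, those strings would be RC4-encrypted, so the garbled values above look like ciphertext that was read without being decrypted. Below is a minimal sketch of that per-object RC4 step, not PdfParser code. The function names (rc4, objectKey, decryptPdfString) are hypothetical, and $fileKey is assumed to be already known: deriving it needs the password, /O, /P and the trailer /ID, and the /ID is not shown in this report.

```php
<?php
// Sketch only: RC4 string decryption for the PDF Standard security handler
// with /V 2. These helpers are illustrative and are not part of PdfParser.

/** Plain RC4. The same routine encrypts and decrypts. */
function rc4(string $key, string $data): string
{
    $s = range(0, 255);
    $keyLen = strlen($key);
    $j = 0;
    // Key scheduling: permute the state array using the key bytes.
    for ($i = 0; $i < 256; $i++) {
        $j = ($j + $s[$i] + ord($key[$i % $keyLen])) & 0xFF;
        [$s[$i], $s[$j]] = [$s[$j], $s[$i]];
    }
    // Keystream generation, XORed byte by byte with the input.
    $i = $j = 0;
    $out = '';
    $len = strlen($data);
    for ($n = 0; $n < $len; $n++) {
        $i = ($i + 1) & 0xFF;
        $j = ($j + $s[$i]) & 0xFF;
        [$s[$i], $s[$j]] = [$s[$j], $s[$i]];
        $out .= chr(ord($data[$n]) ^ $s[($s[$i] + $s[$j]) & 0xFF]);
    }
    return $out;
}

/**
 * Per-object key: MD5 of the file key followed by the low 3 bytes of the
 * object number and the low 2 bytes of the generation number (both
 * little-endian), truncated to min(file key length + 5, 16) bytes.
 */
function objectKey(string $fileKey, int $objNum, int $genNum): string
{
    $material = $fileKey
        . substr(pack('V', $objNum), 0, 3)
        . pack('v', $genNum);
    return substr(md5($material, true), 0, min(strlen($fileKey) + 5, 16));
}

/** Decrypts one string that belongs to object "$objNum $genNum". */
function decryptPdfString(string $fileKey, int $objNum, int $genNum, string $cipher): string
{
    return rc4(objectKey($fileKey, $objNum, $genNum), $cipher);
}
```

For this file the Info dictionary is object 68 0, so with the real 16-byte file key each raw value (Author, Title, and so on) would go through decryptPdfString($fileKey, 68, 0, $rawValue). mutool presumably performs the equivalent step internally, which would explain why its output is readable.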
Code
<?php
// Example of PDF with bad characters
require_once __DIR__ . '/vendor/autoload.php';
$filename = 'TZ_316_4_Gorochov.pdf';
$parser_config = new \Smalot\PdfParser\Config();
$parser_config->setRetainImageContent(false);
// parse the file even though it is flagged as encrypted
$parser_config->setIgnoreEncryption(true);
$parser = new \Smalot\PdfParser\Parser([], $parser_config);
// parse PDF
$pdf = $parser->parseFile($filename);
// Metadata
if (method_exists($pdf, 'getDetails')) {
    $metadata = $pdf->getDetails();
    echo "*** Metadata ***\n";
    print_r($metadata);
}
?>