Metadata content garbled for some PDFs

Open rdmpage opened this issue 1 year ago • 0 comments

PHP Version: 7.4.33
PDFParser Version: v2.10.0

Description:

For some PDFs (e.g., the one attached) the metadata is garbled. This seems to be associated with PDFs that are encrypted, but I don't know enough about the PDF standard to know whether encryption also applies to metadata.
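To check whether the garbling really correlates with encryption, one rough heuristic is to look for an /Encrypt entry in the raw file bytes. The sketch below is only that: `pdfDeclaresEncryption` is a hypothetical helper, not part of PdfParser, and it assumes the /Encrypt key appears literally in the trailer or cross-reference stream dictionary. It can produce false positives if the same byte sequence occurs elsewhere in the file.

```php
<?php

/**
 * Heuristic only: returns true if the raw PDF bytes reference an
 * /Encrypt dictionary. Hypothetical helper, not part of PdfParser.
 */
function pdfDeclaresEncryption(string $bytes): bool
{
    // Matches an indirect reference ("/Encrypt 70 0 R") or an inline
    // dictionary ("/Encrypt<<...>>") anywhere in the file.
    return preg_match('~/Encrypt\s*(?:\d+\s+\d+\s+R|<<)~', $bytes) === 1;
}
```

Calling `pdfDeclaresEncryption(file_get_contents($filename))` before reading the metadata would make it possible to flag files whose details may not be trustworthy.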

PDF input

TZ_316_4_Gorochov.pdf

Expected output & actual output

Output from mutool is what I expect, e.g. the Title is "SYSTEMATICS OF THE AMERICAN KATYDIDS (ORTHOPTERA: TETTIGONIIDAE). COMMUNICATION 2":

mutool info TZ_316_4_Gorochov.pdf
TZ_316_4_Gorochov.pdf:

PDF-1.6
Info object (68 0 R):
<</CreationDate(D:20121225141316+04'00')/Author(A.V. Gorochov)/Creator(PScript5.dll Version 5.2.2)/Producer(Acrobat Distiller 9.5.2 \(Windows\))/ModDate(D:20121225161815+04'00')/Title(SYSTEMATICS OF THE AMERICAN KATYDIDS \(ORTHOPTERA: TETTIGONIIDAE\). COMMUNICATION 2)>>
Encryption object (70 0 R):
<</Length 128/Filter/Standard/O<0EBA1908E5CD53B188213637794EA65838027C93E38494B55544F4375B294C90>/P -1036/R 3/U<8049AC430DA9683FBBC0F5C6392E856600000000000000000000000000000000>/V 2>>
Pages: 22
...

What I get from PdfParser is the following:

*** Metadata ***
Array
(
    [CreationDate] => CŠtW“Ò˙Mð,¯š Wgá3agí
ÂQ©wèAuthor] => F…Iœ§E
    [Creator] => Wþ%Õ’^J…Vt¾øt?[ºzqbäÿ#i
    [Producer] => FÎÞ†^_é[²l÷Â}>ì:dyøí%
                                       a¤»ﬁ²å
    [ModDate] => CŠtW“Ò˙Mð.¯Œ Tgá3agí
¨wèu³pô.@‘Ïˇ{@[òÜ¡ÐèU^éÛ3x=Øˆ"¬OÔLŽOˆFêﬂ½‚,‹'f	H‚6
    [Pages] => 22
)

Code

<?php

// Example of PDF with bad characters

require_once __DIR__ . '/vendor/autoload.php';

$filename = 'TZ_316_4_Gorochov.pdf';

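$parser_config = new \Smalot\PdfParser\Config();
// Discard raw image data; only the metadata is of interest here
$parser_config->setRetainImageContent(false);
// Parse the file even though it declares an /Encrypt dictionary
$parser_config->setIgnoreEncryption(true);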

$parser = new \Smalot\PdfParser\Parser([], $parser_config);

// parse PDF
$pdf = $parser->parseFile($filename);
	
// Metadata
if (method_exists($pdf, 'getDetails'))
{
	$metadata = $pdf->getDetails();

	echo "*** Metadata ***\n";
	print_r($metadata); 

}

?>

Aug 01 '24 16:08 rdmpage