
Parsing with unknown text. Help me resolve

Open rg-str opened this issue 1 year ago • 7 comments

  • PHP Version: 7.4
  • PDFParser Version: 2.9.0

Description:

PDF input

I cannot provide the PDF since it is confidential.

Expected output & actual output

I need to extract a table from it.

Code

$parser = new \Smalot\PdfParser\Parser();

// Source PDF file to extract text
$file = "tables 2024.pdf";

// Parse pdf file using Parser library
//$pdf = $parser->parseFile($file);

$pdf = $parser->parseContent(file_get_contents($file));

// Extract text from PDF
//$text = $pdf->getText();
$text = $pdf->getPages()[2]->getText();
// Add line break
$pdfText = nl2br($text);

/*
$ascii_decoded = mb_convert_encoding($pdfText, 'UTF-8', 'ASCII');
$ansi_decoded = mb_convert_encoding($ascii_decoded, 'UTF-8', 'ISO-8859-1');
$decode1252 = mb_convert_encoding($ansi_decoded, 'UTF-8', 'Windows-1252');
$utf8_decode = utf8_decode($decode1252);
*/

$encodings = ['UTF-8', 'ISO-8859-1', 'Windows-1252', 'Windows-1251', 'ISO-8859-15'];
$decodedText = $pdfText;
foreach ($encodings as $encoding) {
    $decodedText = mb_convert_encoding($decodedText, 'UTF-8', $encoding);
    if ($decodedText) {
        // If decoding is successful, break the loop
        //break;
    }
}
$utf8_decode = utf8_decode($decodedText);

print_r($utf8_decode);

The output is not readable:

ZZZFDOFKRLFHFRP 5HJXODWRU\B6WDWXVBB

%HQHILWV3ODQVQRWHGDV³3HQGLQJ$SSURYDO´KDYHEHHQILOHGZLWKWKH&DOLIRUQLDUHJXODWLQJVWDWHDJHQFLHVDQGDUHSHQGLQJ DSSURYDOZLWKWKRVHVWDWHDJHQFLHV PLATINUM TIER %HQHILW3ODQ+HDOWK3ODQ1HWZRUN1DPH 5HJXODWRU\6WDWXV

rg-str avatar Apr 04 '24 14:04 rg-str
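An observation on the sample above, offered as a sketch rather than a confirmed diagnosis: every garbled character sits exactly 29 code points below a plausible original (for example `5` + 29 is `R`, `H` + 29 is `e`). That pattern would be consistent with a font whose glyph codes were extracted without a mapping back to Unicode, though nothing in this thread confirms the cause. The helper `shiftGlyphCodes` below is hypothetical and not part of PdfParser. It should only be applied to the affected runs, because text that already extracts correctly (such as "PLATINUM TIER") would be garbled by the same shift. Whitespace, control bytes, and non-ASCII bytes are left untouched.

```php
<?php
// Hypothetical helper, not part of PdfParser: undo a constant code offset.
// Assumption: the affected text was extracted as codes that sit 29 below
// the intended ASCII codes, which matches the sample in this issue.
function shiftGlyphCodes(string $garbled, int $offset = 29): string
{
    $out = '';
    $length = strlen($garbled);
    for ($i = 0; $i < $length; $i++) {
        $code = ord($garbled[$i]);
        $shifted = $code + $offset;
        // Shift only printable ASCII bytes whose result is still printable
        // ASCII; everything else is copied through unchanged.
        if ($code >= 0x21 && $shifted <= 0x7E) {
            $out .= chr($shifted);
        } else {
            $out .= $garbled[$i];
        }
    }
    return $out;
}

echo shiftGlyphCodes('5HJXODWRU\\B6WDWXVBB'), PHP_EOL; // Regulatory_Status__
echo shiftGlyphCodes('+02$ $33529('), PHP_EOL;          // HMOA APPROVED
```

This is a workaround for reading the sample, not a fix for the extraction itself.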

PdfParser should output UTF-8 encoded text by default, so I'm not sure what all your mb_converts and utf8_decodes after the getText() are doing.

What's the value of $text right after the getText() ?

GreyWyvern avatar Apr 04 '24 17:04 GreyWyvern
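One way to answer that question is to look at the raw string before any conversion is applied. The sketch below uses a hypothetical helper, `describeText`, which is not part of PdfParser; it reports the byte length, whether the string is valid UTF-8, and a short hex sample. It assumes the mbstring extension is available.

```php
<?php
// Hypothetical diagnostic helper, not part of PdfParser.
// Summarises a string so the raw getText() result can be inspected
// before any mb_convert_encoding() or utf8_decode() call touches it.
function describeText(string $text, int $sampleBytes = 16): array
{
    return [
        'bytes'      => strlen($text),
        'valid_utf8' => mb_check_encoding($text, 'UTF-8'),
        'hex_sample' => bin2hex(substr($text, 0, $sampleBytes)),
    ];
}

// Usage with PdfParser (assumes the package is installed via Composer):
// require 'vendor/autoload.php';
// $parser = new \Smalot\PdfParser\Parser();
// $pdf    = $parser->parseFile('tables 2024.pdf');
// $text   = $pdf->getPages()[2]->getText();
// var_dump(describeText($text));
```

If `valid_utf8` is true and the text is still unreadable, further encoding conversions are unlikely to help, since the bytes are already well-formed.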

> PdfParser should output UTF-8 encoded text by default, so I'm not sure what all your mb_converts and utf8_decodes after the getText() are doing.
>
> What's the value of $text right after the getText() ?

$text right after getText() has the same value; the output did not change.

rg-str avatar Apr 05 '24 06:04 rg-str

We need to see at least some of the original output of getText(). It has the same value of what?

k00ni avatar Apr 05 '24 08:04 k00ni

> We need to see at least some of the original output of getText(). It has the same value of what?

```
$pdf = $parser->parseFile($pdfFilePath);
$pages = $pdf->getPages();
$text = $pdf->getPages()[2]->getText();
print_r($text);
```

This is the output printed in the browser:
 ZZZFDOFKRLFHFRP  5HJXODWRU\B6WDWXVBB %HQHILWV3ODQVQRWHGDV³3HQGLQJ$SSURYDO´KDYHEHHQILOHGZLWKWKH&DOLIRUQLDUHJXODWLQJVWDWHDJHQFLHVDQGDUHSHQGLQJ DSSURYDOZLWKWKRVHVWDWHDJHQFLHV  PLATINUM TIER %HQHILW3ODQ+HDOWK3ODQ1HWZRUN1DPH 5HJXODWRU\6WDWXV +02$ $QWKHP%OXH&URVV6HOHFW+02 $33529(' +02& +HDOWK1HW:KROH&DUH 3(1',1*$33529$/ +02' +HDOWK1HW6DOXG+02\0DV 3(1',1*$33529$/ +02( +HDOWK1HW)XOO 3(1',1*$33529$/ +02) +HDOWK1HW :KROH&DUH 3(1',1*$33529$/ +02* +HDOWK1HW 6DOXG+02\0DV 3(1',1*$33529$/ +02+ +HDOWK1HW )XOO 3(1',1*$33529$/ +02, +HDOWK1HW 6PDUW&DUH 3(1',1*$33529$/ +02- +HDOWK1HW 6PDUW&DUH 3(1',1*$33529$/ +02$ .DLVHU3HUPDQHQWH)XOO $33529(' +02% .DLVHU3HUPDQHQWH)XOO $33529(' HMO C* Kaiser Permanente Full APPROVED +02$ 6KDUS+HDOWK3ODQ3UHPLHU $33529(' +02% 6KDUS+HDOWK3ODQ3HUIRUPDQFH $33529(' +02& 6KDUS+HDOWK3ODQ3UHPLHU $33529(' +02$ 6XWWHU+HDOWK3OXV6XWWHU+HDOWK3OXV $33529(' +02% 6XWWHU+HDOWK3OXV6XWWHU+HDOWK3OXV $33529(' +02$ 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +02% 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +02& 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02( 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +02* 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02+ 8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +02, 8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +02- 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02. 
8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +02/ 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +020 8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +021 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02$ :HVWHUQ+HDOWK$GYDQWDJH)XOO $33529(' +02% :HVWHUQ+HDOWK$GYDQWDJH)XOO $33529(' +02& :HVWHUQ+HDOWK$GYDQWDJH)XOO $33529(' (32& &LJQD2VFDU/RFDO3OXV 3(1',1*$33529$/ (32( &LJQD2VFDU/RFDO3OXV 3(1',1*$33529$/ (32) &LJQD2VFDU2SHQ$FFHVV3OXV 3(1',1*$33529$/ (32* &LJQD2VFDU2SHQ$FFHVV3OXV 3(1',1*$33529$/ 332$ $QWKHP%OXH&URVV3UXGHQW%X\HU±6PDOO*URXS$33529(' * New Plan  Regulatory Status Status as of December 5, 2023  ZZZFDOFKRLFHFRP  5HJXODWRU\B6WDWXVBB %HQHILWV3ODQVQRWHGDV³3HQGLQJ$SSURYDO´KDYHEHHQILOHGZLWKWKH&DOLIRUQLDUHJXODWLQJVWDWHDJHQFLHVDQGDUHSHQGLQJ DSSURYDOZLWKWKRVHVWDWHDJHQFLHV  PLATINUM TIER %HQHILW3ODQ+HDOWK3ODQ1HWZRUN1DPH 5HJXODWRU\6WDWXV +02$ $QWKHP%OXH&URVV6HOHFW+02 $33529(' +02& +HDOWK1HW:KROH&DUH 3(1',1*$33529$/ +02' +HDOWK1HW6DOXG+02\0DV 3(1',1*$33529$/ +02( +HDOWK1HW)XOO 3(1',1*$33529$/ +02) +HDOWK1HW :KROH&DUH 3(1',1*$33529$/ +02* +HDOWK1HW 6DOXG+02\0DV 3(1',1*$33529$/ +02+ +HDOWK1HW )XOO 3(1',1*$33529$/ +02, +HDOWK1HW 6PDUW&DUH 3(1',1*$33529$/ +02- +HDOWK1HW 6PDUW&DUH 3(1',1*$33529$/ +02$ .DLVHU3HUPDQHQWH)XOO $33529(' +02% .DLVHU3HUPDQHQWH)XOO $33529(' HMO C* Kaiser Permanente Full APPROVED +02$ 6KDUS+HDOWK3ODQ3UHPLHU $33529(' +02% 6KDUS+HDOWK3ODQ3HUIRUPDQFH $33529(' +02& 6KDUS+HDOWK3ODQ3UHPLHU $33529(' +02$ 6XWWHU+HDOWK3OXV6XWWHU+HDOWK3OXV $33529(' +02% 6XWWHU+HDOWK3OXV6XWWHU+HDOWK3OXV $33529(' +02$ 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +02% 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +02& 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02( 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +02* 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02+ 8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +02, 8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +02- 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02. 
8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +02/ 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +020 8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +021 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02$ :HVWHUQ+HDOWK$GYDQWDJH)XOO $33529(' +02% :HVWHUQ+HDOWK$GYDQWDJH)XOO $33529(' +02& :HVWHUQ+HDOWK$GYDQWDJH)XOO $33529(' (32& &LJQD2VFDU/RFDO3OXV 3(1',1*$33529$/ (32( &LJQD2VFDU/RFDO3OXV 3(1',1*$33529$/ (32) &LJQD2VFDU2SHQ$FFHVV3OXV 3(1',1*$33529$/ (32* &LJQD2VFDU2SHQ$FFHVV3OXV 3(1',1*$33529$/ 332$ $QWKHP%OXH&URVV3UXGHQW%X\HU±6PDOO*URXS$33529(' * New Plan  Regulatory Status Status as of December 5, 2023


Hope this helps. Thanks.

rg-str avatar Apr 05 '24 10:04 rg-str

test1.pdf

For the PDF above, getText() returns an empty string. Please help me. If possible, I would like to get this in a table format so I can generate a CSV/Excel file.

rg-str avatar Apr 05 '24 11:04 rg-str

> test1.pdf: for the PDF above, getText() returns an empty string

There is no readable / editable text in the document; it is a scanned image. OCR would be required to extract the text.

GreyWyvern avatar Apr 05 '24 18:04 GreyWyvern

> > test1.pdf: for the PDF above, getText() returns an empty string
>
> There is no readable / editable text in the document; it is a scanned image. OCR would be required to extract the text.

Is there a way to distinguish pages with readable text from pages that are scanned images, and to run OCR only on the latter? Does the smalot library offer any option for OCR extraction? If not, please suggest an OCR library for that.

rg-str avatar Apr 08 '24 05:04 rg-str
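A minimal sketch of the first half of that question, assuming a page that yields only whitespace from getText() is a scanned image, as with test1.pdf above. The helper `splitPagesByTextLayer` is hypothetical and not part of PdfParser. The pages it flags would then have to go to an external OCR tool (Tesseract is one commonly used option), since the replies in this thread indicate that OCR is a separate step from text extraction.

```php
<?php
// Hypothetical helper, not part of PdfParser: split page texts into those
// with extractable text and those that probably need OCR.
// Assumption: a page whose extracted text is empty or whitespace-only
// is a scanned image with no text layer.
function splitPagesByTextLayer(array $pageTexts): array
{
    $result = ['text' => [], 'needs_ocr' => []];
    foreach ($pageTexts as $index => $text) {
        $key = trim($text) === '' ? 'needs_ocr' : 'text';
        $result[$key][] = $index;
    }
    return $result;
}

// Usage with PdfParser (assumes $pdf came from $parser->parseFile()):
// $texts = array_map(fn($page) => $page->getText(), $pdf->getPages());
// $split = splitPagesByTextLayer($texts);
// $split['needs_ocr'] then holds the indexes of pages to send to OCR.
```

This check would not catch pages like the one earlier in the thread, where text is returned but is unreadable; those need a separate test.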