pdf-reader issues

PDF with multiple columns doesn't extract text properly

2

When I tried to extract text from a PDF with a two-column layout, the text is read row by row...

nus-kingsley
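
The row-by-row reading described in the multi-column report above can be illustrated with a small sketch. It does not use pdf-reader at all: `Run`, `two_column_text`, `column_lines`, and the `split_x` boundary are hypothetical names made up for illustration, and the sketch assumes exactly two columns separated at a known x coordinate. It only shows the general idea of splitting positioned text into columns before assembling lines.

```ruby
# Hypothetical stand-in for a positioned text fragment (not pdf-reader's own class).
Run = Struct.new(:x, :y, :text)

# Groups runs that share a baseline (y) into lines.
# PDF y coordinates grow upward, so lines are sorted from highest y to lowest.
def column_lines(runs)
  runs.group_by(&:y)
      .sort_by { |y, _| -y }
      .map { |_, line| line.sort_by(&:x).map(&:text).join(" ") }
end

# Assumes exactly two columns, split at the x coordinate split_x.
# Emits every line of the left column before any line of the right column.
def two_column_text(runs, split_x)
  left, right = runs.partition { |run| run.x < split_x }
  [left, right].flat_map { |column| column_lines(column) }.join("\n")
end

sample_runs = [
  Run.new(50, 700, "Left one"), Run.new(320, 700, "Right one"),
  Run.new(50, 680, "Left two"), Run.new(320, 680, "Right two")
]
puts two_column_text(sample_runs, 300)
```

A real document would need the column boundary detected rather than passed in, which is the hard part this sketch leaves out.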

Page#text does not return all the text

5

For some reason `PDF::Reader#text` does not return all the text on a PDF file I'm scanning. However, I'm able to get the text by looking at the runs directly. Here...

3ynm

Extra spaces between letters in a single word

2

I noticed this gem has problems parsing some PDFs where the text is not necessarily clean. For instance, this file: https://www.jstor.org/stable/3684663 Some parts of it get output like: "a b...

pickhardt
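
One common way to avoid spurious spaces inside a word, as in the report above, is to insert a space only when the horizontal gap between neighbouring glyphs is large relative to the glyph width. The sketch below is a generic illustration under that assumption and is not how pdf-reader lays out text; `Glyph`, `join_glyphs`, and the `ratio` tuning value are hypothetical.

```ruby
# Hypothetical glyph record: x is the left edge, width the advance (both in points).
Glyph = Struct.new(:x, :width, :char)

# Joins glyphs left to right, inserting a space only when the gap between
# neighbours exceeds `ratio` times the mean glyph width.
# The default ratio of 0.5 is an assumed tuning value, not a known constant.
def join_glyphs(glyphs, ratio: 0.5)
  return "" if glyphs.empty?

  sorted = glyphs.sort_by(&:x)
  threshold = ratio * sorted.sum(&:width) / sorted.size.to_f

  sorted.each_cons(2).reduce(sorted.first.char.dup) do |text, (prev, cur)|
    gap = cur.x - (prev.x + prev.width)
    text << (gap > threshold ? " " : "") << cur.char
  end
end
```

With a threshold tied to the mean width, small kerning gaps stay inside the word while genuine word breaks still produce a space.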

Width in run elements is too small/col_count too high

I am not sure if this is a problem with the PDF itself, but it seems that when mapping the `mean_character_width` from `@runs` in `initialize` of `lib/pdf/reader/page_layout.rb`, the width...

mmatotan

PDF::Reader::MalformedPDFError (PDF malformed, expected 'endstream' but found '1' instead)

1

Trying to read this pdf --> [h7E6IP36VnmkCJjM3dWL_0.pdf](https://github.com/yob/pdf-reader/files/11969978/h7E6IP36VnmkCJjM3dWL_0.pdf) `PDF::Reader.new("h7E6IP36VnmkCJjM3dWL_0.pdf").page(1).text` Got PDF::Reader::MalformedPDFError (PDF malformed, expected 'endstream' but found '1' instead) Additional Info: gem "pdf-reader", "~> 2.11"

papayalabs

OpenSSL::Cipher::CipherError: bad decrypt

2

I have the following PDF, which is not encrypted, only locked for edits. [bad_decrypt.pdf](https://github.com/yob/pdf-reader/files/14349837/bad_decrypt.pdf) When trying to read it, it raises `OpenSSL::Cipher::CipherError: bad decrypt` error: ```ruby PDF::Reader.new("./bad_decrypt.pdf").pages /app/vendor/bundle/ruby/3.1.0/gems/pdf-reader-2.12.0/lib/pdf/reader/aes_v2_security_handler.rb:37:in `final': bad...

krystof-k

License issue with ttfunk

Hi there! I'm not a lawyer, but `pdf-reader` is under the MIT license (https://github.com/yob/pdf-reader/blob/main/MIT-LICENSE) and uses `ttfunk` as a dependency, which is under the GPL2/GPL3 license (https://github.com/prawnpdf/ttfunk/blob/master/LICENSE). AFAIK mixing both is not...

n-rodriguez

FEATURE: Extract Paragraphs

3

We're using PDF::Reader at Zipline for parsing content out of PDFs. (I also forked this project on our team repo [here](https://github.com/retailzipline/pdf-reader).) We have a number of cases where we want...

judy

Circular references on Page Tree causes PDF::Reader to crash with `SystemStackError`

1

[Pages-tree-refs.pdf](https://github.com/yob/pdf-reader/files/13798724/Pages-tree-refs.pdf) ([source](https://github.com/mozilla/pdf.js/blob/master/test/pdfs/Pages-tree-refs.pdf)) Running the following script with the attached PDF renders the following error: ```ruby require "bundler/inline" gemfile do gem "pdf-reader" end PDF::Reader.new("Pages-tree-refs.pdf").pages # /usr/local/bundle/gems/pdf-reader-2.12.0/lib/pdf/reader/reference.rb:65:in `hash': stack level too deep...

tomascco
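
The `stack level too deep` failure in the report above is the typical result of a recursive tree walk that never notices it has returned to a node it already visited. A minimal sketch of the usual guard follows. It models the page tree as a plain hash of object id to node, which is an assumption for illustration and not pdf-reader's internal representation; `collect_pages` is a hypothetical helper.

```ruby
require "set"

# Walks a hypothetical page tree (object id => node hash) and returns the ids
# of all leaf pages. Every node id is recorded in `visited`; reaching an id a
# second time raises instead of recursing forever.
# Note: this also rejects a node reachable through two parents, which is fine
# for a page tree where each node is expected to have a single parent.
def collect_pages(objects, id, visited = Set.new)
  raise ArgumentError, "circular reference at object #{id}" unless visited.add?(id)

  node = objects.fetch(id)
  return [id] if node[:type] == :Page

  node.fetch(:kids, []).flat_map { |kid| collect_pages(objects, kid, visited) }
end

healthy = { 1 => { type: :Pages, kids: [2, 3] }, 2 => { type: :Page }, 3 => { type: :Page } }
p collect_pages(healthy, 1)

cyclic = { 1 => { type: :Pages, kids: [2, 3] }, 2 => { type: :Page }, 3 => { type: :Pages, kids: [1] } }
begin
  collect_pages(cyclic, 1)
rescue ArgumentError => e
  puts e.message
end
```

Raising a descriptive error turns an unrecoverable `SystemStackError` into something a caller can rescue and report.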

Numerals read as `\u0000` when using font feature settings

1

First of all, thanks for the work and effort you've put into this great library! ## Bug description We are having an issue with numerals not being read correctly by...

SimonEggert
