PyPDF4

Wanted for testing: PDF files with specific features

Open acsor opened this issue 7 years ago • 1 comments

Some of the unit tests I have developed rely on PDF files that have certain features. I have a collection of 109+ PDF books in Calibre, but none of them has the features I need. In particular, I'm looking for:

  1. A PDF file with a /ASCIIHexDecode, or equivalently /AHx stream filter.
  2. A PDF file with a /JPXDecode stream filter.
  3. More PDF files whose objects have /Type equal to /ObjStm, that is to say files that rely on Cross-Reference streams (PDF 1.5+).
  4. A few other hybrid-reference files, as described in section 7.5.8.4 of ISO 32000: files that use a Cross-Reference Table to hide elements stored in a Cross-Reference Stream, understandable by PDF 1.5+ readers only.

The reason for this request is to fill out the project's fixture data collection (in tests/fixture_data/ of my current PR #14). PDF files with these characteristics seem to be rare, so I am asking for your help.

I have performed my searches with a simple grep. For example, for case 2 I ran:

grep -RPi --binary-files=text [--exclude-dir=<whatever you want>] "/JPXDecode" <arbitrary path>
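
For anyone who would rather run the same search from Python, a rough equivalent might look like the sketch below. The function name `find_pdfs_containing` and its parameters are made up for illustration and are not part of PyPDF4. Unlike the grep above, it only looks at files ending in `.pdf`, which is an assumption on my side. Like grep, it searches the raw bytes, so it will miss names that only appear inside compressed object streams or that are written with `#` escapes.

```python
import os


def find_pdfs_containing(root, marker, exclude_dirs=()):
    """Yield paths of .pdf files under root whose raw bytes contain marker.

    marker is a bytes object such as b"/JPXDecode". The comparison is
    case-insensitive, mirroring grep's -i flag.
    """
    needle = marker.lower()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place, similar to grep's --exclude-dir.
        dirnames[:] = [d for d in dirnames if d not in exclude_dirs]
        for name in filenames:
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(dirpath, name)
            # Read as bytes: PDF files are binary and must not be decoded as text.
            with open(path, "rb") as fh:
                if needle in fh.read().lower():
                    yield path
```

Iterating over `find_pdfs_containing("<arbitrary path>", b"/JPXDecode")` then gives the candidate files one by one. Reading each file fully into memory keeps the sketch short; for very large collections a chunked read would be kinder.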

acsor, Sep 23 '18 19:09

1. Files containing "/ASCIIHexDecode"

I found three of these, all scans of books I got from university that someone else created, but I don't feel comfortable publishing them. I could send them to you over a private channel if that would be any help.

asciihexdecode.pdf is one page that I extracted from one of these documents using pdfarranger with PyPDF2.

2. Files containing "/JPXDecode"

  • https://www.elwis.de/DE/Sportschifffahrt/Sportbootfuehrerscheine/Navigationsaufgaben-SKS.pdf?__blob=publicationFile&v=3
  • https://www.cs.uni-mainz.de/files/2018/02/00-LV-Info-SS2018-speicheropt-3.pdf
  • I have more files, but I don't want to publish them.

3. /Type equal to /ObjStm

How can I grep for those? 00-LV-Info-SS2018-speicheropt-3.pdf contains <</Filter/FlateDecode/First 14/Length 343/N 2/Type/ObjStm>>stream. Is this sufficient?
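
A stream dictionary with /Type set to /ObjStm (next to /N and /First) is how the PDF specification marks an object stream, and stream dictionaries themselves are never stored compressed, so a raw byte search should be able to see them. One thing a plain string search can trip over is whitespace: "/Type/ObjStm" and "/Type /ObjStm" are both legal spellings. A minimal sketch that tolerates both, with the helper name `has_object_stream` invented for this example:

```python
import re

# Optional PDF whitespace (NUL, tab, LF, FF, CR, space) between the key
# and the name value; \b keeps longer names such as /ObjStmX from matching.
OBJSTM_RE = re.compile(rb"/Type[\x00\t\n\f\r ]*/ObjStm\b")


def has_object_stream(data):
    """Return True if the raw PDF bytes contain a /Type /ObjStm entry."""
    return OBJSTM_RE.search(data) is not None
```

This is only a heuristic: it would also fire if the same bytes happened to occur inside uncompressed page content or a string.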

4. Other hybrid-reference files

How would I identify those if I had them?
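
One possible check, going by the description in section 7.5.8.4 of ISO 32000: a hybrid-reference file keeps a classic cross-reference table and adds an /XRefStm entry to the trailer dictionary, pointing at the cross-reference stream. Since the trailer is never compressed, both should be visible in the raw bytes. The sketch below is a heuristic under that assumption, not a validator, and the name `looks_hybrid` is made up for this example.

```python
import re

# /XRefStm followed by PDF whitespace and an integer byte offset.
XREFSTM_RE = re.compile(rb"/XRefStm[\x00\t\n\f\r ]+\d+")

# The "xref" keyword at the start of the data or of a line. Requiring a
# preceding line break keeps "startxref" from being counted as a table.
XREF_TABLE_RE = re.compile(rb"(?:^|[\r\n])xref[\x00\t\n\f\r ]")


def looks_hybrid(data):
    """Return True if raw PDF bytes show both a classic xref table and /XRefStm."""
    has_table = XREF_TABLE_RE.search(data) is not None
    has_xrefstm = XREFSTM_RE.search(data) is not None
    return has_table and has_xrefstm
```

A file that passes this check would still need to be opened in a PDF 1.5+ reader and in an older one to confirm that the two really see different sets of objects.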

dreua, Mar 02 '19 17:03