pdfcpu PDF-2.0 support

Please make pdfcpu support "%PDF-2.0" headers, even if nothing else. PDF 2.0 (ISO 32000-2) has been around since 2017 so this would be very useful, even if validate command just generated a warning about otherwise lacking further support.

Test PDF attached.

$ pdfcpu validate pdfcpu-2.0.pdf
validating(mode=relaxed) pdfcpu-2.0.pdf ...
Read: xRefTable failed: headerVersion: unknown PDF Header Version: 2.0

pdfcpu-2.0.pdf

Dec 29 '20 10:12 petervwyatt

Hello!

We are still very much focused on providing better and more robust support for PDF 1.x. Support for PDF 2.0 will be tackled as soon as we get our hands on the spec.

This may be a stupid question because I am not familiar with PDF 2.0: which PDF writer created this sample?

I read something about the info dict being gone in PDF 2.0, or maybe optional, but shouldn't there be some sort of MetaInfo providing this info? Otherwise files like these seem rather suspicious. Again, like I said, I am not familiar with 2.0.

I think just accepting a PDF 2.0 header is not the way to go here without understanding its implications, and to understand these I would need the spec, so please bear with me.

Thank you for using pdfcpu 💚

PS: We have something similar pending: #251

Jan 07 '21 20:01 hhrutter

Again, this was a hand-edited PDF because I can't share the real-world PDF. My issue reduced down to a cause of "%PDF-1.x" vs "%PDF-2.0" in the first line header comment. I really like your validation and -v/-vv technology so I'd be more than happy with a command-line override to allow processing with a big "buyer beware" caveat.

In answer to your other PDF 2.0 questions: ISO 32000-2 is a hugely significant improvement over the PDF 1.7 ISO 32000-1 fast tracked doc as it was developed by a full consensus process in ISO. Many many things were clarified and are now much better spec'd. View it as 95+% a better written, less ambiguous, more precise version of PDF 1.7 - so if you are ever in doubt over what PDF 1.7 means refer to the PDF 2.0 spec and hopefully your questions will get answered!

Basically PDF 2.0 isn't really much different from PDF 1.7 - it's the same COS syntax, except that there are now UTF-8 strings as well as UTF-16BE, and there is a much tighter set of requirements about various lexical behaviors (previously unstated). Yes, there are a bunch of new features, but parsers skip over those like you skip over proprietary extensions now. The DocInfo stuff is deprecated in that the ISO consensus was that XMP is preferred (and XMP is already mandated in all PDF ISO subsets for decades) - but it's not outright banned. And it's still needed for Articles/Beads. "Deprecated" in PDF 2.0 is formally defined as "should not write", but it's not illegal. Any existing PDF 1.x file that complies with PDF 1.7 can theoretically be made PDF 2.0 by just changing its header.

And, definitely not meaning to be snarky, both PDF 1.7 and PDF 2.0 specs state in Annex I, "PDF Versions and Compatibility": "A conforming reader shall attempt to read any PDF file, even if the file’s version is more recent than that of the conforming reader." :-)
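To make the Annex I point concrete, here is a minimal Go sketch of a lenient header check: it parses the "%PDF-M.m" comment and flags a newer version as unknown instead of failing on it. The function name headerVersion and the "known" range of 1.0 to 1.7 are assumptions for illustration only; this is not pdfcpu code and says nothing about how pdfcpu reads headers internally.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// headerVersion is a hypothetical helper (not part of pdfcpu).
// It parses a "%PDF-M.m" header line and reports whether the version
// falls in the range this imaginary reader fully supports (1.0 to 1.7).
// A well-formed but newer version is returned without an error so the
// caller can warn and keep reading.
func headerVersion(line string) (major, minor int, known bool, err error) {
	const prefix = "%PDF-"
	if !strings.HasPrefix(line, prefix) {
		return 0, 0, false, fmt.Errorf("missing %q prefix", prefix)
	}
	v := strings.TrimSpace(strings.TrimPrefix(line, prefix))
	parts := strings.SplitN(v, ".", 2)
	if len(parts) != 2 {
		return 0, 0, false, fmt.Errorf("malformed version %q", v)
	}
	major, err = strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, false, fmt.Errorf("malformed major version %q", parts[0])
	}
	minor, err = strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, false, fmt.Errorf("malformed minor version %q", parts[1])
	}
	known = major == 1 && minor >= 0 && minor <= 7
	return major, minor, known, nil
}

func main() {
	for _, h := range []string{"%PDF-1.7", "%PDF-2.0", "not a pdf"} {
		major, minor, known, err := headerVersion(h)
		switch {
		case err != nil:
			fmt.Printf("%q: error: %v\n", h, err)
		case !known:
			// Newer than this reader: warn, but still attempt to read.
			fmt.Printf("%q: warning: version %d.%d is newer than supported\n", h, major, minor)
		default:
			fmt.Printf("%q: version %d.%d\n", h, major, minor)
		}
	}
}
```

The design choice is simply to separate "malformed header" (an error) from "well-formed but newer than I know" (a warning), which is the distinction the Annex I sentence draws.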

Jan 08 '21 00:01 petervwyatt

👍 point taken

I really don't want to disappoint users, but I just don't feel comfortable adding a hack to digest the PDF 2.0 header without taking the consequences for the project into account, even if I include a warning. I think this would just confuse users and create lots of issues, plus it is not that straightforward. Validation is used not only on the CLI but all over the place. It is scattered around the whole codebase, and there are also API users that I don't want to put off.

Rest assured pdfcpu is slated to support PDF 2.0 at some point.

Meanwhile I suggest you play around in a pdfcpu fork. Your starting point for this endeavour: pdfcpu/pkg/pdfcpu/version.go

Thank you for considering pdfcpu

Jan 08 '21 22:01 hhrutter

Thanks for considering. If/when you get around to thinking more about it, please reach out if you have any questions. Will give it a go... (pun intended).

Jan 09 '21 00:01 petervwyatt

@petervwyatt If you want to check 2.0 documents against 1.7 validation, why not simply replace 2.0 with 1.7 before the check (for example with sed)? Or am I missing something?

Feb 11 '21 11:02 kpym

@kpym Because PDF 2.0 introduced a new basic UTF-8 string type as well as AES-256 and Unicode passwords. So a PDF 1.x file can generally be "re-versioned" up to PDF 2.0 by changing the header and Version key (subject to obsoleted and deprecated features), but that is not necessarily true in reverse.

Feb 14 '21 07:02 petervwyatt

@petervwyatt I was not saying that we can validate general 2.0 files by just replacing 2.0 by 1.7. I was answering in relation to your concrete situation (sorry if I wasn't clear):

My issue reduced down to a cause of "%PDF-1.x" vs "%PDF-2.0" in the first line header comment. I really like your validation and -v/-vv technology so I'd be more than happy with a command-line override to allow processing with a big "buyer beware" caveat.

While waiting for real 2.0 validation, instead of modifying pdfcpu to "generate a warning about otherwise lacking further support", an easy solution is probably to tweak the PDF before passing it to pdfcpu?
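A minimal sketch of that pre-processing step in Go, assuming the file starts exactly with "%PDF-2.0". The helper name downgradeHeader is made up for illustration and is not part of pdfcpu. Because "%PDF-2.0" and "%PDF-1.7" have the same byte length, the rewrite does not shift any byte offsets in the file.

```go
package main

import (
	"bytes"
	"fmt"
)

// downgradeHeader is a hypothetical helper (not part of pdfcpu).
// If data begins with "%PDF-2.0" it returns a copy that begins with
// "%PDF-1.7" instead, and true. Otherwise it returns data unchanged
// and false. Only the header comment is touched; a Version key in the
// document catalog, if present, is left as-is.
func downgradeHeader(data []byte) ([]byte, bool) {
	from := []byte("%PDF-2.0")
	to := []byte("%PDF-1.7")
	if !bytes.HasPrefix(data, from) {
		return data, false
	}
	out := make([]byte, len(data))
	copy(out, data)
	copy(out, to) // overwrite the first 8 bytes in the copy
	return out, true
}

func main() {
	in := []byte("%PDF-2.0\n1 0 obj\n<<>>\nendobj\n")
	out, changed := downgradeHeader(in)
	fmt.Printf("changed=%t header=%q\n", changed, out[:8])
}
```

This only helps for files like the hand-edited sample, where the header is the sole difference. A file that actually uses PDF 2.0 features such as UTF-8 strings or AES-256 encryption would not become a valid 1.7 file this way.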

Feb 14 '21 08:02 kpym

@kpym Yep - definitely part of it, as is checking the current implementation against the far more detailed and precise definitions that PDF 2.0 gives for many aspects of syntax, degenerate cases and data integrity (previously unstated). I just have to increase my Go skills first :-)

Feb 14 '21 10:02 petervwyatt

Are there any updates for this?

Jan 11 '22 21:01 logansam

Nope. The work on form support is ongoing and will be followed by signature handling.

Jan 11 '22 21:01 hhrutter

Are there any further updates for PDF-2.0 support?

Jul 26 '23 17:07 ajohnson-ls

It's in the pipeline, that's all I can tell you for now.

Jul 27 '23 07:07 hhrutter

Basic validation is part of https://github.com/pdfcpu/pdfcpu/releases/tag/v0.6.0

If anybody encounters PDF 2.0 validation issues please open a new issue - Thank you!

Dec 10 '23 19:12 hhrutter