bc-java icon indicating copy to clipboard operation
bc-java copied to clipboard

PGPCanonicalizedDataGenerator possible perfromance improvements and missing advertising options

Open michkot opened this issue 3 years ago • 2 comments

@dghgit maybe I am nitpicking (sorry), I follow-up on https://stackoverflow.com/questions/40144840/bouncycastle-pgp-textmode-in-java-implementation-does-not-convert-to-cr-lf, https://github.com/bcgit/bc-java/issues/1035 and https://github.com/bcgit/bc-java/pull/1040 (thanks a lot for the improvement!):

  • PGPCanonicalizedDataGenerator does not mention BINARY anywhere in javadoc nor has it a BINARY format constant, but all of of the inner helper streams to facilitate the open() methods are aware of BINARY and are able to work with BINARY-format correctly! this is a missed "advertising" possibility, because the PGPCanonicalizedDataGenerator really is a standard-conforming drop-in replacement for PGPLiteralDataGenerator, and can handle format=BINARY too!
  • the ^mentioned BINARY handling is great, but IMO it is sub-optimal in performance; the open() method could just do return PGPLiteralDataGenerator.open(...) in those cases, instead if still creates a sub-class of CRLFGeneratorStream to wrap the PGPLiteralDataGenerator. And the potential "bottle-neck" is that CRLFGeneratorStream only implements the single-byte write(byte b), so even when isBinary is true, and no special action is taken, data is still processed in byte-per-byte manner even if write(byte b[]) has been called.
  • I believe PGPLiteralDataGenerator should mention PGPCanonicalizedDataGenerator in the class javadoc, saying that ensures text formated data is canonically encoded as per the RFC unlike the PGPLiteralDataGenerator.

For now, I will do "if (format == BINARY)" in our application and select either PGPLiteralDataGenerator or PGPCanonicalizedDataGenerator based on that check!

michkot avatar Jul 20 '22 10:07 michkot

Related, but not the main issue here, about indefinite-length inputs to the literal data generators :

  • I believe that PGPLiteralDataGenerator is missing an open() overload with indefinite length that (only) works with oldFormat= true, as the overload org.bouncycastle.openpgp.PGPLiteralDataGenerator#open(java.io.OutputStream, char, java.lang.String, java.util.Date, byte[]) is forcing new packet format implicitly in its call to org.bouncycastle.bcpg.BCPGOutputStream#BCPGOutputStream(java.io.OutputStream, int, byte[]) (same applies to PGPCanonicalizedDataGenerator); Maybe it is intentional, because these are hard to use (must be the last packet in a message)
  • I believe org.bouncycastle.openpgp.PGPCanonicalizedDataGenerator#open(java.io.OutputStream, char, java.lang.String, java.util.Date, java.io.File) is missing javadoc about how the backingFile is used and that everything is piped to the file first to compute the definite-length. The performance/memory/storage overhead should be mention in Javadoc
  • (related to above) I believe org.bouncycastle.openpgp.PGPCanonicalizedDataGenerator#open(java.io.OutputStream, char, java.lang.String, java.util.Date) should mention that this will buffer everything in memory before you do .close() and assumes you WANT to create a definite-length packet. Unlike with org.bouncycastle.openpgp.PGPLiteralDataGenerator#open(java.io.OutputStream, char, java.lang.String, long, java.util.Date) where length is an explicit parameter, this is not obvious from the method signature and/or its Javadoc

michkot avatar Jul 20 '22 10:07 michkot

It's not possible to use BINARY in conjunction with "canonicalization". It's really a text only thing. The standard actually says that a text signature has to be calculated over a canonicalized document as indeed it is. The canonicalized data generator is there to allow people to generate correct signatures and to reformat the text they are using so that an endpoint which fails to follow the standard will still work. Generally the approach of using the generator for text will work, but some platforms (which otherwise calculate signatures correctly) may struggle with changing the text data to include "\r\n". It all depends what you're dealing with.

dghgit avatar Jul 29 '22 12:07 dghgit

Assumed resolved.

dghgit avatar Aug 21 '22 08:08 dghgit

I believe I understand the trickiness and what the standards says about this specifically (that signature type 0x01 = text data signature, is always calculated with network-style=CRLF line-endings; and that literal data packets with format text/utf8 are to have the content stored network-style too), and that I also understood those when writing this.

My main point was about BC PGP (being great thing overall) usability - it has a steep learning curve (pgp itself is complex; no regular html docs for BC, just sample code + javadoc) and some traps lying around. To summarize: PGPCanonicalizedDataGenerator "hiddenly" supports BINARY format - inconsistent with its javadoc & non-performant; I find it (a bit) sad, given that it's the ideal class to use for trivial implementations as it follows the standard (e.g. you just feed your data into PGPCanonicalizedDataGenerator and into PGPSignatureGenerator, specifying the correct message and signature formats, and you will be RFC-standard conformant)

Note: In our solution, I ended up:

  • hacking PGPSignatureGenerator (wrapping PGPContentSignerBuilder in a proxy) to extract the PGPContentSigner/its getOutputStream, and use that OutputStream directly instead of using PGPSignatureGenerator#update (which I would have to wrap into an OutputStream myself anyway)
    • I do this to to circumvent the canonicalization of non-BINARY-content signatures that happens in update()
    • ~~this gives me potential performance boost if the underlying java Signature object supports block updates (and not only byte-by-byte updates)~~ -- see bellow why there likely is not any perf boost
  • always using PGPLiteralDataGenerator (not PGPCanonicalizedDataGenerator )
  • I wire up PGPLiteralDataGenerator-OutputStream and all the PGPSignatureGenerator-SignatureGenerating-OutputStreams together using TeeOuputStreams
  • if literal data format != binary -> I canonicalize (wrap) the "top" TeeOuputStream.
    • My canonicalization stream is a rip-off from the PGPSignatureGenerator#update; Interestingly, the logic used in PGPCanonicalizedDataGenerator is different for some reason (seems to have exactly the same semantics, but is unnecessarily more complex)

michkot avatar Sep 14 '22 11:09 michkot

Can you explain the circumstances under which you were seeing canonicalization being applied when not required in update?

dghgit avatar Sep 14 '22 11:09 dghgit

Can you explain the circumstances under which you were seeing canonicalization being applied when not required in update? Nowhere - the .update() is OK.

I said that in our solution I did:

circumvent the canonicalization of non-BINARY-content signatures

= when PGPSignatureGenerator tries to be "smart" and canonicalizes the bytes passed to update() of 0x01 signatures.

Since I do the canonicalization elsewhere (in a dedicated stream that gets teed into both literal data and the signers), it's redundant for me. I realized that there is no (measurable) performance gain here (edited the previous post) as for binary data, it passes through the byte array, and for text data, I am forcing single-byte operation in an earlier canonicalization stream in the pipeline anyway, so the second canonicalization does not hurt (much).

michkot avatar Sep 14 '22 13:09 michkot