dataschema for signed or encrypted payload

Open duglin opened this issue 3 years ago • 12 comments

Hi,

As mentioned in some issues (#373 and #379), a good option for signing or encrypting data in a CloudEvent is to use one of the JOSE formats (JWS or JWE). My question is: should dataschema then describe the JWS/JWE structure, or should it describe the schema of the decrypted information, i.e. the payload of the JWT?

Using the schema to describe a JWS/JWE does not seem to have any value for the consumer, but the spec says

Identifies the schema that data adheres to.

so I am not sure what the correct approach is here.

Best regards

Originally posted by @jolivaSan in https://github.com/cloudevents/spec/discussions/1077

duglin avatar Sep 09 '22 15:09 duglin

I believe it's meant to be the schema of the unencrypted payload.

@JemDay, since you did https://github.com/cloudevents/spec/pull/399/files, do you think we need to make this clear in the spec?

duglin avatar Sep 09 '22 15:09 duglin

I would suggest that if the 'data' represents an encoded and/or signed payload then the 'dataschema' isn't really appropriate in that situation.

The datacontenttype would be something like application/jose or application/jwt.

I believe there's a pattern whereby the schema of the business data can be declared in a JWS header - I'll try and find a reference.

JemDay avatar Sep 09 '22 16:09 JemDay

The datacontenttype of the business data should be carried in the cty JWS Header. Ref
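
A minimal sketch of how a consumer could read `cty` from the protected header of a compact JWS or JWE, using only the Python standard library. The token below is hypothetical: only its header segment is meaningful, and the helper names are illustrative rather than part of any SDK.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode an unpadded base64url segment as used by JOSE."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def b64url_encode(raw: bytes) -> str:
    """Encode bytes as unpadded base64url."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def inner_content_type(compact_jose: str):
    """Return the `cty` value from the protected header, or None if absent."""
    protected_segment = compact_jose.split(".", 1)[0]
    header = json.loads(b64url_decode(protected_segment))
    return header.get("cty")


# Hypothetical token: the segments after the header are placeholder text.
example_header = {"alg": "ES256", "cty": "application/json"}
example_token = (
    b64url_encode(json.dumps(example_header).encode("utf-8"))
    + ".payload.signature"
)
print(inner_content_type(example_token))
```

The header is readable without any key material, so a consumer can learn the content type of the business data before verifying or decrypting anything.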

JemDay avatar Sep 09 '22 17:09 JemDay

Think we need an issue to clarify this in the spec?

duglin avatar Sep 09 '22 17:09 duglin

Hi,

Just some opinions about that

The datacontenttype would be something like application/jose or application/jwt.

I totally agree with that; the datacontenttype will indicate whether the content is encrypted/signed or in any other format.

I would suggest that if the 'data' represents an encoded and/or signed payload then the 'dataschema' isn't really appropriate in that situation.

If the dataschema describes the JWS/JWE structure instead of the schema of the decrypted information, then I see two main problems:

  • It is not going to be useful for the consumer or any tool, as JWS/JWE are already-defined standards and few changes are expected there.
  • The considerations in the CloudEvents Primer about versioning and dataschema would not apply in this case, because the model that is going to change/evolve is that of the decrypted information.

jolivaSan avatar Sep 12 '22 12:09 jolivaSan

Maybe we can add an extension for the case that the data is encrypted.

  • cryptdatacontenttype will specify the data content type of the decrypted/signed data
  • datacontenttype will specify the data content type of the encrypted payload
  • cryptdataschema will specify the data schema of the decrypted/signed data

What do you think?

sasha-tkachev avatar Sep 14 '22 15:09 sasha-tkachev

I have three different use cases:

  • Event encryption - the whole event is encrypted
  • Payload encryption - only the data field is encrypted
  • Field encryption - data is structured (e.g. JSON), only some fields are encrypted

CloudEvents with encrypted data should be convertible to unencrypted and vice versa. Let's focus on payload encryption.

I'd like to understand what is being proposed here. I think it is that if you want to encrypt the data field of your event, you should use a JWE. A JWE is nothing more than a JWT whose content is encrypted.

{
   "datacontenttype": "application/jwt",
   "data": "xxxxx.yyyyy.zzzzz"
}

The JWT has additional encryption headers (e.g. alg):

{
  "alg": "RSA-OAEP",
  "enc": "A256CBC-HS512",
  "kid": "abdd85",
  "typ": "JWT",
  "cty": "JWT"
}

Where does the datacontenttype of the encrypted content go? @sasha-tkachev pointed this out, and it also applies for dataschema (the encrypted data would not have the same schema).

The answer is, for JWT, nowhere. JWTs only encode JSON. The datacontenttype must be JSON.

JWT does not have non-JSON. Even if it did,

JWE has disadvantages. When writing to Kafka, we cannot just write the bytes. We must write the base64 encoded bytes. This will increase network, CPU, memory, and disk costs.

Any solution should ensure that the data can be a binary document.

I cannot find anything on the multipart/encrypted standard, which implies it is not widely adopted or used. Minimally, we'd need to create a new MIME type to represent the control information (i.e. alg/enc), e.g. `application/ce-encrypted`.

But this seems nice to me:

{
   "datacontenttype": "application/octet-stream; enc=A256CBC-HS512; kid=abdd85; cty=application/json",
   "data_base64": "xxxxx"
}

datacontenttype tells me everything I need to know about the data.
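
A minimal sketch of how a consumer could pull those parameters out of datacontenttype with the Python standard library. The `enc`, `kid`, and `cty` parameters are the ones proposed in this comment, not registered parameters of application/octet-stream, and `parse_content_type` is an illustrative helper name.

```python
from email.message import Message


def parse_content_type(value: str):
    """Split a media type string into its type and a dict of parameters."""
    msg = Message()
    msg["Content-Type"] = value
    # get_params() returns (name, value) pairs; the first pair is the media type itself.
    params = dict(msg.get_params()[1:])
    return msg.get_content_type(), params


media_type, params = parse_content_type(
    "application/octet-stream; enc=A256CBC-HS512; kid=abdd85; cty=application/json"
)
print(media_type)
print(params)
```

With this shape, the receiver can choose the decryption algorithm and key from `enc` and `kid`, and learn from `cty` how to interpret the bytes once they are decrypted.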

Another idea:

{
   "datacontenttype": "application/jwt",
   "data": "xxxxx.yyyyy.zzzzz"
}

The JWT has additional encryption headers (e.g. alg):

{
  "alg": "RSA-OAEP",
  "enc": "A256CBC-HS512",
  "kid": "abdd85",
  "typ": "JWT",
  "cty": "JWT"
}

The content is just the data and datacontenttype fields.

{
   "datacontenttype": "application/json",
   "data": {"foo": "bar"}
}

alexec avatar Sep 16 '22 23:09 alexec

Hi @alexec,

Thanks for your comments, let me add some notes

  • Event encryption - the whole event is encrypted

As the privacy and security section states, "Sensitive information SHOULD NOT be carried or represented in context attributes." I will therefore focus only on the payload of the event, that is, the data field.

The answer is, for JWT, nowhere. JWTs only encode JSON. The datacontenttype must be JSON.

I think that for a JWS or JWE the correct datacontenttype should be application/jose or application/jose+json, as explained in RFC 7515, which says:

This section registers the "application/jose" media type [RFC2046] in the "Media Types" registry [IANA.MediaTypes] in the manner described in RFC 6838 [RFC6838] (https://datatracker.ietf.org/doc/html/rfc6838), which can be used to indicate that the content is a JWS or JWE using the JWS Compact Serialization or the JWE Compact Serialization. This section also registers the "application/jose+json" media type in the "Media Types" registry, which can be used to indicate that the content is a JWS or JWE using the JWS JSON Serialization or the JWE JSON Serialization.

So a possible example of the application/jose content type would look like this for a JWE (the compact serialization is specified in RFC 7516):

{
   "datacontenttype": "application/jose",
   "data": "eyJhbGciOiJSU0EtT0FFUCIsImVuYyI6IkEyNTZHQ00ifQ.
     OKOawDo13gRp2ojaHV7LFpZcgV7T6DVZKTyKOMTYUmKoTCVJRgckCL9kiMT03JGe
     ipsEdY3mx_etLbbWSrFr05kLzcSr4qKAq7YN7e9jwQRb23nfa6c9d-StnImGyFDb
     Sv04uVuxIp5Zms1gNxKKK2Da14B8S4rzVRltdYwam_lDp5XnZAYpQdb76FdIKLaV
     mqgfwX7XWRxv2322i-vDxRfqNzo_tETKzpVLzfiwQyeyPGLBIO56YJ7eObdv0je8
     1860ppamavo35UgoRdbYaBcoh9QcfylQr66oc6vFWXRcZ_ZT2LawVCWTIy3brGPi
     6UklfCpIMfIjf7iGdXKHzg.
     48V1_ALb6US04U3b.
     5eym8TW_c8SuK0ltJ3rpYIzOeDQz7TALvtu6UG9oMo4vpzs9tX_EFShS8iB7j6ji
     SdiwkIr3ajwQzaBtQD_A.
     XFBoMYUZodetZdvTiFvSkQ"
}
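
A minimal sketch of how a consumer could split a JWE Compact Serialization into its five parts and read the protected header without any key material. The header segment is the one from the example above; the other four segments are shortened placeholders, and the helper names are illustrative.

```python
import base64
import json

JWE_PART_NAMES = ("protected", "encrypted_key", "iv", "ciphertext", "tag")


def split_compact_jwe(token: str) -> dict:
    """Split a compact JWE into its five dot-separated parts."""
    # Drop whitespace and line breaks that are only there for display.
    compact = "".join(token.split())
    parts = compact.split(".")
    if len(parts) != len(JWE_PART_NAMES):
        raise ValueError(f"expected 5 parts, got {len(parts)}")
    return dict(zip(JWE_PART_NAMES, parts))


def decode_protected_header(segment: str) -> dict:
    """Decode the base64url-encoded JSON protected header."""
    padding = "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(segment + padding))


# Real header segment from the example above; the remaining parts are placeholders.
token = "eyJhbGciOiJSU0EtT0FFUCIsImVuYyI6IkEyNTZHQ00ifQ.key.iv.ciphertext.tag"
parts = split_compact_jwe(token)
header = decode_protected_header(parts["protected"])
print(header)
```

The header carries `alg` and `enc` but says nothing about the schema of the plaintext, which is the gap this issue is about.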

And an example of the application/jose+json content type would look like this:

{
   "datacontenttype": "application/jose+json",
   "data": {
      "protected":
       "eyJlbmMiOiJBMTI4Q0JDLUhTMjU2In0",
      "unprotected":
       {"jku":"https://server.example.com/keys.jwks"},
      "recipients":[
       {"header":
         {"alg":"RSA1_5","kid":"2011-04-29"},
        "encrypted_key":
         "UGhIOguC7IuEvf_NPVaXsGMoLOmwvc1GyqlIKOK1nN94nHPoltGRhWhw7Zx0-
          kFm1NJn8LE9XShH59_i8J0PH5ZZyNfGy2xGdULU7sHNF6Gp2vPLgNZ__deLKx
          GHZ7PcHALUzoOegEI-8E66jX2E4zyJKx-YxzZIItRzC5hlRirb6Y5Cl_p-ko3
          YvkkysZIFNPccxRU7qve1WYPxqbb2Yw8kZqa2rMWI5ng8OtvzlV7elprCbuPh
          cCdZ6XDP0_F8rkXds2vE4X-ncOIM8hAYHHi29NX0mcKiRaD0-D-ljQTP-cFPg
          wCp6X-nZZd9OHBv-B3oWh2TbqmScqXMR4gp_A"},
       {"header":
         {"alg":"A128KW","kid":"7"},
        "encrypted_key":
         "6KB707dM9YTIgHtLvtgWQ8mKwboJW3of9locizkDTHzBC2IlrT1oOQ"}],
      "iv":
       "AxY8DCtDaGlsbGljb3RoZQ",
      "ciphertext":
       "KDlTtXchhZTGufMYmOYGS4HffxPSUrfmqCHXaI9wOGY",
      "tag":
       "Mz-VPPyU4RlcuYv1IwIvzw"
     }
}

JWE has disadvantages. When writing to Kafka, we cannot just write the bytes. We must write the base64 encoded bytes. This will increase network, CPU, memory, and disk costs.

I get the point about performance, but I still like the idea of using JWE, as there is a complete specification for how to use it, consume it, and serialize it, and there are plenty of libraries ready to use.

My only concern is: how can I express the schema of the data? Reading the CloudEvents spec, it seems that dataschema should describe the data field, so in this case it would describe the JWE or JWS structure, which is not useful for the consumer. And if we use dataschema to describe the structure of the JWE/JWS payload (the event data after decryption), then it seems we are not following the CloudEvents spec.

I cannot find anything on the multipart/encrypted standard, which implies it is not widely adopted or used. Minimally, we'd need to create a new MIME type to represent the control information (i.e. alg/enc), e.g. `application/ce-encrypted`.

I'm not going to say this is a bad idea, given the performance issue, but I'm not sure whether defining a new spec for that is among the objectives of CloudEvents. I just wanted to understand how to accommodate the existing JWE standard within CloudEvents to make it work.

Best Regards.

jolivaSan avatar Sep 20 '22 09:09 jolivaSan

That all seems smart. What are your thoughts about non-JSON types? How do you encode XML or text, for example?

alexec avatar Sep 20 '22 15:09 alexec

That all seems smart. What are your thoughts about non-JSON types? How do you encode XML or text, for example?

True .. starting with a JWE example has the potential to be a little misleading given the standard header structure which allows for contextual information to be included directly.

If you start from a non-JWE position then you essentially end up potentially mirroring those contextual headers into a CE extension .. and the question then becomes "is that a road we want/need to go down".

I think we can provide some "best practice" commentary for JWE and how CE attributes might map into JWE headers.

JemDay avatar Sep 20 '22 16:09 JemDay

My only concern is: how can I express the schema of the data? Reading the CloudEvents spec, it seems that dataschema should describe the data field, so in this case it would describe the JWE or JWS structure, which is not useful for the consumer. And if we use dataschema to describe the structure of the JWE/JWS payload (the event data after decryption), then it seems we are not following the CloudEvents spec.

Let's think about how dataschema might be used. I can immediately think of a few ways:

  • Message validation (and automatic handling of invalid messages)
  • Change management, evolution, versioning
  • Metadata and lineage tracking, discoverability

When performing validation of the payload, we can't assume that we need to take additional steps before we can actually validate what we're interested in. For example, if consumers have no way of knowing if they first need to transform the message (i.e. decrypt content inside the payload after deserializing) before they can provide their expected functionality, then it's hard for them to trust that they can safely use the schema from dataschema without risk of it blowing up. Additionally, there still needs to be a way to validate the payload content prior to further operations on it (like decryption) to prevent situations like an invalid cipher or issues with any of the values used to inform subsequent processing steps.

Typically, it would be a code smell if we needed to provide a chain of instructions to the next consumer where each instruction involved a transformation and a post-transformation schema (unless perhaps we're constructing a DAG or something like that.) However, what's interesting about the encryption/decryption case is that it's special in that there's a second layer that must be handled by the exact target that's receiving the message since it's one of the few cases where it's not actually appropriate to say "subsequent processing should be handled by a downstream function" since we don't want to force consumers to need to transmit unencrypted content just to perform validation (or otherwise use the schema) for the content they actually care about.

However, we don't want to change expectations of the dataschema attribute. So, the question then is if there should be a new attribute or extension for cases that need to provide the schema of the inner part of the encrypted message. Could cases exist where someone would want a more general attribute that could be used for any case where they want an internal schema (and not just for encryption/decryption)? I actually observed a requirement like that in a previous project, but when I inspected it closely, I discovered it was a horrible antipattern. The idea of allowing producers to only provide a schema for an outer part of their message is actually quite dangerous since it can defer major design issues until they're really deep inside consumer processing pipelines. So, I like the idea of having an attribute that's explicitly for encryption/decryption.

devinbost avatar Sep 21 '22 18:09 devinbost

@devinbost I took it upon myself to define such an extension, but this will probably only happen at the meeting next week.

sasha-tkachev avatar Sep 21 '22 18:09 sasha-tkachev

I think we need to give the different approaches names so we can discuss them.

JWE Approach

{
  "specversion": "1.0",
  "source": "my-app",
  "id": "abc123",
  "type": "message.v1",
  "datacontenttype": "application/jwt",
  "data": "xxx.yyy.zzz"
}

Header Extension Approach

{
  "specversion": "1.0",
  "source": "my-app",
  "id": "abc123",
  "type": "message.v1",
  "datacontenttype": "application/octet-stream",
  "encalg": "RSA-OAEP",
  "encenc": "A256CBC-HS512",
  "enckid": "abdd85",
  "encdct": "application/json",
  "data_base64": "abc123"
}

Multipart Encrypted Approach

{   
  "specversion": "1.0",
  "source": "my-app",
  "id": "abc123",
  "type": "message.v1",
  "datacontenttype": "multipart/encrypted",
  "data_base64": "abc123"
}

https://docs.jboss.org/resteasy/docs/3.0.4.Final/userguide/html/ch41.html

alexec avatar Sep 23 '22 00:09 alexec

After some more analysis, the only approach that will work well for us is the Header Extension Approach. Why?

  • Any approach that puts control information into data means that you need to do an additional copy whenever you encrypt or decrypt.
  • Likewise, encoding the payload into base64 adds another copy operation and, worse still, inflates payload sizes by about 33% (closer to 37% when MIME line breaks are added).
  • That is 2 extra copy operations, which you’ll do once to encrypt and once to decrypt: 4 extra copies in total.
  • Base64 does not compress well, limiting the utility of provider encryption (like Kafka’s encryption) and further damaging performance.
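
The size cost of base64 can be checked with a few lines of Python. This is a sketch using random bytes as a stand-in for ciphertext: plain base64 (no line breaks) maps every 3 input bytes to 4 output characters, so the overhead is one third.

```python
import base64
import os


def base64_overhead(payload: bytes) -> float:
    """Return the fractional size increase caused by base64-encoding payload."""
    encoded = base64.b64encode(payload)
    return len(encoded) / len(payload) - 1


# Random bytes stand in for ciphertext; 3,000,000 is a multiple of 3, so no padding.
sample = os.urandom(3_000_000)
print(f"{base64_overhead(sample):.1%}")
```

The figure is independent of the content: it depends only on the payload length, with a few extra padding characters when the length is not a multiple of 3.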

For the Header Extension mechanism we must have one key field: the encryption algorithm, and that should be the same everywhere. I suggest all headers prefixed with enc are considered encryption headers. We want small headers to save space, so encalg should be the algorithm. The other header we need is the underlying data content type, e.g. encdct. This means the algorithms don’t need to know the content type (check: am I right?), which simplifies them:

decrypt(headers:map<string,string>, encrypted_data:bytearray) -> (value:bytearray)
encrypt(value:bytearray) -> (headers:map<string,string>, encrypted_data:bytearray)

You’ll note that I talk in terms of “headers”, not “context attributes”. This is because encryption must happen before we materialize the event from the event bus to get the benefits. Once it is in a CE, the data must go into the data_base64 field, re-introducing the problem.

alexec avatar Sep 24 '22 23:09 alexec

Another note: because encryption essentially happens outside the CE, and none of the encryption headers or data make it into the event itself, I think the work we have is defining how this should happen, but I don’t think there should be any work needed in any SDKs. It is essentially orthogonal.

alexec avatar Sep 24 '22 23:09 alexec

@alexec I don't understand why putting the serialized JWE inside the data is not efficient. Either way you need to copy the content of the JWE metadata somehow inside the event

I'm looking into a 4th approach, which is a mix of the JWE approach and the header extension approach.

What is the problem with

{
  "specversion": "1.0",
  "source": "my-app",
  "id": "abc123",
  "type": "message.v1",
  "datacontenttype": "application/jwe",
  "data": "xxx.yyy.zzz",
  "cryptdatacontenttype": "application/json",
  "cryptdataschema": "http://dummy.com/my-schema.json"
}
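
A minimal sketch of how a consumer might use these attributes, assuming the proposed (not standardized) names cryptdatacontenttype and cryptdataschema. `inner_payload_metadata` is a hypothetical helper, not part of any CloudEvents SDK.

```python
def inner_payload_metadata(event: dict) -> dict:
    """Pick the attributes that describe the payload after decryption."""
    encrypted = "cryptdatacontenttype" in event or "cryptdataschema" in event
    if encrypted:
        # The crypt* attributes describe the plaintext; datacontenttype
        # only describes the encrypted wire format.
        return {
            "encrypted": True,
            "contenttype": event.get("cryptdatacontenttype"),
            "schema": event.get("cryptdataschema"),
        }
    return {
        "encrypted": False,
        "contenttype": event.get("datacontenttype"),
        "schema": event.get("dataschema"),
    }


event = {
    "specversion": "1.0",
    "source": "my-app",
    "id": "abc123",
    "type": "message.v1",
    "datacontenttype": "application/jwe",
    "data": "xxx.yyy.zzz",
    "cryptdatacontenttype": "application/json",
    "cryptdataschema": "http://dummy.com/my-schema.json",
}
print(inner_payload_metadata(event))
```

With this split, dataschema keeps its existing meaning, and validation against the business schema runs only after decryption.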

sasha-tkachev avatar Sep 26 '22 15:09 sasha-tkachev

I have drafted an extension for this (#1090) Please take a look @alexec

sasha-tkachev avatar Sep 26 '22 16:09 sasha-tkachev

@alexec I don't understand why putting the serialized JWE inside the data is not efficient. Either way you need to copy the content of the JWE metadata somehow inside the event

The key insight I needed was I had to think about how the binding would encrypt or decrypt data if you put the control headers into data:

  1. Plaintext comes in data (e.g. string).
  2. Encrypted into ciphertext (1x copy).
  3. Create a structured data object with the control headers (make this a JSON object).
  4. Encode the ciphertext into a data.ciphertext (1x copy)
  5. Write the data into Kafka (another marshaling? another copy?).

Now think about what happens when you encrypt within the binding using the Header Extension approach:

  1. Plaintext comes in data (e.g. string).
  2. Write headers to Kafka (including encryption control headers).
  3. Write ciphertext to Kafka.

You remove 2 or even 3 copy operations each time you marshal or unmarshal doing this.
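
The two flows above can be sketched as follows. This is an illustration only: no Kafka client or real cipher is used, the "record" is a plain (headers, value) pair, the ciphertext is placeholder bytes, and both function names are hypothetical.

```python
import base64
import json


def envelope_record(ciphertext: bytes, control: dict) -> bytes:
    """Control information inside data: the ciphertext must become base64 text."""
    body = dict(control, ciphertext=base64.b64encode(ciphertext).decode("ascii"))
    return json.dumps(body).encode("utf-8")


def header_extension_record(ciphertext: bytes, control: dict):
    """Control information in transport headers: the ciphertext is written as-is."""
    headers = [(name, value.encode("utf-8")) for name, value in control.items()]
    return headers, ciphertext


control = {
    "encalg": "RSA-OAEP",
    "encenc": "A256CBC-HS512",
    "enckid": "abdd85",
    "encdct": "application/json",
}
ciphertext = bytes(range(256)) * 4  # placeholder bytes, not real ciphertext

envelope = envelope_record(ciphertext, control)
headers, value = header_extension_record(ciphertext, control)
print(len(ciphertext), len(envelope), len(value))
```

In the header variant the record value is the same bytes object that came out of the cipher, so no re-encoding or extra buffer is needed for the payload itself.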

alexec avatar Sep 26 '22 17:09 alexec

@alexec I think we are trying to solve different problems here, and I must say that I'm a novice to JWE.

The problem I'm trying to solve is to provide a way for encrypted events to leverage the power of the datacontenttype and dataschema attributes, because without this extension they are useless to the user when the data is encrypted: datacontenttype will hold application/octet-stream and dataschema will probably not exist.

The problem you are trying to solve is to map the JWE headers onto CloudEvent attributes for efficiency and routing reasons.

I think your problem is worth solving, but it is not general enough to put in a general "cryptography" extension. I propose another extension, which maps JWE headers onto CloudEvent attributes, in addition to the cryptography extension.

sasha-tkachev avatar Sep 26 '22 19:09 sasha-tkachev

That is an interesting point, and I think you're probably correct.

alexec avatar Sep 26 '22 20:09 alexec

JWT extension proposed under #1102

sasha-tkachev avatar Oct 14 '22 22:10 sasha-tkachev

Given the work going on in the mentioned PRs - do we still need this issue open? Any objection to closing it?

duglin avatar Jan 19 '23 14:01 duglin

This issue is stale because it has been open for 30 days with no activity. Mark as fresh by updating e.g., adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Feb 19 '23 01:02 github-actions[bot]

@alexec @jolivaSan @JemDay does anyone want to follow-up on this one? I'm going to suggest we close it and we can re-open it if someone thinks we need to revisit the topic.

duglin avatar Sep 21 '23 12:09 duglin

Agreed on the 9/21 call to close this issue. If someone would like to reopen it and push it forward please let us know.

duglin avatar Sep 21 '23 16:09 duglin