Polykey Vault and File Schema - Ingress and Egress Schemas

Specification

Egress Schema

Schema

The schema is a .json.schema file which can have an arbitrary name. A schema file should be able to import other schema files to create a hierarchy of schemas, minimising repetition. This also allows schema composition: sub-schemas, such as one for credit cards and one for login information, can be composed to make an AWS credential schema. The ingress schema must be present in the vault and the egress schema must be present outside it.
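
A minimal sketch of the composition idea, assuming sub-schemas are plain JSON Schema objects with `properties` and `required`. The schema contents, the field names, and the `composeSchemas` helper are all hypothetical and only illustrate merging; in real JSON Schema files the same effect would more likely come from the standard `$ref` and `allOf` keywords.

```typescript
// Minimal subset of JSON Schema used in this sketch.
type ObjectSchema = {
  type: "object";
  properties: Record<string, { type: "string" | "number" }>;
  required: string[];
};

// Hypothetical sub-schemas; the field names are illustrative only.
const loginSchema: ObjectSchema = {
  type: "object",
  properties: { username: { type: "string" }, password: { type: "string" } },
  required: ["username", "password"],
};

const creditCardSchema: ObjectSchema = {
  type: "object",
  properties: { cardNumber: { type: "string" }, expiry: { type: "string" } },
  required: ["cardNumber", "expiry"],
};

// Merge several sub-schemas into one, refusing conflicting definitions.
function composeSchemas(...parts: ObjectSchema[]): ObjectSchema {
  const composed: ObjectSchema = { type: "object", properties: {}, required: [] };
  for (const part of parts) {
    for (const [name, definition] of Object.entries(part.properties)) {
      const existing = composed.properties[name];
      if (existing !== undefined && existing.type !== definition.type) {
        throw new Error(`Conflicting definitions for property "${name}"`);
      }
      composed.properties[name] = definition;
    }
    for (const name of part.required) {
      if (!composed.required.includes(name)) composed.required.push(name);
    }
  }
  return composed;
}

// An AWS credential schema built from the two sub-schemas above.
const awsCredentialSchema = composeSchemas(loginSchema, creditCardSchema);
```

Keeping the sub-schemas separate means a change to the login shape is made once and picked up by every schema composed from it.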

The library https://github.com/bcherny/json-schema-to-typescript can be used to generate TypeScript interfaces from a JSON schema file. However, this particular library requires the project to be ESM, which Polykey currently is not. js-db and js-workers need to be migrated before Polykey can be migrated to ESM too.

The selected library cannot use arbitrary extensions for the schema, so the schema file name must be restricted accordingly. Moreover, the schema must itself be a valid schema, otherwise the command will error out.

Ingress

Ingress schemas control inputs and mutations to vaults. Mutations can all be considered inputs. They thus maintain guarantees on how the vault can change over commits. These schemas are saved along with the vaults and are shared with them too.

If a schema is specified, the incoming data must conform to the requirements specified by that schema. If the new data does not conform, an error is thrown. If the schema does not allow the file to exist, another error is thrown. In short, the schema ensures the secrets inside a vault are maintained with respect to a schema.

That last statement is not fully correct, as secrets which already exist in a vault will not be affected. The schema implementation only affects new incoming data, and only when the schema flag is used. So the schema does not describe the state of a vault, but rather describes constraints that can optionally be applied to incoming data.

Implementing this would be challenging as it would need the implementation of a pre-write or a post-write and pre-commit hook which applies this schema. However, this can be worked around where secret data is streamed, as the data can be checked before being written to the EFS.
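
The check-before-write workaround could look roughly like the sketch below. Everything here is an assumption made for illustration: the `IngressSchema` shape, the `writeSecret` function, and the in-memory map standing in for the EFS are hypothetical and are not Polykey APIs. Conformance is reduced to "the path is allowed" and "JSON secrets parse".

```typescript
// Hypothetical ingress schema: allowed secret path -> expected media type.
type IngressSchema = Record<string, "text/plain" | "application/json">;

// In-memory stand-in for the vault's encrypted file system (EFS).
const vaultFiles = new Map<string, string>();

// Validate incoming data first, and only then write it.
function writeSecret(path: string, content: string, schema?: IngressSchema): void {
  if (schema !== undefined) {
    const mediaType = schema[path];
    if (mediaType === undefined) {
      throw new Error(`Schema does not allow a secret at "${path}"`);
    }
    if (mediaType === "application/json") {
      try {
        JSON.parse(content);
      } catch {
        throw new Error(`Secret "${path}" is not valid JSON`);
      }
    }
  }
  // Only reached when the data conforms or when no schema was requested.
  vaultFiles.set(path, content);
}
```

Because the schema is an optional argument, secrets that already exist and writes made without the schema flag are left untouched, which matches the behaviour described above.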

Egress

Egress schemas are interpreted by Polykey during egress commands. They allow the user to document, in a machine-parseable and human-readable way (which is why JSON schema suits this best), what the expected egress output should look like. When Polykey performs an egress action, it can check whether what it is egressing matches what is expected, fail if it doesn't, and explain to the user why it is failing.

This is useful because it allows Polykey to provide a dynamic fine-grained access control context, where the vault path references are resolved contextually, so that gestalt A's vault contents are not necessarily the same as gestalt B's vault contents. This allows easier separation of dev, staging, and prod, as well as per-user secrets.

The main reason for the existence of egress schemas is to enforce the Principle of Least Authority. A vault containing secrets for a repository doesn't need to export all at once. The development workflow might need only half the secrets, and the production might need the other half. Constantly exporting all secrets could be a source of authority leaks. As such, a schema would restrict the exported secrets to only the ones specified, tightening the security as much as possible around the egress points.

This will probably be the most common use-case of the schemas. It would ensure that data egresses in a particular format, only for the files specified, and only in the specified types. If a secret does not match the specified type, an error would be thrown. Perhaps we could attempt a conversion, like converting a number to a string, but that would be inconsistent, so it should instead throw errors.

Thanks to egress schemas, a command runs normally if the available secrets comply with the specified schema, and fails to run if they do not. This ensures that either the environment is correctly set up with the secrets, or the command fails. Of course, this setup will be correct only if the schema was correct in the first place.
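
To make the egress behaviour concrete, here is a small sketch under stated assumptions: the `EgressSchema` shape (secret name mapped to an expected type), the `selectForEgress` function, and the secret names are hypothetical. It exports only the secrets named in the schema, performs no type conversion, and reports every mismatch in the error so the user can see why the egress failed.

```typescript
// Hypothetical egress schema shape: secret name -> type it must have on egress.
type EgressType = "string" | "number";
type EgressSchema = Record<string, EgressType>;

function selectForEgress(
  schema: EgressSchema,
  available: Record<string, unknown>,
): Record<string, string | number> {
  const problems: string[] = [];
  const selected: Record<string, string | number> = {};
  for (const [name, expected] of Object.entries(schema)) {
    const value = available[name];
    if (value === undefined) {
      problems.push(`${name}: missing, expected ${expected}`);
    } else if (typeof value !== expected) {
      // No conversion is attempted; a type mismatch is an error.
      problems.push(`${name}: expected ${expected}, got ${typeof value}`);
    } else {
      selected[name] = value as string | number;
    }
  }
  if (problems.length > 0) {
    throw new Error(`Egress schema check failed: ${problems.join("; ")}`);
  }
  return selected;
}

// A development workflow that needs only part of the vault.
const devSchema: EgressSchema = { DATABASE_URL: "string", PORT: "number" };
const vaultSecrets = {
  DATABASE_URL: "postgres://localhost/dev",
  PORT: 5432,
  PROD_API_KEY: "not needed in dev",
};
const devEnv = selectForEgress(devSchema, vaultSecrets);
```

Here `devEnv` holds only `DATABASE_URL` and `PORT`; `PROD_API_KEY` never leaves the vault because the development schema does not name it.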

The focus should be on egress points first, as that will be useful sooner than ingress points, and is also easier to implement and test out.

Additional context

Also see zeta.house and Zeta-House-Docs for latest usage of composed schemas.

The following links point to the legacy codebase on GitLab, which does not exist anymore

Tasks

Develop the JSON structure of schema
Apply validation logic of schema to the vault contents at ingress points
Apply schema enforcing at egress points
Integrate into creation of vault

Aug 08 '21 11:08 CMCDragonkai

I've been writing up some of my own thoughts on vault and file schemas in our MR for vaults refactoring https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/205#note_689647018. I'll synthesise my thoughts, and bring them into here for discussion.

Sep 29 '21 07:09 joshuakarp

When considering vault schemas, I've been thinking about what the "intention" of a vault is. I've thought of a few different approaches to this:

1. A "relational database"-like structure

We store secrets of a specific structure, and enforce that all secrets within this vault follow this specific structure (with the possibility for optional fields).

This would mean that the structure of the secret itself is dependent on the schema of the vault. That is, the individual components of some composite secret are structured at the vault level.

For example, suppose we had a vault schema represented like a JSON as follows:

{
  "label": {
    "/mediatype": "text/plain"
  },
  "url": {
    "/mediatype": "text/plain"
  },
  "username": {
    "/mediatype": "text/plain"
  },
  "password": {
    "/mediatype": "text/plain"
  },
  "note": {
    "/mediatype": "text/plain"
  }
}

Then, with this relational database-like structure of the vault, our vault would appear as follows:

label	url	username	password	note
amazon	amazon.com.au	user1	password1	my amazon login
twitter	twitter.com.au	user1	password1	my twitter login
...	...	...	...	...

2. A specified collection of secrets

We store a specific set of secrets within our vault. I see this more like a directory of files, where we specify a list of secrets that must be found in the vault (could also limit some of these secrets as optional).

For example, suppose we wanted a vault that stored all the sensitive information required for onboarding an employee at Matrix AI. We could have a JSON schema as follows:

{
  "toggl-username": {
    "/mediatype": "text/plain"
  },
  "toggl-password": {
    "/mediatype": "text/plain"
  },
  "zoho-email": {
    "/mediatype": "text/plain"
  },
  "zoho-password": {
    "/mediatype": "text/plain"
  },
  "aws-access-key": {
    "/mediatype": "text/plain"
  }
}

Then, our vault would appear as follows:

id	secret
toggl-username	amazon.com.au
toggl-password	password1
zoho-email	[email protected]
zoho-password	password1
aws-access-key	abcd1234

3. "Secret" schemas

This third option shies away from the idea of enforcing the structure of the secret at the vault level. Instead, we create schemas that specify the structure of a secret.

For example, a schema for a login secret (same as the vault schema from option 1):

{
  "label": {
    "/mediatype": "text/plain"
  },
  "url": {
    "/mediatype": "text/plain"
  },
  "username": {
    "/mediatype": "text/plain"
  },
  "password": {
    "/mediatype": "text/plain"
  },
  "note": {
    "/mediatype": "text/plain"
  }
}

Or a schema for a credit card secret (if possible, mediatype could potentially be restricted to numerical, etc):

{
  "label": {
    "/mediatype": "text/plain"
  },
  "cardholder-name": {
    "/mediatype": "text/plain"
  },
  "card-number": {
    "/mediatype": "text/plain"
  },
  "ccv": {
    "/mediatype": "text/plain"
  },
  "expiry": {
    "/mediatype": "text/plain"
  }
}

Then, on a vault level, the user chooses which type of secret they'd like to add to the vault (e.g. login, credit card, etc). This could be an unrestricted add, whereby any kind of secret can be added to the vault.

There's also the potential to incorporate vault schemas here as well, where we specify the specific set of secrets that we expect to be stored in a vault. This would be the same way that we do it in option 2 - only this time, we have rigid schemas for the secrets to be added.

For example, we could then have a vault schema for Matrix AI onboarding:

{
  "toggl": {
    "/secretschema": "login"
  },
  "zoho": {
    "/secretschema": "login"
  },
  "aws": {
    "/secretschema": "aws-credentials"
  }
}

And individual vaults can be created for each team member as deemed fit.

Sep 29 '21 07:09 joshuakarp

My perspective on these 3 options:

1. This doesn't make a huge amount of sense to me. Given that a big part of Polykey is being able to share vaults, we don't want to share an entire "database" of secrets. This would also mean that we'd need to create and share multiple vaults where we have different kinds of secrets (for example, onboarding an employee), and we'd likely have a conglomerate of vaults for lots of different purposes.
2. I don't feel like this is structured enough. There's too much flexibility for the user. That is, a user shouldn't have to think too much about the structure of the secrets that they need to store. We also have a repetition of structure, where we have username fields that don't share a common type. It also opens up the vault to consistency issues if we introduce optional fields. For example, we shouldn't be able to store a "username" secret without having a corresponding "password" secret. If the optional field setting is left to the user, we have potentially illogical storage.
3. This seems like the most balanced option to me, where we have a balanced degree of flexibility and structure. We no longer have consistency issues (like we do in option 2) because we have specific schemas for the secrets, and the vault can be as flexible or rigid as the user decides. This makes sense, given that users want to be able to share these vaults for different purposes.

Sep 29 '21 07:09 joshuakarp

Vault schemas can be nested.

{
  "dir1": {
    ...
  },
  "dir2": {
    ...
  }
}

We have to differentiate directories from files. This could be done with the / character, since it is not allowed in file names.

Sep 29 '21 07:09 CMCDragonkai

So this means a directory would also have its own vault schema applied to it? For example, we could have a vault schema which specifies some files and a directory, and this directory would specify another vault schema?

Sep 29 '21 23:09 joshuakarp

So I found some more discussion hidden away in a comment on one of the mock-ups: https://gitlab.com/MatrixAI/Engineering/Polykey/polykey-design/-/issues/40/designs/Vault_Schema.png?version=163940

Notably, the following example was given for a vault schema for storing a username and password inside a directory:

{
  "dirA": {
    "username": "text/plain",
    "password": "text/plain"
  }
}

This would create a vault with a directory structure like:

/dirA
/dirA/username
/dirA/password
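
The mapping from this schema to the listing above can be sketched as a short recursive walk. This assumes the simplified form used in the example, where a string value is a file's media type and a nested object is a directory; the `DirectorySchema` type and `expandPaths` function are hypothetical names for illustration.

```typescript
// A directory schema maps names to a media type (file) or a nested schema (directory).
type DirectorySchema = { [name: string]: string | DirectorySchema };

// List every path the schema implies, directories first, then their contents.
function expandPaths(schema: DirectorySchema, prefix = ""): string[] {
  const paths: string[] = [];
  for (const [name, entry] of Object.entries(schema)) {
    const path = `${prefix}/${name}`;
    paths.push(path);
    if (typeof entry !== "string") {
      // Nested object: treat it as a directory and recurse into it.
      paths.push(...expandPaths(entry, path));
    }
  }
  return paths;
}

const dirASchema: DirectorySchema = {
  dirA: { username: "text/plain", password: "text/plain" },
};
const dirAPaths = expandPaths(dirASchema);
// dirAPaths is ["/dirA", "/dirA/username", "/dirA/password"]
```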

But what if we want to have a vault that just has a username and password in the root directory (with no extra directory)? Then, the user needs to create a brand new schema for this:

{
  "username": "text/plain",
  "password": "text/plain"
}

There's an unnecessary duplication of data here. The "username" and "password" fields between the schemas don't have any relation to each other (they're just labels for a chunk of text). That is, there's no indication (besides the label) that they're both storing the same kind of secret. Similarly, the user now has 2 vault schemas to manage which are doing very similar things.

I feel that option 3 from above is an improvement over this approach, but I'm interested to discuss this.

Sep 30 '21 00:09 joshuakarp

Vault schemas are just directory schemas.

Sep 30 '21 00:09 CMCDragonkai

These would be 2 different schemas so they are independent.

Sep 30 '21 00:09 CMCDragonkai

Had a quick discussion with Roger about this. Some clarifications:

We need to remember that a "vault" can essentially be seen as a directory on the filesystem, where we store secrets (files), and embed version control inside it. As such, a directory inside a vault can analogously be seen as a nested vault.

Therefore, a vault schema is just a description for a directory. The vault schema should be minimal and flexible to reflect this filesystem structure.

For example, our username and password vault schema:

{
  "username": "text/plain",
  "password": "text/plain"
}

The vault is then expected to contain exactly these 2 text files.

Note, we can eventually utilise native features from JSON schemas to increase the expressive power of our schemas without requiring a lot of work (such as the strict flag for loosening the schema: providing optional fields, or for specifying that a schema can have additional elements).
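
As far as I know, JSON Schema has no keyword literally named strict; the closest native keywords for the loosening described here are `required` (for optional fields) and `additionalProperties` (for extra elements). The sketch below is a hand-rolled check over a directory listing, not a real JSON Schema validator, and the `VaultSchema` type, `checkListing` function, and the optional `note` file are assumptions for illustration.

```typescript
// Subset of JSON Schema keywords relevant to loosening a vault schema.
type VaultSchema = {
  properties: Record<string, { contentMediaType: string }>;
  required: string[]; // files that must exist
  additionalProperties: boolean; // whether unlisted files are tolerated
};

// Return a list of problems; an empty list means the listing conforms.
function checkListing(schema: VaultSchema, fileNames: string[]): string[] {
  const errors: string[] = [];
  for (const name of schema.required) {
    if (!fileNames.includes(name)) errors.push(`missing required file "${name}"`);
  }
  if (!schema.additionalProperties) {
    const known = Object.keys(schema.properties);
    for (const name of fileNames) {
      if (!known.includes(name)) errors.push(`unexpected file "${name}"`);
    }
  }
  return errors;
}

const loginVaultSchema: VaultSchema = {
  properties: {
    username: { contentMediaType: "text/plain" },
    password: { contentMediaType: "text/plain" },
    note: { contentMediaType: "text/plain" },
  },
  required: ["username", "password"], // "note" is optional
  additionalProperties: false, // nothing beyond the three listed files
};
```

With `additionalProperties` set to `false` and every property listed in `required`, this reduces to the "exactly these files" behaviour; relaxing either keyword loosens the schema without any custom syntax.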

Eventually, from the GUI's perspective, these schemas would be used to generate a form to create a vault. This would also mean we could use the properties from the JSON schema to enforce the validation logic at this level.

Similarly, note that a vault doesn't necessarily need to have a schema applied to it. For example, we could have an "unrestricted" vault (with no schema applied) that contains a collection of directories, with each of these directories having a different kind of schema applied to it.

Additionally, for a cloned vault, we'd need to consider whether we also clone the schema. The answer here is most likely yes.

Finally, schemas should be identified with a name and/or ID.

While vault schemas can be user-defined, we should also have some native schemas for users (for example, login, credit card, etc).

As a side note regarding this, does this mean a "secret" is just one file of this schema? If that's the case, then we'd end up having multiple secrets that are parts of a "composed" secret. For example, for a credit card, we'd need to store 4 different secrets: cardholder name, card number, expiry date, CCV.
Right now on the CLI, we have a secrets add command. This means to add a credit card number, we'd need to make 4 separate calls to the CLI.
I suppose we could have the opportunity here to introduce "porcelain" commands. As a rough example, could be secrets add credit-card <cardholder name> <card number> <expiry date> <ccv>.
However, we'd also have a vault schema for a credit card. When adding a secret, how do we make this connection to the vault schema? It doesn't make any sense to only add a CCV number without the other card details, for example.

Sep 30 '21 02:09 joshuakarp

As per discussion in https://github.com/MatrixAI/Polykey-CLI/issues/296#issuecomment-2452546309, it turns out we can implement an idea I call egress schemas. These are separate from vault and secret schemas: vault and secret schemas maintain guarantees on the content of a singular vault, whereas egress schemas relate to the application egress point.

In a way one can say that vault/secret schemas are "ingress schemas" since they are applied on ingress into PK.

Whereas env schemas are "egress schemas" since they are applied on egress out of PK.

This means that ingress schemas are carried along inside the PK system.

Whereas egress schemas are carried by an external system such as the git repository.

Nov 01 '24 20:11 CMCDragonkai

Another thing to realise here is that the usage of references like op://a/b/c, as in https://blog.1password.com/delete-your-example-env-file/, is a common idea. However, I believe this is a control-plane concern, and shouldn't necessarily be placed into the application repositories, as it makes too many assumptions about how the application may be used.

It does not allow for dynamic fine-grained access control context, since it basically means that everybody and every context has to have the same secret set, as the references are unique globally with respect to a central authority holder.
It's not really that different from our vault paths which are just vaultX:/A/B/C, we just didn't use URIs. URIs could also be used here like ARNs, and just be like pk://vaultX/A/B/C where vaultX is acting like a "hostname" with respect to the pk protocol.
Such pull flows would work best at the control plane.
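
To illustrate how close the two reference forms are, here is a sketch of a parser that accepts both. The pk:// scheme is only the suggestion made above, and the `VaultReference` type and `parseVaultReference` function are hypothetical; nothing here is an existing Polykey API.

```typescript
type VaultReference = { vaultName: string; secretPath: string };

function parseVaultReference(reference: string): VaultReference {
  // Matches either "pk://vaultX/A/B/C" or "vaultX:/A/B/C".
  const match =
    /^pk:\/\/([^/:]+)(\/.*)$/.exec(reference) ??
    /^([^/:]+):(\/.*)$/.exec(reference);
  if (match === null) {
    throw new Error(`Not a vault reference: "${reference}"`);
  }
  return { vaultName: match[1], secretPath: match[2] };
}

// Both forms resolve to the same vault name and secret path.
const fromUri = parseVaultReference("pk://vaultX/A/B/C");
const fromVaultPath = parseVaultReference("vaultX:/A/B/C");
```

In both cases the vault name plays the role of the "hostname" and the remainder is the path inside that vault.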

Nov 01 '24 20:11 CMCDragonkai