slsa Standardize `externalParameters` for CI/CD builds

There is a desire to standardize the externalParameters for CI/CD buildTypes. Right now every CI/CD system defines its own buildType, but we're starting to see a common pattern. Every CI/CD system seems to have the following model, and it would be nice if they shared a common schema so that consumers can handle them the same way.

Strawman schema

Based on GitHub Actions and Google Cloud Build (GCB):

buildConfigSource / workflow = Reference to the top-level build configuration, for cases when the build platform resolved and fetched the build configuration from a source repository. Consists of (not necessarily separate fields):
- type = Type of source repository: git, hg, oci, etc.
- repository = URL of the source repository.
- ref / label / version = Label or reference within the repository to resolve to a specific artifact (commit, image, etc.). Could be a mutable label or an immutable digest—whichever the tenant specified.
- path / entryPoint / target = The file or target label within the resolved artifact to find the top-level build configuration.
buildConfig = Inlined top-level build configuration, for cases when it is provided directly by the caller. (Mutually exclusive with buildConfigSource.)
sourceToBuild = Source artifact to be initially checked out, for cases when it is an independent input (and different) from buildConfigSource / buildConfig. This might be unique to Google Cloud Build; I'm unaware of other platforms that do this. But it is an important input (more important than parameters) which is why I call it out. Consists of:
- repository
- ref / label / version
parameters = Additional independent parameters beyond those above. Examples:
- inputs (GitHub Actions) / params (Tekton) / substitutions (Google Cloud Build) = parameters provided by the user via some UI / CLI / API
- vars (GitHub Actions) = variables passed in via the repository or organization
- directory (Google Cloud Build) = initial working directory in which to start the build
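
To make the strawman concrete, here is a rough sketch of what externalParameters might look like for a GitHub Actions build and a GCB build, written as Python dicts. The repository URL, workflow path, and parameter values are made up, the field names are just the first alternative from each line above, and the helper strawman_problems is hypothetical. Treating unknown top-level fields as a problem is my assumption, not something the strawman requires.

```python
# Illustrative only: field names follow the strawman list above and are not final.
# The repository URL, workflow path, and parameter values are made up.
github_actions_params = {
    "buildConfigSource": {
        "type": "git",
        "repository": "https://github.com/example/project",
        "ref": "refs/heads/main",
        "path": ".github/workflows/release.yml",
    },
    "parameters": {
        "inputs": {"version": "1.2.3"},
        "vars": {"CHANNEL": "stable"},
    },
}

gcb_params = {
    # Inlined config, so there is no buildConfigSource.
    "buildConfig": {
        "steps": [{"name": "gcr.io/cloud-builders/docker", "args": ["build", "."]}]
    },
    "sourceToBuild": {
        "repository": "https://github.com/example/project",
        "ref": "v1.2.3",
    },
    "parameters": {"substitutions": {"_ENV": "prod"}, "directory": "src"},
}

STRAWMAN_FIELDS = {"buildConfigSource", "buildConfig", "sourceToBuild", "parameters"}


def strawman_problems(external_parameters):
    """Return a list of ways the dict deviates from the strawman (empty if none)."""
    problems = []
    if "buildConfigSource" in external_parameters and "buildConfig" in external_parameters:
        problems.append("buildConfigSource and buildConfig are mutually exclusive")
    # Assumption: anything not named in the strawman belongs under "parameters".
    for name in sorted(set(external_parameters) - STRAWMAN_FIELDS):
        problems.append("unexpected field: " + name)
    return problems
```

Both example dicts pass the check; a dict that sets both buildConfigSource and buildConfig would not.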

Some design considerations

How should we indicate that the provenance conforms to this common schema, while still indicating how to interpret (i.e. is it a GitHub Actions YAML vs Google Cloud Build YAML vs ...)?
- Separate buildType per platform (status quo) and use duck typing to indicate that it fits the common schema. (Seems fragile?)
- Separate buildType per platform (status quo) + new buildTypeCategory = cicd (we could define others in the future; field name and value TBD). I'm leaning towards this.
- One common buildType + new buildSubtype per platform.
How many fields should we use for buildConfigSource? We could merge some or all of the fields together into a single URI, but that comes with some trade-offs, including:
- Extensibility
- Alignment with resolvedDependencies
- Ease of construction and parsing
- Ambiguity
What field names should we use that make sense for a wide array of CI/CD systems?
Does this fit most CI/CD systems well? We'll need to do a broad survey to make sure.
Should parameters be standardized or type-specific? Maybe define standard names but allow other ones?
Where do we define this?
- Within the existing Provenance spec (my inclination)
- As a separate page under slsa.dev
- As a separate git repo
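
On the first consideration (duck typing vs. an explicit marker), a small sketch of why duck typing seems fragile. Everything here is hypothetical: the buildType URIs are placeholders, and buildTypeCategory = cicd is only the tentative name and value mentioned above.

```python
COMMON_FIELDS = {"buildConfigSource", "buildConfig", "sourceToBuild", "parameters"}


def looks_like_cicd(build_definition):
    """Duck typing: guess conformance from the shape of externalParameters."""
    fields = set(build_definition.get("externalParameters", {}))
    return bool(fields) and fields <= COMMON_FIELDS


def declares_cicd(build_definition):
    """Explicit marker: trust a buildTypeCategory field (name and value TBD)."""
    return build_definition.get("buildTypeCategory") == "cicd"


conforming = {
    "buildType": "https://example.com/ci-platform/v1",  # placeholder URI
    "buildTypeCategory": "cicd",
    "externalParameters": {
        "buildConfigSource": {
            "repository": "https://github.com/example/project",
            "ref": "main",
            "path": "ci.yml",
        },
    },
}

# An unrelated build type that happens to use a "parameters" key with its own meaning.
unrelated = {
    "buildType": "https://example.com/other-builder/v1",  # placeholder URI
    "externalParameters": {"parameters": ["--release"]},
}
```

Here looks_like_cicd(unrelated) returns True even though that build type never claimed to follow the common schema, while declares_cicd(unrelated) returns False.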

Aug 04 '23 14:08 MarkLodato

I support this proposal, especially if we do a broad survey to make sure it fits most CI/CD systems. I hope we can easily answer that question for Tekton and Jenkins, where some level of SLSA provenance generation is already available in Tekton Chains and the (sadly inactive) slsa-jenkins-generator.

Addressing some of the comments:

IIUC sourceToBuild also fits Concourse CI.
I don't like the idea of duck-typing to indicate that provenance conforms to the common schema. Using buildTypeCategory as an interface definition feels like it would work.
On fields for buildConfigSource, is there a reason we wouldn't want to use ResourceDescriptor?
Field names could be informed by the broad survey of CI/CD systems; we can ask/assess whether the strawperson schema makes sense.
I think we should define standard names for parameters. I do like the idea of allowing others, but does that make it a separate buildTypeCategory?
Where to define? I think a common schema/interface should live alongside the spec and current provenance schema, maybe as a separate page if that helps readability/layout but within the existing spec seems fine.

Aug 04 '23 15:08 joshuagl

On fields for buildConfigSource, is there a reason we wouldn't want to use ResourceDescriptor?

I'm hesitant to allow most of those fields, particularly name, digest, annotations, and downloadLocation unless they actually come from the user. We had decided to split externalParameters from resolvedDependencies to make that distinction more clear. Otherwise it's unclear if, say, the user actually provided the digest or download location (and thus a policy needs to check it) vs it was what the build platform actually resolved to (and thus it's OK to ignore).

On the bright side, it might make some things more clean:

buildConfig → buildConfigSource.content (though we could always do this even without using ResourceDescriptor)
buildType → buildConfigSource.mediaType (though I'm not sure that's a good fit)

We'd also have to split out path to a separate field, but that seems acceptable to me.

Aug 04 '23 15:08 MarkLodato

I support this proposal, but I think I'm missing some historical context. IIUC, buildConfigSource and sourceToBuild are roughly equivalent to configSource and materials from provenance v0.2. Why did we remove them from v1.0? Is it worth considering adding fields directly to the provenance spec rather than deal with buildSubType or buildCategory? It's hard to assess whether it's worth adding complexity by expanding the type system without understanding why the simple solution didn't work in the past.

Aug 04 '23 18:08 kpk47

IIUC, buildConfigSource and sourceToBuild are roughly equivalent to configSource and materials from provenance v0.2. Why did we remove them from v1.0?

Great question.

First, buildConfigSource is indeed equivalent to configSource in v0.2 (or definedInMaterial in v0.1), but sourceToBuild did not exist in earlier versions.

There were two major changes for v1.0:

Cleanly separate externalParameters, internalParameters, and resolvedDependencies. Previously they were all mixed together: configSource contained both the actual external parameter (uri) as well as the resolved digest, the environment was at the same level as configSource and parameters, and materials was kind of its own thing. This led to misunderstanding and a lack of clarity on how the provenance was expected to be consumed.
Generically inform builders to record all externalParameters rather than specifically configSource (with entryPoint) and parameters. Across the board, almost no one interpreted v0.2 as intended. entryPoint was particularly confusing, while GCB had the concept of sourceToBuild which didn't fit at all. Even GitHub Actions which nominally did fit OK was confused by the naming. To solve this, we simplified the model.

Between these two changes, things seemed to "click" for implementers. They seemed to understand it better and implement it with fewer mistakes.
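
As a rough illustration of the first change, here is how a v0.2-style invocation could be pulled apart into the three v1.0 buckets. This is a sketch, not a migration tool: the function name and the sample values are made up, and mapping environment to internalParameters is my simplification.

```python
def split_v02_invocation(invocation):
    """Split a v0.2-style invocation into v1.0-style buckets (illustrative only)."""
    config_source = invocation.get("configSource", {})
    external = {
        # What the user actually asked for: the uri and entryPoint, without the digest.
        "configSource": {
            key: config_source[key] for key in ("uri", "entryPoint") if key in config_source
        },
        "parameters": invocation.get("parameters", {}),
    }
    resolved = []
    if "digest" in config_source:
        # What the platform resolved the request to.
        resolved.append({"uri": config_source.get("uri"), "digest": config_source["digest"]})
    return {
        "externalParameters": external,
        # Assumption: the v0.2 environment corresponds to internalParameters.
        "internalParameters": invocation.get("environment", {}),
        "resolvedDependencies": resolved,
    }


v02_invocation = {
    "configSource": {
        "uri": "git+https://github.com/example/project@refs/heads/main",  # made up
        "digest": {"sha1": "abcd1234"},  # truncated placeholder
        "entryPoint": ".github/workflows/release.yml",
    },
    "parameters": {"version": "1.2.3"},
    "environment": {"arch": "amd64"},
}
v1_buckets = split_v02_invocation(v02_invocation)
```

After the split, the digest appears only under resolvedDependencies, so a consumer can tell the requested uri from the resolved commit.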

The big difference between v0.2 configSource and the proposed buildConfigSource is:

Do not include the resolved digest.
Use better terminology that resonates with implementers and consumers, e.g. repository+ref+path instead of url+entryPoint.
Better define what the model is and what it means. (This was lacking in v0.2.)
Make it optional for builders that don't fit the CI/CD model.

Is it worth considering adding fields directly to the provenance spec rather than deal with buildSubType or buildCategory?

Yes, I think that is worth considering. However, the challenge is where to stick it without doing a major version bump. We need it to go in externalParameters, and v1 says that this is determined by the buildType. So our options are somewhat limited. I'm definitely open to more ideas though!

We could do a v2 with more invasive changes, but I don't think there's an appetite for that. My inclination is to stick to v1 and work around its quirks, and queue up a longer wish list before doing a v2.

Aug 04 '23 20:08 MarkLodato

I think this is a great idea!

The Securing Repos OpenSSF Working Group is encouraging all package registries to provide build provenance. Not all build properties have to be standardized, but having some that are consistently defined would make it easier for registries (like npm) to render information from multiple cloud CI/CD systems.

Aug 07 '23 19:08 steiza

Overall I'm supportive of this. At least for GitHub Actions, it has sometimes been hard to understand what should go where even in v1.0 (externalParameters vs internalParameters, though I think we've gotten it mostly right). We've also seen some very confused implementations, so I think giving folks implementing SLSA a roadmap or guide for generating provenance in a semi-standard format is a good idea.

A couple of general comments:

Separate buildType per platform (status quo) + new buildTypeCategory = cicd (we could define others in the future; field name and value TBD). I'm leaning towards this.

While I understand the need for this; the cat's mostly out of the bag wrt buildType, determining how to interpret the externalParameters was exactly what buildType was supposed to do. I wish we could come up with some kind of rather than adding yet-another-field we need to look at to understand how to interpret the provenance. I suppose that's the "One common buildType + new buildSubtype per platform" option. Other ideas:
- Maybe a ?_type=cicd query parameter type of thing?
- Some specs have a special field that indicates a type. Maybe something like _buildTypeCategory: cicd inside the externalParameters that implementations can look for?
I realize I'm grasping at straws a bit but I wish we could have something a bit cleaner.
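
To show what I mean by those two ideas, a quick sketch of how a consumer might read each marker. I'm assuming the query parameter would hang off the buildType URI; the URI itself and both function names are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def category_from_build_type(build_type):
    """Read a hypothetical ?_type=... query parameter from the buildType URI."""
    values = parse_qs(urlsplit(build_type).query).get("_type", [])
    return values[0] if values else None


def category_from_marker(external_parameters):
    """Read a hypothetical _buildTypeCategory field inside externalParameters."""
    return external_parameters.get("_buildTypeCategory")


with_query = category_from_build_type("https://example.com/ci-platform/v1?_type=cicd")
without_query = category_from_build_type("https://example.com/ci-platform/v1")
with_marker = category_from_marker({"_buildTypeCategory": "cicd", "parameters": {}})
```

Either way the consumer still has one extra thing to look at, but at least it lives in a place it already has to parse.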
There seem to be a lot of fields that are used depending on the provider, and some that seem fairly provider-specific.
- ref / label / version
- path / entryPoint / target
- inputs, vars (GitHub Actions) / params (Tekton) / substitutions (Google Cloud Build)
Are just certain fields (like parameters or sourceToBuild) going to be provider specific? Just looking at this it feels like we have defined some "common" fields but actual provenance generated in practice will look very different for each provider so I wonder if we've actually made our lives easier. I can see @steiza's point that UIs could be implemented more easily, but I'm not sure this actually makes slsa-verifier's life that much easier.

Maybe I'd just like to see some full examples of what each provider's provenance would look like (at least the ones we know about) before fully endorsing this (I could maybe do a GHA one if we want to split up the work). Maybe if we had a proposals/ or experimental/ directory or something we could make some PRs that could be reviewed in there without actually committing to spec or needing a separate repo.

/cc @laurentsimon

Aug 10 '23 00:08 ianlewis

Yes, I think we need real-world examples before deciding on everything.

We do have https://github.com/slsa-framework/slsa-proposals. What about creating a proposal there? That would allow us to iterate on the design and add example files.

Aug 10 '23 13:08 MarkLodato

Sent out https://github.com/slsa-framework/slsa-proposals/pull/16 that is just a copy of the first comment. @ianlewis would that help us iterate?

Aug 10 '23 21:08 MarkLodato

Sent out slsa-framework/slsa-proposals#16 that is just a copy of the first comment. @ianlewis would that help us iterate?

Yeah, I think so. Thanks.

Aug 15 '23 02:08 ianlewis

This is a great proposal! I believe this generic schema will make both provenance producers' and consumers' lives much easier.

I have a question about the platform-specific variables. They are normally referenced directly in the build config, but their values are not specified in the build config. One good example is GitHub context variables. Users can reference those variables directly inside their config (example), and GitHub will replace those variables with the actual values under the hood while executing the workflow.

How should we capture those variables in the provenance? Two options are coming to my mind:

Option 1: Capture only the user-provided raw build config in externalParameters, which contains only references to those context variables and not their actual values, and have internalParameters capture the context variables and their values.
- One downside is that internalParameters is for debugging purposes only, per the SLSA spec. What if people want to verify those context variables later on?
Option 2: externalParameters captures the "resolved" build config in which the context variables are replaced with their actual values.
- Con 1: This does not work for the remote build config case because buildConfigSource and buildConfig are mutually exclusive.
- Con 2: If we put the resolved config into externalParameters, we lose some information about the raw user input.

Looking for more thoughts and feedback. Thanks!! cc @lbernick

Aug 23 '23 20:08 chuangw6

buildConfigSource.ref / label / version Could be a mutable label or an immutable digest—whichever the tenant specified.

I am wondering if a mutable revision here will be a concern for the downstream consumers of the provenance. cc @AdamZWu

Aug 23 '23 20:08 chuangw6

How should we capture those variables in the provenance? Two options are coming to my mind:

Perhaps I'm misunderstanding the question, but the provenance spec says:

externalParameters = all variables provided by an external entity / user
internalParameters = all variables provided by the build platform itself, represented by builder.id

So on GitHub Actions, for example, externalParameters.parameters contains vars and inputs (the two types of user-defined variables), while internalParameters contains other things needed for reproducibility, e.g. github.event_name.
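
A minimal sketch of that split, assuming the provenance generator already has the user inputs, the repository or organization vars, and the github context as plain dicts. The function name, the sample values, and the exact shape of internalParameters are illustrative, not taken from any real generator.

```python
def split_github_parameters(inputs, repo_vars, github_context):
    """Place user-defined variables externally and platform-provided ones internally."""
    external = {
        "parameters": {
            "inputs": dict(inputs),   # provided by the user via UI / CLI / API
            "vars": dict(repo_vars),  # provided via the repository or organization
        }
    }
    internal = {
        # Provided by the platform itself; recorded for reproducibility.
        "github": {"event_name": github_context["event_name"]}
    }
    return external, internal


external_params, internal_params = split_github_parameters(
    inputs={"version": "1.2.3"},
    repo_vars={"CHANNEL": "stable"},
    github_context={"event_name": "workflow_dispatch"},
)
```

The raw workflow can keep referencing context variables; the provenance just records who supplied each value.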

Does that help?

buildConfigSource.ref / label / version Could be a mutable label or an immutable digest—whichever the tenant specified.

I am wondering if a mutable revision here will be a concern for the downstream consumers of the provenance. cc @AdamZWu

The idea is that the externalParameters record precisely what the parameters were, while resolvedDependencies optionally records what those parameters resolved to. So if a build requested mutable label "main", and that resolved to hash abcd1234, then "main" should go in externalParameters and "abcd1234" should go in resolvedDependencies. The idea is that the externalParameters are the things under direct attacker control and thus SHOULD be validated against expectations. Whether to do validation of resolved dependencies is less clear cut.
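
Concretely, something like the following, where the repository, path, and helper name are made up and the field names follow the strawman. The policy compares only externalParameters against expectations and leaves the resolved commit alone.

```python
build_definition = {
    "externalParameters": {
        "buildConfigSource": {
            "repository": "https://github.com/example/project",  # made up
            "ref": "main",  # the mutable label the tenant asked for
            "path": "ci.yml",
        },
    },
    "resolvedDependencies": [
        {
            "uri": "git+https://github.com/example/project@main",
            "digest": {"gitCommit": "abcd1234"},  # what "main" resolved to
        },
    ],
}


def external_parameter_mismatches(build_definition, expected_source):
    """Return the buildConfigSource fields that differ from the expected values."""
    actual = build_definition["externalParameters"].get("buildConfigSource", {})
    return {
        field: {"expected": want, "actual": actual.get(field)}
        for field, want in expected_source.items()
        if actual.get(field) != want
    }


ok = external_parameter_mismatches(
    build_definition,
    {"repository": "https://github.com/example/project", "ref": "main", "path": "ci.yml"},
)
bad = external_parameter_mismatches(build_definition, {"ref": "release"})
```

A consumer that also cares about the exact commit could add a separate check on resolvedDependencies, but that would be a policy choice rather than something the schema forces.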

Aug 24 '23 17:08 MarkLodato

From @MarkLodato

Should parameters be standardized or type-specific? Maybe define standard names but allow other ones?

From @joshuagl

I think we should define standard names for parameters. I do like the idea of allowing others, but does that make it a separate buildTypeCategory?

From @ianlewis

Are just certain fields (like parameters or sourceToBuild) going to be provider specific? Just looking at this it feels like we have defined some "common" fields but actual provenance generated in practice will look very different for each provider so I wonder if we've actually made our lives easier.

I don't know how much benefit we would get from standardizing the parameter names as they will be specific to a particular CI system. Artifacts built from common CI/build systems should have common parameter names and these can be documented by those systems. I see the benefit of parameters as an available envelope for platforms to consistently place additional parameters which would be needed for re-triggering a build. These would likely not be critical for third parties ingesting the provenance.

While this is not as much of a concern for some systems (GitHub Actions? GCB?), with Tekton, the parameters can affect the build by enabling/disabling certain Tasks and/or by changing the behavior of Tasks. For example, if a Tekton Pipeline is configured such that a parameter can enable/disable hermetic builds without changing the builder.id, then this would have a large effect on the overall SLSA levels (separate issue on this). These types of parameters might benefit from some form of "standardization" due to their ultimate effect.
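
As a sketch of why a verifier might care, here is a check that a build ran with specific parameter values. The parameter name "hermetic" and its values are hypothetical (I'm not pointing at a real Tekton Pipeline), and I'm assuming Tekton params would live under parameters.params as in the strawman.

```python
def parameter_violations(external_parameters, required_values):
    """Return the names of required parameters that are missing or have another value."""
    params = external_parameters.get("parameters", {}).get("params", {})
    return sorted(
        name for name, want in required_values.items() if params.get(name) != want
    )


# Hypothetical policy: only accept builds that were run with hermetic mode on.
policy = {"hermetic": "true"}

hermetic_build = {"parameters": {"params": {"hermetic": "true", "image": "app"}}}
non_hermetic_build = {"parameters": {"params": {"hermetic": "false", "image": "app"}}}

passed = parameter_violations(hermetic_build, policy)
failed = parameter_violations(non_hermetic_build, policy)
```

Without an agreed name for this kind of parameter, each consumer would need per-Pipeline knowledge to write such a policy.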

Aug 29 '23 18:08 arewm