How strong of a guarantee is required for Isolated and Ephemeral Environment at SLSA 3?
A major difference between SLSA 2 and 3 is that the build steps run in an Isolated, Ephemeral environment. But how strong of a guarantee do we need to make to call something SLSA 3, given that network access is still allowed and thus some build steps could execute remotely? This can be solved at SLSA 4 by requiring a Hermetic environment with no network access, but at SLSA 3 it seems impossible to make a strong claim in general. Thus we need to figure out what is "good enough" for SLSA 3.
Motivation: GitHub Actions. A GitHub-hosted runner executes each job within an ephemeral, isolated VM, which would normally meet the corresponding SLSA 3 requirements. But there are several ways this can fail:
- The workflow may use a self-hosted runner for some of its jobs, which GitHub cannot make any guarantees about.
- More generally, build steps may call remote execution services, such as Google Cloud Build, which again GitHub cannot make any guarantees about.
- Build steps may open services that unintentionally allow influence from other jobs, breaking the Isolated guarantee.
In the face of this, if we have a non-falsifiable provenance that an artifact really was produced by a particular GitHub Actions workflow run, can we call it SLSA 3? Do we need to disallow self-hosted runners, for example? In general, how do we model this and what are the specific requirements?
I'm inclined to say that these requirements are "best effort" at SLSA 3, similar to the existing language for Hermetic. The controls and expectations should be clear such that a "normal" build is isolated and hermetic, and only a user doing something out of the ordinary would violate it.
So in GitHub Actions' case, I'm leaning towards the following (a sketch of a possible check for the first point appears after the list):
- Disallow self-hosted runners.
- Instruct users not to call out to remote execution.
- Instruct users not to open services that allow remote influence.
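To make the first point concrete, here's a minimal sketch of a pre-submission check that rejects workflows targeting self-hosted runners. It assumes PyYAML and only catches the explicit `self-hosted` label in `runs-on`; custom labels that happen to map to self-hosted runners would need platform-side knowledge.

```python
# Hypothetical pre-submission check: flag jobs that target self-hosted
# runners, so that every job runs on a GitHub-hosted (ephemeral, isolated) VM.
import sys

import yaml  # PyYAML


def self_hosted_jobs(workflow_path: str) -> list[str]:
    """Return names of jobs whose `runs-on` includes the `self-hosted` label."""
    with open(workflow_path) as f:
        workflow = yaml.safe_load(f)
    offending = []
    for name, job in (workflow.get("jobs") or {}).items():
        runs_on = job.get("runs-on", [])
        # `runs-on` may be a single label or a list of labels.
        labels = [runs_on] if isinstance(runs_on, str) else list(runs_on)
        if "self-hosted" in labels:
            offending.append(name)
    return offending


if __name__ == "__main__":
    bad = self_hosted_jobs(sys.argv[1])
    if bad:
        sys.exit(f"ERROR: jobs target self-hosted runners: {', '.join(bad)}")
```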
Any thoughts?
/cc @laurentsimon @asraa
This was presented at today's meeting (2021-03-17).
+1 to this question - from a Tekton perspective we are wondering how to meet this requirement as well. Without disallowing network access, it seems unreasonable to verify that user-provided builds are not calling out to remote execution or running services themselves, so it seems reasonable to me to make this best effort and focus the requirement on the executing platform rather than on the user-provided build specifics.
Tekton (and any other Kubernetes-based CI/CD system that relies on volumes) has an interesting challenge around isolation: it is possible for a volume to be mounted by multiple builds (in our case, pods), which could then influence each other, e.g. something like:
- pipeline 1 starts, planning to execute tasks 1a, 1b
- pipeline 2 starts, planning to execute tasks 2a, 2b
- 1a fetches source onto volume X
- 2a also fetches source onto volume X (if both 1a and 2a were trying to use the same single-writer volume, one would wait for the other to complete, and then execute)
- 1b builds an artifact from volume X (as soon as 1a completed, Tekton would attempt to run 1b, which presumably would be scheduled after 2a, since both 1b and 2a need the same volume and 2a was already waiting for it) <-- the artifact resulting from pipeline 1 has been influenced by pipeline 2 (and vice versa)
In some ways this is controlled by the user (who would be responsible for having decided to run two pipelines simultaneously using the same volume), but it also seems like something the platform could/should prevent in order to meet the Isolated requirement. So we're looking into ways in Tekton of ensuring that volumes aren't tampered with by rogue pods while a pipeline using them is executing.
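For illustration, here is a rough sketch of the kind of guard the platform could apply: refuse to schedule a pod that mounts a volume currently held by a different PipelineRun. All names here (`Pod`, `claim_name`, `pipeline_run`) are hypothetical, not the actual Tekton scheduler API.

```python
# Sketch of a platform-side guard against the volume-sharing race above:
# a pod may only be scheduled if no pod from a *different* PipelineRun
# currently mounts the same volume.
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    pipeline_run: str  # PipelineRun that owns this pod
    claim_name: str    # PersistentVolumeClaim the pod mounts


def can_schedule(candidate: Pod, running: list[Pod]) -> bool:
    """True iff no other PipelineRun currently holds the candidate's volume."""
    return all(
        pod.claim_name != candidate.claim_name
        or pod.pipeline_run == candidate.pipeline_run
        for pod in running
    )


# Example: while pipeline 1 holds volume X, task 2a of pipeline 2 is held
# back, but pipeline 1's own next task (1b) may proceed.
running = [Pod("pipeline-1", "volume-x")]
assert not can_schedule(Pod("pipeline-2", "volume-x"), running)
assert can_schedule(Pod("pipeline-1", "volume-x"), running)
```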
Agree that we should label these as best effort at SLSA level 3.
> So in GitHub Actions' case, I'm leaning towards:
> - Disallow self-hosted runners.
> - Instruct users not to call out to remote execution.
> - Instruct users not to open services that allow remote influence.
Completely agree with 2 & 3. Given that GitHub provides a limited set of operating systems, I'm somewhat wary of completely disallowing self-hosted runners. If someone needs to build on CentOS, I'd rather they do it on GitHub with a self-hosted runner than on a poorly managed Jenkins instance. Are the existing Ephemeral and Isolated requirements not sufficient here? Is the main difference that when a self-hosted runner is involved we have to trust a build service which is GitHub + self-hosted runner, rather than just GitHub?
> Is the main difference that when a self-hosted runner is involved we have to trust a build service which is GitHub + self-hosted runner, rather than just GitHub?
Yes. One caveat is what ephemerality means. If you use a self-hosted runner on a k8s cluster (https://github.com/actions-runner-controller/actions-runner-controller), you may be scheduling pull-request and release workflows on different pods but on the same physical machine (e.g. on GKE). Escapes are not uncommon, especially via file-system reads (where k8s secrets are stored). Many things can go wrong.
Also: is a container a sufficient security boundary? The Ephemeral Environment says (emphasis added)
> The build service ensured that the build steps ran in an ephemeral environment, *such as a container or VM*, provisioned solely for this build, and not reused from a prior build.
But containers are not a strong security boundary, so we shouldn't recommend that. We need more precision here.
Perhaps we could say VM or sandboxed container such as gVisor?
This also gets into the issue that Ephemeral Environment and Isolated are really two sides of the same coin and need to be designed hand-in-hand. It's worth considering merging them into a single requirement.
> This also gets into the issue that Ephemeral Environment and Isolated are really two sides of the same coin and need to be designed hand-in-hand. It's worth considering merging them into a single requirement.
This makes a lot of sense. Containers satisfy Ephemeral, but mentioning them when Isolated is also a requirement almost feels disingenuous. Some container platforms can provide the security boundary, as mentioned, but we should be much clearer, and merging the requirements feels like a logical step to help service implementers evaluate them.
I think we're wandering into the distinction between Isolated (SLSA 3) and Hermetic (SLSA 4). If containers don't provide a good enough boundary for Isolated because it's possible to escape them and they share underlying storage, then neither do VMs of any colour. To satisfy the strictest interpretation of Isolated, you would need physically distinct hardware with no network paths between machines, which is beyond what I'd acceptably refer to as Hermetic. I think if the mechanism for isolating build processes is supposed to provide isolated environments but fails to do so because of a design or implementation flaw, we should consider those builds isolated as per the spec, and the failure a vulnerability in the mechanism. By that measure, I'd allow containers and most VMs to be enough to pass the Isolated bar, and containers and VMs with no access to networking (except by exploiting a vulnerability) to be enough to pass as Hermetic.
> I think if the mechanism for isolating build processes is supposed to provide isolated environments but fails to do so because of a design or implementation flaw, we should consider those builds isolated as per the spec, and the failure a vulnerability in the mechanism.
That is a good basis, but I do think we need some guidance on a minimum bar. For example, I don't think it's sufficient to just reuse environments with documentation saying "don't do that."
In general, I believe there is a spectrum of attacker capabilities and risk levels:
- Benign: Unwilling to violate documented policy or technical control (threat of firing or legal action), but may make a mistake
- Lowest: Willing to bypass simple technical controls with the skills of an average engineer (e.g. chroot)
- Low: Able to deploy common, readily available exploits with low security skill (e.g. are containers in this bucket?)
- Medium: Able to deploy 0-day exploits not requiring high skill (e.g. OS privilege escalation)
- High: Able to deploy rare/expensive exploits that require high skill (e.g. VM/sandbox breakouts, multiple layers of defense)
- Highest: Advanced persistent threats
Perhaps we should define such a scale (does one already exist?) and then offer guidance; a strawman encoding follows the questions below. Two questions:
- What is the expectation for Isolated? My feeling is somewhere between Low and High. I think VM is fine but chroot is not.
- Is this a recommendation or a requirement? Do we let consumers decide what is sufficient for them?
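As a strawman, the scale could be encoded directly, with each isolation mechanism mapped to the strongest attacker it's assumed to stop. The specific placements below (especially where plain containers land) are illustrative guesses for discussion, not settled policy.

```python
# Strawman encoding of the capability scale above. The mechanism-to-level
# assignments are illustrative, not settled.
from enum import IntEnum


class AttackerCapability(IntEnum):
    BENIGN = 0   # honest mistakes only
    LOWEST = 1   # bypasses simple controls (e.g. chroot)
    LOW = 2      # deploys common, readily available exploits
    MEDIUM = 3   # deploys 0-day exploits not requiring high skill
    HIGH = 4     # deploys rare/expensive, high-skill exploits
    HIGHEST = 5  # advanced persistent threat


# Strongest attacker each mechanism is assumed to stop (illustrative; the
# thread asks exactly where containers land on this scale).
STOPS_UP_TO = {
    "chroot": AttackerCapability.BENIGN,
    "container": AttackerCapability.LOWEST,
    "sandboxed container (e.g. gVisor)": AttackerCapability.MEDIUM,
    "vm": AttackerCapability.MEDIUM,
    "dedicated hardware": AttackerCapability.HIGH,
}


def meets_isolated(mechanism: str,
                   bar: AttackerCapability = AttackerCapability.MEDIUM) -> bool:
    """Does the mechanism stop attackers up to the chosen Isolated bar?

    The MEDIUM default reflects the feeling above that a VM is fine but
    chroot is not; consumers could choose a different bar.
    """
    return STOPS_UP_TO[mechanism] >= bar


assert meets_isolated("vm")
assert not meets_isolated("chroot")
```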
> Is this a recommendation or a requirement? Do we let consumers decide what is sufficient for them?
I am in favor of a recommendation. A lot of the time there is nuance to the threat model of the build system, and the technology keeps evolving. It could be helpful to provide additional recommendations on what properties are desired and how to evaluate the risk, so that auditors/users can assess it against their own risk appetite.
I wanted to surface a comment I made in PR 700 as it relates to this conversation --
In L3, we are trying to say that we are putting requirements on the build system itself (and those that administer/maintain it) only. We are not putting requirements on the users of the build system (the producers) and the system-supported build customizations they leverage. I was not trying to draw a distinction between default and tenant-customized build systems.
A future (higher) build system isolation requirement could be that the system is able to detect when changes have been made to the build system which might introduce insecurities. I see this as coming in two flavors: verifying that the modifications are still secure (harder) and preventing modifications, "invalidating SLSA" if they are present (easier). This definitely doesn't fit within L3, but L4 (or higher if there is a desire to go higher).
As this relates to self-hosted runners, their state would be irrelevant to L3. In order to achieve L4 with GitHub Actions, however, there will need to be some ability to verify the isolation requirements on the runners, or runners cannot be used. Based on the comments about hardening self-hosted runners, I would assume the latter is more likely.
I don't know if that needs to be documented anywhere for L3 or if it would just be better to include requirements for L4 when that comes around.