
Proposal: define an "ALL_CAPS" pseudo-capability to grant all capabilities

Open thaJeztah opened this issue 5 years ago • 6 comments

While the list of capabilities in the kernel has been relatively stable, new capabilities were added recently (CAP_PERFMON, CAP_BPF, and CAP_CHECKPOINT_RESTORE).

This proved to be a challenge: docker, for example, was updated to be aware of these new capabilities (and detects whether the kernel on which it's running supports them), but the current runc release (and possibly other runtimes) does not yet recognize them.

The specification currently defines that, in order to grant capabilities to a container process, the container configuration has to specify those capabilities:

capabilities (object, OPTIONAL) is an object containing arrays that specifies the sets of capabilities for the process. Valid values are defined in the [capabilities(7)][capabilities.7] man page, such as CAP_CHOWN. Any value which cannot be mapped to a relevant kernel interface MUST cause an error.

In most situations, this is not a problem. For example, if I'm running on a 5.8+ kernel and want to grant my container the CAP_BPF capability, I start the container with --cap-add CAP_BPF. Attempting to do the same on an older kernel version will produce an error (either generated by dockerd, or by runc).

However, when granting a container all capabilities (for example, when using --cap-add=ALL, or when running a container with --privileged), things become problematic.

In this situation, dockerd generates a list of all capabilities supported by the host's kernel, and sets those capabilities in the container configuration. On a 5.8+ kernel, this will include CAP_PERFMON and CAP_BPF (and, on 5.9+, CAP_CHECKPOINT_RESTORE). Docker has no way to detect which capabilities are supported by the runtime, and runc (or another runtime), for its part, processes the list of capabilities and produces an error for any "unknown" capability.

While docker could account for the runtime not supporting certain capabilities (which is what's currently done as a temporary solution in https://github.com/moby/moby/pull/41563), doing so is undesirable, as it would tightly couple docker to the runtime (and would complicate using alternative runtimes, such as crun, gVisor (runsc) or others).

Proposal

My proposal is to delegate generation of the "all capabilities" list to the runtime, and to include a special ALL_CAPS (just a suggestion, I'm not attached to the name) value in the specification.

  • runtimes that do not support the ALL_CAPS special value will consider it an "unknown capability" and produce an error (as defined by the specification).
  • runtimes that do support the ALL_CAPS special value will materialize the list of capabilities, and add all capabilities that the runtime (and active kernel) supports.
  • when combining ALL_CAPS with other capabilities (e.g. ALL_CAPS and CAP_CHOWN), ALL_CAPS must take precedence. Alternatively, this situation could be considered ambiguous, and an error could be produced (we should consider which is more future-proof in case additional "special" values are added in the future).
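To make the three rules above concrete, here is a minimal sketch of how a runtime might expand the special value. It is illustrative only and not taken from any runtime: `expand_capabilities`, the `strict` flag, and the `supported` list are hypothetical names, and `ALL_CAPS` is just the placeholder name suggested in this proposal.

```python
ALL_CAPS = "ALL_CAPS"  # hypothetical special value from this proposal


def expand_capabilities(requested, supported, strict=False):
    """Return the concrete capability list for one capability set.

    requested: names from the container config (e.g. the "bounding" array)
    supported: names that both the runtime and the running kernel know
    strict:    if True, ALL_CAPS mixed with other names is an error
               (the "ambiguous" alternative); otherwise ALL_CAPS wins
    """
    if ALL_CAPS in requested:
        if strict and len(set(requested)) > 1:
            raise ValueError("ALL_CAPS cannot be combined with other capabilities")
        # ALL_CAPS takes precedence; any other entries are ignored.
        return list(supported)

    known = set(supported)
    unknown = [cap for cap in requested if cap not in known]
    if unknown:
        # Existing spec rule: a value that cannot be mapped MUST cause an error.
        raise ValueError("unknown capabilities: " + ", ".join(unknown))
    return list(requested)
```

A runtime that does not know the special value would simply take the last branch and reject `ALL_CAPS` as unknown, which is the behaviour described in the first bullet.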

Compatibility and downsides

Ideally, docker would be able to detect what version of the runtime-spec is supported by a runtime, but this is likely a separate discussion to have.

As described above, runtimes that do not support the ALL_CAPS special value will produce an error. This could be considered a breaking change. On the other hand, the current situation already does not handle new capabilities being added to the list.

Having an ALL_CAPS capability makes the container configuration "non-declarative"; the meaning of "all" capabilities will depend on the runtime, and the kernel on which it's running. I don't think that's worse than the current situation, in which the same applies, only at a higher level (dockerd or containerd supporting the new capabilities).

thaJeztah avatar Oct 20 '20 11:10 thaJeztah

@mrunalp @vbatts @dqminh @hqhq @cyphar @giuseppe @crosbymichael @tianon PTAL

/cc @tonistiigi @cpuguy83

thaJeztah avatar Oct 20 '20 11:10 thaJeztah

I think a standard way to ask the runtime about what it supports might be better. The runtime could return a JSON doc with everything it supports in, and the runtime should always use a subset.

justincormack avatar Oct 20 '20 12:10 justincormack

alternative idea: what do you think about supporting the capability value in addition to its name?

e.g.

        "capabilities": {
            "bounding": [
                "CAP_CHOWN",
                "1",
                "CAP_DAC_READ_SEARCH",
                ...

the higher level runtimes could read the maximum value from /proc/sys/kernel/cap_last_cap and use it to fill the OCI configuration. An advantage is that it could be used on newer kernels without requiring changes in the OCI runtime.
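A rough sketch of what that could look like on the caller's side, assuming the runtime accepted decimal strings as capability values (which the spec does not currently define). The function names and the default choice of sets are hypothetical; only the /proc/sys/kernel/cap_last_cap path comes from the comment above.

```python
def read_cap_last_cap(path="/proc/sys/kernel/cap_last_cap"):
    """Read the highest capability number the running kernel supports (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


def numeric_capabilities(last_cap):
    """Return every capability value from 0 to last_cap as decimal strings."""
    if last_cap < 0:
        raise ValueError("last_cap must be non-negative")
    return [str(n) for n in range(last_cap + 1)]


def all_caps_config(last_cap, set_names=("bounding", "effective", "permitted")):
    """Build a "capabilities" object that grants every value in each named set.

    Which sets to fill is an illustrative assumption, not a recommendation.
    """
    caps = numeric_capabilities(last_cap)
    return {"capabilities": {name: list(caps) for name in set_names}}
```

On a Linux host this would be called as `all_caps_config(read_cap_last_cap())`. The caller needs no table of names at all, so a newer kernel is covered without updating either side.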

cap_from_name() seems to already support it

giuseppe avatar Oct 20 '20 14:10 giuseppe

Yes, I think numeric values would work (at a cost of not being very human-readable, but perhaps that's not the biggest concern 🤔)

thaJeztah avatar Oct 20 '20 14:10 thaJeztah

Having runc report it isn't bad, but I think in practice it is not very usable for this case.

runc's update cycle is very different from higher level runtimes, so we can:

  • cache (forever... until restart) the caps and get out of sync on update
  • have an expiring cache and query periodically and still have some potential for being out of sync
  • query at each run and incur significant container startup overhead.
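The three options differ only in how long a queried result is trusted. A small sketch, with a hypothetical `CapabilityCache` class and a `query` callable standing in for whatever mechanism asks the runtime:

```python
import time


class CapabilityCache:
    """Cache the runtime's capability list for a configurable time.

    ttl_seconds=None  cache until restart (never refreshed)
    ttl_seconds>0     expiring cache (stale for at most ttl_seconds)
    ttl_seconds=0     query on every call (no staleness, full cost each run)
    """

    def __init__(self, query, ttl_seconds, clock=time.monotonic):
        self._query = query      # callable returning the runtime's capability list
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the behaviour can be tested
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = self._fetched_at is None or (
            self._ttl is not None and now - self._fetched_at >= self._ttl
        )
        if expired:
            self._value = list(self._query())
            self._fetched_at = now
        return self._value
```

With any non-zero TTL, a runtime upgrade inside the window goes unnoticed until the entry expires, which is the out-of-sync risk described above.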

It would be nice not to depend on a library for an up-to-date listing of names, if for no other reason than this discrepancy in update cycles.

cpuguy83 avatar Oct 20 '20 16:10 cpuguy83

I think a standard way to ask the runtime about what it supports might be better. The runtime could return a JSON doc with everything it supports in, and the runtime should always use a subset.

Related: https://github.com/opencontainers/runc/pull/3296, which implemented a runc features subcommand to get that information from runc, and the related proposal in this repo: https://github.com/opencontainers/runtime-spec/pull/1130
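As a sketch of how a caller could use such a document to "always use a subset": the code below assumes the runtime's JSON has a `linux.capabilities` array of names. That layout is an assumption made for illustration, so check the linked PRs for the actual schema. The helper names are hypothetical.

```python
import json


def runtime_capabilities(features_json):
    """Extract the capability names from a features document (assumed layout)."""
    doc = json.loads(features_json)
    return doc.get("linux", {}).get("capabilities", [])


def filter_supported(wanted, features_json):
    """Keep only the wanted capabilities that the runtime reports, in order."""
    supported = set(runtime_capabilities(features_json))
    return [cap for cap in wanted if cap in supported]


# Example input in the assumed layout; not real runtime output.
sample = '{"linux": {"capabilities": ["CAP_CHOWN", "CAP_DAC_OVERRIDE"]}}'
kernel_caps = ["CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_BPF"]
usable = filter_supported(kernel_caps, sample)
```

Here `usable` drops CAP_BPF because the sample document does not list it, so the caller never sends the runtime a name it would reject.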

thaJeztah avatar Dec 20 '21 16:12 thaJeztah