Proposal: define an "ALL_CAPS" pseudo-capability to grant all capabilities
While the list of capabilities in the kernel has been relatively stable,
new capabilities were added recently (CAP_PERFMON, CAP_BPF, and CAP_CHECKPOINT_RESTORE).
This proved to be a challenge: docker, for example, was updated to be aware of these new capabilities (and detects whether the kernel on which it's running supports them), but the current runc release (and possibly other runtimes) does not yet recognize them.
The specification currently defines that, in order to grant capabilities to a container process, the container configuration has to specify those capabilities:
capabilities (object, OPTIONAL) is an object containing arrays that specifies the sets of capabilities for the process. Valid values are defined in the [capabilities(7)][capabilities.7] man page, such as CAP_CHOWN. Any value which cannot be mapped to a relevant kernel interface MUST cause an error.
In most situations, this is not a problem. For example, if I'm running on a 5.8+ kernel
and want to grant my container the CAP_BPF capability, I start the container with --cap-add CAP_BPF.
Attempting to do the same on an older kernel version will produce an error (either generated
by dockerd, or by runc).
However, when granting a container all capabilities (for example, when using
--cap-add=ALL, or when running a container with --privileged), things become
problematic.
In this situation, dockerd generates a list of all capabilities supported by the
host's kernel, and sets those capabilities in the container configuration. On a
5.8+ kernel, this will include CAP_PERFMON, CAP_BPF, and (from 5.9 onwards) CAP_CHECKPOINT_RESTORE.
Docker has no way to detect which capabilities are supported by the runtime, and
runc (or another runtime), for its part, processes the list of capabilities and
produces an error for any "unknown" capability.
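To make the failure mode concrete, here is a minimal sketch in Go. It is not runc's actual code: `runtimeKnown` and `validate` are hypothetical names, and the capability table is heavily abbreviated. It only illustrates how a list generated from a newer kernel trips the "unknown value MUST cause an error" rule in an older runtime.

```go
package main

import "fmt"

// runtimeKnown is a hypothetical, abbreviated table of the capability names
// compiled into an older runtime. It predates CAP_PERFMON, CAP_BPF, and
// CAP_CHECKPOINT_RESTORE.
var runtimeKnown = map[string]bool{
	"CAP_CHOWN":      true,
	"CAP_SYS_ADMIN":  true,
	"CAP_AUDIT_READ": true,
}

// validate mimics the spec rule that any value which cannot be mapped
// must cause an error.
func validate(requested []string) error {
	for _, name := range requested {
		if !runtimeKnown[name] {
			return fmt.Errorf("unknown capability %q", name)
		}
	}
	return nil
}

func main() {
	// Abbreviated stand-in for the list a higher-level engine might generate
	// for --privileged on a recent kernel.
	fromKernel := []string{
		"CAP_CHOWN", "CAP_SYS_ADMIN", "CAP_AUDIT_READ",
		"CAP_PERFMON", "CAP_BPF", "CAP_CHECKPOINT_RESTORE",
	}
	fmt.Println(validate(fromKernel))     // fails on the first name the runtime does not know
	fmt.Println(validate(fromKernel[:3])) // only names the runtime knows
}
```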
While docker could account for the runtime not supporting certain capabilities
(which is what's currently done as a temporary solution in https://github.com/moby/moby/pull/41563),
doing so is undesirable, as it would tightly couple docker to the runtime (and would complicate
using alternative runtimes, such as crun, gVisor (runsc), or others).
Proposal
My proposal is to delegate generation of the "all capabilities" list to the runtime,
and to include a special ALL_CAPS (just a suggestion, I'm not attached to the name)
value in the specification.
- Runtimes that do not support the ALL_CAPS special value consider it an "unknown capability" and will produce an error (as defined by the specification).
- Runtimes that do support the ALL_CAPS special value will materialize the list of capabilities and add all capabilities that the runtime (and active kernel) supports.
- When combining ALL_CAPS with other capabilities (e.g. ALL_CAPS and CAP_CHOWN), ALL_CAPS must take precedence. Alternatively, this situation could be considered ambiguous, and an error can be produced (we should consider what's more future-proof in case additional "special" values are to be added in future).
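The three rules above could be sketched roughly as follows. This is only an illustration under assumptions: `expand`, its `strict` flag, and the `supported` list are hypothetical, and ALL_CAPS is just the name suggested in this proposal, not something defined by the specification. The `strict` flag switches between the two options for mixed lists (error versus ALL_CAPS taking precedence).

```go
package main

import (
	"errors"
	"fmt"
)

// allCaps is the special value suggested in the proposal (hypothetical).
const allCaps = "ALL_CAPS"

// expand returns the effective list for one capability set.
// supported is whatever the runtime (and active kernel) supports.
// With strict set, mixing ALL_CAPS with other values is an error;
// otherwise ALL_CAPS takes precedence over the other values.
// Validation of ordinary names is left out to keep the sketch short.
func expand(requested, supported []string, strict bool) ([]string, error) {
	hasAll := false
	for _, c := range requested {
		if c == allCaps {
			hasAll = true
			break
		}
	}
	if !hasAll {
		return requested, nil
	}
	if strict && len(requested) > 1 {
		return nil, errors.New("ALL_CAPS cannot be combined with other capabilities")
	}
	out := make([]string, len(supported))
	copy(out, supported)
	return out, nil
}

func main() {
	supported := []string{"CAP_CHOWN", "CAP_SYS_ADMIN", "CAP_BPF"}

	caps, err := expand([]string{allCaps}, supported, true)
	fmt.Println(caps, err)

	_, err = expand([]string{allCaps, "CAP_CHOWN"}, supported, true)
	fmt.Println(err)
}
```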
Compatibility and downsides
Ideally, docker would be able to detect what version of the runtime-spec is supported by a runtime, but this is likely a separate discussion to have.
As described above, runtimes that do not support the ALL_CAPS special value
will produce an error. This could be considered a breaking change; on the other
hand, the current situation already fails to handle new capabilities being added
to the list.
Having an ALL_CAPS capability makes the container configuration "non-declarative";
the meaning of "all" capabilities will depend on the runtime, and the kernel on
which it's running. I don't think that's worse than the current situation, in
which the same applies, only at a higher level (dockerd or containerd supporting
the new capabilities).
@mrunalp @vbatts @dqminh @hqhq @cyphar @giuseppe @crosbymichael @tianon PTAL
/cc @tonistiigi @cpuguy83
I think a standard way to ask the runtime about what it supports might be better. The runtime could return a JSON doc listing everything it supports, and the runtime should always use a subset.
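As a rough sketch of how a caller might consume such a document: the JSON shape below (a single "capabilities" array) and the names `supportDoc` and `subset` are assumptions made up for illustration, not an existing format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// supportDoc is a hypothetical shape for the document a runtime could return.
type supportDoc struct {
	Capabilities []string `json:"capabilities"`
}

// subset keeps only those wanted capabilities that the runtime reports
// as supported, preserving the order of wanted.
func subset(wanted []string, doc []byte) ([]string, error) {
	var d supportDoc
	if err := json.Unmarshal(doc, &d); err != nil {
		return nil, err
	}
	known := make(map[string]bool, len(d.Capabilities))
	for _, c := range d.Capabilities {
		known[c] = true
	}
	var out []string
	for _, c := range wanted {
		if known[c] {
			out = append(out, c)
		}
	}
	return out, nil
}

func main() {
	// Sample document; a real one would come from the runtime.
	doc := []byte(`{"capabilities": ["CAP_CHOWN", "CAP_SYS_ADMIN"]}`)
	caps, err := subset([]string{"CAP_CHOWN", "CAP_BPF", "CAP_SYS_ADMIN"}, doc)
	fmt.Println(caps, err)
}
```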
alternative idea: what do you think about supporting the capability value in addition to its name?
e.g.
    "capabilities": {
        "bounding": [
            "CAP_CHOWN",
            "1",
            "CAP_DAC_READ_SEARCH",
            ...
The higher-level runtimes could read the maximum value from /proc/sys/kernel/cap_last_cap and use it to fill in the OCI configuration. An advantage is that this would work on newer kernels without requiring changes in the OCI runtime.
cap_from_name() seems to already support it
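A minimal sketch of the higher-level side of this idea, assuming the numeric form were accepted by the runtime. `allNumeric` is a hypothetical helper, and the sample string stands in for the contents of /proc/sys/kernel/cap_last_cap, which a real caller would read from the file.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// allNumeric turns the contents of /proc/sys/kernel/cap_last_cap
// (a single decimal number followed by a newline) into the list
// "0", "1", ..., up to and including that number.
func allNumeric(capLastCap string) ([]string, error) {
	last, err := strconv.Atoi(strings.TrimSpace(capLastCap))
	if err != nil {
		return nil, fmt.Errorf("parsing cap_last_cap: %w", err)
	}
	if last < 0 {
		return nil, fmt.Errorf("invalid cap_last_cap value %d", last)
	}
	out := make([]string, 0, last+1)
	for i := 0; i <= last; i++ {
		out = append(out, strconv.Itoa(i))
	}
	return out, nil
}

func main() {
	// Sample contents; 40 is the number of CAP_CHECKPOINT_RESTORE.
	caps, err := allNumeric("40\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(caps), caps[0], caps[len(caps)-1])
}
```

The caller never needs a name table here, which is the point of the suggestion: the list follows the kernel rather than whatever names a library happens to know.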
Yes, I think numeric values would work (at a cost of not being very human-readable, but perhaps that's not the biggest concern 🤔)
Having runc report it isn't bad, but I think in practice it is not very usable for this case.
runc's update cycle is very different from higher level runtimes, so we can:
- cache (forever... until restart) the caps and get out of sync on update
- have an expiring cache and query periodically and still have some potential for being out of sync
- query at each run and incur significant container startup overhead.
It would be nice not to depend on a library for an up-to-date listing of names, if for no other reason than this discrepancy in update cycles.
> I think a standard way to ask the runtime about what it supports might be better. The runtime could return a JSON doc listing everything it supports, and the runtime should always use a subset.
Related: https://github.com/opencontainers/runc/pull/3296, which implemented a runc features subcommand to get that information from runc, and a related proposal in this repo: https://github.com/opencontainers/runtime-spec/pull/1130