node_exporter OCI images can't run as non-root and use CAP_PERFMON in kubernetes

Open rtreffer opened this issue 2 months ago • 4 comments

TL;DR

You can't use the perf collector in kubernetes without root. Adding CAP_PERFMON to the container is not enough: a non-root process only gains the capability if the container holds an executable file that carries CAP_PERFMON as a file capability. Giving the node_exporter binary CAP_PERFMON works as expected.

Desired outcome

Enable perf collector for instructions and cycles metrics while running node-exporter:

  • in kubernetes (1.32, cri-o 1.32)
  • as non-root (e.g. nobody)
  • with CAP_PERFMON

In other words: use minimal privileges for node_exporter. Changing the perf_event_paranoid sysctl is still required; however, a value of 1 is sufficient (no unprivileged access is needed; higher values may work, but I haven't tested them yet).

Testing the status quo

I am testing on Ubuntu 24.04 LTS, with kubernetes + cri-o 1.32 from the official deb packages and kubeadm with default settings. The only adjustments are setting the perf_event_paranoid sysctl to 1 and disabling swap.

The test DaemonSet YAML (derived from the community helm chart) is:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
    spec:
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:latest
          imagePullPolicy: Always
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --collector.disable-defaults
            - --collector.perf
            - --collector.perf.hardware-profilers=CpuCycles,CpuInstr,CacheRef,CacheMisses,BranchInstr,BranchMisses,StalledCyclesBackend,StalledCyclesFrontend,RefCpuCycles
            - --collector.perf.software-profilers=PageFault,ContextSwitch,CpuMigration,MinorFault,MajorFault
            - --collector.perf.cache-profilers=InstrTLBReadHit
          securityContext:
            capabilities:
              add: [ "PERFMON" ]
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
      tolerations:
        - effect: NoSchedule
          operator: Exists
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys

The tolerations are needed to run on the local single control plane node.

Capabilities as non-root

The setup leads to the following permissions:

# egrep '^(Uid|Gid|Cap|NoNewPrivs)' /proc/$(pidof node_exporter)/status
Uid:    65534   65534   65534   65534
Gid:    65534   65534   65534   65534
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000040000005fb
CapAmb: 0000000000000000
NoNewPrivs:     0

The privileges are locked down: the bounding set includes CAP_PERFMON (bit 38, the 0x4000000000 part of 00000040000005fb), but the permitted and effective sets are empty, and the container is running as non-root (nobody / 65534).

node_exporter is not able to export perf collector metrics because the process cannot get CAP_PERFMON into its effective set.
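To make the hex masks in /proc/<pid>/status easier to read, here is a minimal sketch that decodes a mask into capability names. The bit numbers come from linux/capability.h; the table is deliberately a subset, and the helper name `decode_caps` is made up for this illustration.

```python
from typing import List

# Capability bit numbers from linux/capability.h (subset, enough for the
# masks shown in this issue). Unknown bits are reported as "cap_<n>".
CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    3: "CAP_FOWNER",
    4: "CAP_FSETID",
    5: "CAP_KILL",
    6: "CAP_SETGID",
    7: "CAP_SETUID",
    8: "CAP_SETPCAP",
    10: "CAP_NET_BIND_SERVICE",
    21: "CAP_SYS_ADMIN",
    38: "CAP_PERFMON",
}


def decode_caps(hex_mask: str) -> List[str]:
    """Turn a CapXxx hex string from /proc/<pid>/status into names."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while (mask >> bit) != 0:
        if (mask >> bit) & 1:
            names.append(CAP_NAMES.get(bit, "cap_%d" % bit))
        bit += 1
    return names


# Example: decode_caps("00000040000005fb") includes "CAP_PERFMON",
# while decode_caps("00000000000005fb") does not.
```

Applied to the output above, the bounding set decodes to the nine default container capabilities plus CAP_PERFMON, while CapPrm and CapEff decode to an empty list.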

Capabilities as root

Adding a security context that runs the pod as root

      securityContext:
        runAsNonRoot: false
        runAsUser: 0

leads to the following permissions:

# egrep '^(Uid|Gid|Cap|NoNewPrivs)' /proc/$(pidof node_exporter)/status
Uid:    0       0       0       0
Gid:    0       0       0       0
CapInh: 0000000000000000
CapPrm: 00000040000005fb
CapEff: 00000040000005fb
CapBnd: 00000040000005fb
CapAmb: 0000000000000000
NoNewPrivs:     0

Note that the effective capabilities are now in line with the bounding set. perf collector metrics work.

No capabilities and root

Changing the container security context to

          securityContext:
            capabilities:
              add: [ ]

and keeping root leads to

# egrep '^(Uid|Gid|Cap|NoNewPrivs)' /proc/$(pidof node_exporter)/status
Uid:    0       0       0       0
Gid:    0       0       0       0
CapInh: 0000000000000000
CapPrm: 00000000000005fb
CapEff: 00000000000005fb
CapBnd: 00000000000005fb
CapAmb: 0000000000000000
NoNewPrivs:     0

There are no perf collector metrics. root in a container alone is not enough to get the metrics (this is good).

Results of the status quo

  • CAP_PERFMON makes it possible to fetch performance metrics
    • even under restricted (<4, >0) conditions
  • root is required to get CAP_PERFMON into effect
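The interplay between the paranoid level and CAP_PERFMON in the first bullet can be sketched roughly as follows. This is a simplified model based on the kernel's sysctl documentation for perf_event_paranoid (CPU-wide events are refused at level >= 1 unless the process has CAP_PERFMON). The function names are invented for illustration, and `sys_admin_only_from` is a hypothetical knob for distribution patches that add stricter levels such as the value 4 mentioned above; its behaviour is an assumption, not something checked against kernel source.

```python
from typing import Optional

PARANOID_PATH = "/proc/sys/kernel/perf_event_paranoid"


def read_paranoid(path: str = PARANOID_PATH) -> int:
    """Read the current perf_event_paranoid level from procfs."""
    with open(path) as fh:
        return int(fh.read().strip())


def cpu_events_allowed(
    paranoid: int,
    has_perfmon: bool,
    has_sys_admin: bool = False,
    sys_admin_only_from: Optional[int] = None,
) -> bool:
    """Simplified model: may this process open CPU-wide perf events?"""
    # CAP_SYS_ADMIN is treated as sufficient in every case.
    if has_sys_admin:
        return True
    # ASSUMPTION: a patched kernel may refuse everything at or above this
    # level unless CAP_SYS_ADMIN is held (hypothetical parameter).
    if sys_admin_only_from is not None and paranoid >= sys_admin_only_from:
        return False
    # With CAP_PERFMON the paranoid level does not restrict CPU events.
    if has_perfmon:
        return True
    # Without CAP_PERFMON, CPU event access is only allowed below level 1.
    return paranoid < 1
```

For the test setup above this corresponds to `cpu_events_allowed(1, has_perfmon=True)`, which the model reports as allowed, whereas the same level without the capability is refused.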

Overcoming root

What we have:

CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000040000005fb

What we want:

CapPrm: 0000004000000000
CapEff: 0000004000000000
CapBnd: 00000040000005fb

man capabilities has extensive documentation on how CapPrm and CapEff are calculated when a process executes a binary. There are two ways in which we could gain CAP_PERFMON:

  • Set file capability (setcap cap_perfmon=ep node_exporter)
  • Use a wrapper that configures the ambient and inheritable set and then executes the binary

TL;DR: non-root usage requires some executable file that carries the CAP_PERFMON file capability
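The execve() rules from capabilities(7) can be written down as a small model, which reproduces the masks seen in this issue. This is a sketch under simplifying assumptions: the `Caps` class and `execve_caps` function are hypothetical names, set-user-ID handling and securebits are ignored, and root is modelled as a single flag.

```python
from dataclasses import dataclass

CAP_PERFMON = 1 << 38
ALL_CAPS = (1 << 64) - 1


@dataclass
class Caps:
    """Capability sets of a process, as bit masks."""
    inheritable: int = 0
    permitted: int = 0
    effective: int = 0
    bounding: int = 0
    ambient: int = 0


def execve_caps(proc: Caps, file_inh: int = 0, file_prm: int = 0,
                file_eff: bool = False, root: bool = False) -> Caps:
    """Model the capability transformation during execve()."""
    # A file with any capabilities attached clears the ambient set.
    privileged_file = bool(file_inh or file_prm or file_eff)
    new_ambient = 0 if privileged_file else proc.ambient

    # For root, the file sets are notionally all ones and the effective
    # bit is notionally set (simplified: one flag for real/effective UID 0).
    if root:
        f_inh, f_prm, f_eff = ALL_CAPS, ALL_CAPS, True
    else:
        f_inh, f_prm, f_eff = file_inh, file_prm, file_eff

    new_permitted = ((proc.inheritable & f_inh)
                     | (f_prm & proc.bounding)
                     | new_ambient)

    # Safety check for capability-dumb binaries: if the file effective bit
    # is set but not all file permitted capabilities were obtained,
    # execve() fails with EPERM.
    if file_eff and (new_permitted & file_prm) != file_prm:
        raise PermissionError("execve would fail with EPERM")

    new_effective = new_permitted if f_eff else new_ambient
    return Caps(inheritable=proc.inheritable, permitted=new_permitted,
                effective=new_effective, bounding=proc.bounding,
                ambient=new_ambient)
```

With a bounding set of 0x40000005fb, the model yields empty permitted and effective sets for a plain non-root exec, the full bounding set for root, and exactly CAP_PERFMON for a non-root exec of a file with cap_perfmon=ep. If the bounding set lacks CAP_PERFMON, the last case raises instead.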

Note that busybox does not support the PERFMON capability, at least not by name. I also could not find any binaries in /bin/ that would have that capability or special privileges (overall good).

Let's try the file capability approach instead by building a node-exporter:perfmon image.

# cat Dockerfile
ARG ARCH="amd64"
ARG OS="linux"

FROM quay.io/prometheus/node-exporter:latest

FROM ubuntu:24.04
COPY --from=0 /bin/node_exporter /srv/node_exporter
RUN  apt update && apt install -y libcap2-bin && \
     setcap cap_perfmon=ep /srv/node_exporter

FROM quay.io/prometheus/busybox-${OS}-${ARCH}:latest
COPY --from=1 /srv/node_exporter /bin/node_exporter

EXPOSE      9100
USER        nobody
ENTRYPOINT  [ "/bin/node_exporter" ]
# podman build -t localhost/node-exporter:perfmon .

Running the patched image as non-root leads to the following security permissions:

# egrep '^(Uid|Gid|Cap|NoNewPrivs)' /proc/$(pidof node_exporter)/status
Uid:    65534   65534   65534   65534
Gid:    65534   65534   65534   65534
CapInh: 0000000000000000
CapPrm: 0000004000000000
CapEff: 0000004000000000
CapBnd: 00000040000005fb
CapAmb: 0000000000000000
NoNewPrivs:     0

This also yields working perf collector metrics. Removing CAP_PERFMON from the container's capabilities would lead to a startup error.
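To double-check that a binary really carries the file capability, the security.capability extended attribute can be decoded directly. The sketch below parses the version 2 and 3 on-disk layout from linux/capability.h (a little-endian magic word followed by two permitted/inheritable pairs); `parse_file_caps` is a name invented for this example.

```python
import struct

VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_REVISION_2 = 0x02000000
VFS_CAP_REVISION_3 = 0x03000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001


def parse_file_caps(raw: bytes) -> dict:
    """Decode a security.capability xattr value (revision 2 or 3)."""
    # Layout: magic_etc, then {permitted, inheritable} for the low and
    # high 32 bits. Revision 3 appends a root id, which is ignored here.
    magic, prm_lo, inh_lo, prm_hi, inh_hi = struct.unpack_from("<5I", raw)
    revision = magic & VFS_CAP_REVISION_MASK
    if revision not in (VFS_CAP_REVISION_2, VFS_CAP_REVISION_3):
        raise ValueError("unsupported capability revision 0x%08x" % revision)
    return {
        "permitted": prm_lo | (prm_hi << 32),
        "inheritable": inh_lo | (inh_hi << 32),
        "effective": bool(magic & VFS_CAP_FLAGS_EFFECTIVE),
    }


# Usage on Linux (path taken from the Dockerfile above):
#   import os
#   raw = os.getxattr("/bin/node_exporter", "security.capability")
#   print(parse_file_caps(raw))
```

For a binary prepared with setcap cap_perfmon=ep, the expectation is a permitted mask of 0x4000000000 with the effective flag set.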

Question / Path forward

How should we proceed with this? I am happy to help, but what would be the desired path forward?

  • CAP_PERFMON was added with Linux 5.8 (2020), older kernels with long term support exist
  • This may exclude using CAP_PERFMON by default or shipping images with this enabled by default (TODO: test this)
  • Startup failures without CAP_PERFMON would be annoying for users :-/
  • Running node_exporter as non-root with working perf metrics seems highly desirable as instructions per second and instructions per cycle are important performance metrics
  • Usefulness might be limited if the sysctl change is always needed....

rtreffer avatar Nov 09 '25 16:11 rtreffer

I think the reason why CAP_PERFMON does not work with the Ubuntu default perf_event_paranoid value of 4 is a custom patch that fails perf syscalls without CAP_SYS_ADMIN.

rtreffer-rddt avatar Nov 10 '25 10:11 rtreffer-rddt

Filed an Ubuntu bug for the fact that cap_perfmon is insufficient by default: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2131046

rtreffer avatar Nov 10 '25 21:11 rtreffer

@rtreffer So you're saying that setting cap_perfmon on the binary is only required on Ubuntu? If that's the case and that patch is fixed, there is nothing to do for us, right?

discordianfish avatar Nov 12 '25 09:11 discordianfish

TL;DR: I would suggest we recommend CAP_PERFMON instead of lowering the paranoid level, but that is currently ineffective due to bugs.

I am still wrapping my head around this though.

CAP_PERFMON should always be enough to use the perf collector. Whether it is required depends on the default setting of perf_event_paranoid.

However, Debian and Ubuntu have patches that are buggy and require CAP_SYS_ADMIN. Ubuntu has this enabled by default (I haven't checked Debian yet).

Ubuntu now has a patch floating around to fix this.

I think that at least the README should be updated once this settles. It is better to give node_exporter very specific permissions than opening up the perf interface to everyone as there have been CVEs related to the perf interface in the past.

I haven't checked the broader default for that sysctl but would expect that most systems have some paranoid level enabled and CAP_PERFMON should be preferred over lowering the paranoid level (lowering the paranoid setting is equivalent to giving CAP_PERFMON to everyone).

I guess my next TODO will be to verify the wider ecosystem a bit.

Overall I would like to see a node_exporter that can be pulled easily and can export the perf data. But this might be too much of a corner use case.

rtreffer avatar Nov 12 '25 09:11 rtreffer