cri-dockerd

BestEffort pods are using swap

Open robertbotez opened this issue 1 year ago • 5 comments

What happened?

I already opened a ticket on the kubernetes repo, which led me here.

I was testing swap support and ran into unexpected behavior. The documentation specifies that only pods in the Burstable QoS class can use the host's swap memory. I created two single-replica ubuntu deployments, one in the Burstable class and one in the BestEffort class, and inside each pod I ran `stress --vm 1 --vm-bytes 6G --vm-hang 0` to observe memory consumption. The host has 4GB of RAM and 5GB of swap. In both cases, the pod started using swap after exhausting the host's RAM. Wasn't the BestEffort pod supposed to be killed when it reached the limit of the host's RAM? Note that the kubelet is configured with `swapBehavior: LimitedSwap`. I attached two screenshots showing the host's normal consumption and the consumption after running the stress command inside the pod (Screenshot 2024-03-27 at 12 02 30, Screenshot 2024-03-27 at 12 37 02).
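For context, my reading of KEP 2400 is that LimitedSwap should give each container a swap limit proportional to its memory request, which is why a BestEffort pod (no requests) should get no swap at all. A rough sketch of that proportion using this node's sizes (the formula is my understanding of the KEP, not cri-dockerd's actual code):

```shell
#!/bin/sh
# Sketch of the LimitedSwap limit as I understand KEP 2400:
#   swap_limit = memory_request * node_swap / node_mem
# A BestEffort container has no memory request, so its limit should be 0
# (i.e. memory.swap.max = 0), not "max".

node_mem=$((4 * 1024 * 1024 * 1024))   # this node: 4GB RAM
node_swap=$((5 * 1024 * 1024 * 1024))  # this node: 5GB swap

limited_swap() {
  request=$1
  echo $(( request * node_swap / node_mem ))
}

limited_swap 0                        # BestEffort: no request -> 0
limited_swap $((1024 * 1024 * 1024))  # Burstable, 1Gi request -> 1342177280
```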

What did you expect to happen?

I expected the BestEffort pod to be killed when it consumed more RAM than the host has available.

How can we reproduce it (as minimally and precisely as possible)?

  • setup a VM running ubuntu 22.04 with 4GB of RAM memory
  • set swap partition to 5GB
  • install docker, cri-dockerd and kubernetes packages using the provided versions
  • config kubelet with provided config
  • install calico cni
  • after the cluster is bootstrapped, deploy the following deployment
$ cat test.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ubuntu-deployment
  labels:
    app: ubuntu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ubuntu
  template:
    metadata:
      labels:
        app: ubuntu
    spec:
      containers:
      - name: ubuntu
        image: ubuntu:22.04
        resources: {}  # no requests or limits -> BestEffort QoS
        command: [ "/bin/bash", "-c", "--" ]
        args: [ "while true; do sleep 30; done;" ]
  • this should deploy a BestEffort pod; you can check by running kubectl get pod <pod-name> --output=yaml
  • exec into the pod and run apt update && apt install stress, then run stress --vm 1 --vm-bytes 6G --vm-hang 0
  • check which node the pod is running on with kubectl get po -o wide, then ssh to that node and run htop. You should see the deployed BestEffort pod consuming swap memory, which according to the docs it shouldn't.
  • if you exec into the pod and check memory.swap.max, it is set to max. From what I understand, even though swapBehavior is set to LimitedSwap in the kubelet, cri-dockerd may be setting the cgroup's memory.swap.max to max.
$ cat /sys/fs/cgroup/memory.swap.max 
max
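As a cross-check from the host side, the per-container value can also be read out of the pod cgroup tree rather than from inside the container. A small sketch (the `kubepods.slice` path is an assumption based on cgroup v2 with the systemd cgroup driver from my kubelet config):

```shell
#!/bin/sh
# Print memory.swap.max for every cgroup under a given root.
# On my node the pod cgroups should live under /sys/fs/cgroup/kubepods.slice
# (assumption: cgroup v2 + cgroupDriver: systemd).
list_swap_max() {
  root=$1
  find "$root" -name memory.swap.max \
    -exec sh -c 'printf "%s: " "$1"; cat "$1"' _ {} \;
}

# On the node itself:
# list_swap_max /sys/fs/cgroup/kubepods.slice
```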

Anything else we need to know?

I am using cgroup v2.

Here is my kubelet config.

$ cat /var/lib/kubelet/config.yaml 
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
 anonymous:
  enabled: false
 webhook:
  cacheTTL: 0s
  enabled: true
 x509:
  clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
 mode: Webhook
 webhook:
  cacheAuthorizedTTL: 0s
  cacheUnauthorizedTTL: 0s
cgroupDriver: systemd
clusterDNS:
- 10.96.0.10
clusterDomain: cluster.local
containerRuntimeEndpoint: ""
cpuManagerReconcilePeriod: 0s
enableServer: true
evictionPressureTransitionPeriod: 0s
failSwapOn: false
featureGates:
 NodeSwap: true
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageMaximumGCAge: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
logging:
 flushFrequency: 0
 options:
  json:
   infoBufferSize: "0"
 verbosity: 0
memorySwap:
 swapBehavior: LimitedSwap
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
resolvConf: /run/systemd/resolve/resolv.conf
rotateCertificates: true
runtimeRequestTimeout: 0s
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s

Kubernetes version

$ kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.3

Cloud provider

Hetzner Cloud, but Kubernetes was deployed using `kubeadm`.

OS version

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux fs-kube-dev-1 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Install tools

$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"29", GitVersion:"v1.29.3", GitCommit:"6813625b7cd706db5bc7388921be03071e1a492d", GitTreeState:"clean", BuildDate:"2024-03-15T00:06:16Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}

Container runtime (CRI) and version (if applicable)

$ cri-dockerd --version
cri-dockerd 0.3.11 (9a8a9fe)

Related plugins (CNI, CSI, ...) and versions (if applicable)

calico: version: 3.27.2

robertbotez avatar Apr 01 '24 07:04 robertbotez

/cc

iholder101 avatar Apr 01 '24 09:04 iholder101

This is because KEP 2400 was never supported, as best I can tell.

neersighted avatar Apr 03 '24 13:04 neersighted

Yeah, it's more of a feature request for KEP 2400. I was hoping someone in the cri-dockerd community could explore implementing this?

kannon92 avatar Apr 03 '24 18:04 kannon92

PRs are welcome, and a couple of the regular contributors have done other KEP enablement work and might be interested in picking this up (but also I can't speak for their interest or priorities).

neersighted avatar Apr 03 '24 19:04 neersighted

I think this is a known issue; Docker support is not a requirement for adding new features.

  • https://github.com/Mirantis/cri-dockerd/issues/185

  • https://kubernetes.io/blog/2021/11/26/qos-memory-resources/

Memory QoS in its Alpha phase is designed to support only containerd and CRI-O.

afbjorklund avatar Apr 03 '24 19:04 afbjorklund