
XFS statfs free-space grossly wrong vs superblock on 6.1.148 → kubelet false disk pressure

Open · mlushpenko opened this issue 4 months ago • 1 comment

Image I'm using:

Bottlerocket 1.47.0 (aws-k8s-1.32)
Kernel: 6.1.148 #1 SMP PREEMPT_DYNAMIC Mon Sep 8 23:27:00 UTC 2025 x86_64 GNU/Linux
Filesystem: XFS on /dev/nvme1n1p1 (300 GB volume)

What I expected to happen:

df / statfs should accurately report free/used space, consistent with the XFS superblock, so that kubelet does not incorrectly mark the node as having DiskPressure.

What actually happened:

statfs (and therefore df and kubelet) reports that ~250 GiB of the 300 GiB disk is used, while the XFS superblock reports only ~2.3 GiB used.
Kubelet evicts pods due to false DiskPressure.

Examples:

statfs (df / kubelet view):

findmnt -T /var/lib/kubelet
# TARGET SOURCE               FSTYPE OPTIONS
# /var   /dev/nvme1n1p1[/var] xfs    rw,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota

stat -f -c 'blocks=%b bfree=%f bavail=%a bsize=%s' /var/lib/kubelet
# blocks=78604289 bfree=12384292 bavail=12384292 bsize=4096
# → ~252 GiB used
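For reference, a minimal way to turn those statfs numbers into the used-space figure; the values below are copied from the output above, and the GiB conversion assumes the 4096-byte block size shown:

echo $(( (78604289 - 12384292) * 4096 / 1024 / 1024 / 1024 ))  # (blocks - bfree) * bsize, in GiB
# 252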

XFS superblock (authoritative):

xfs_db -r -c "sb 0" -c "p dblocks fdblocks" /dev/nvme1n1p1
# dblocks = 78642688
# fdblocks = 78047871
# → ~2.3 GiB used

All mountpoints (/local, /var, /opt, /mnt) are bind-mounts of the same device.

No large deleted-open files exist:

# PID=1851 COMM=dbus-broker-lau SIZE=2097152 FILE=/memfd:dbus-broker-log (deleted)
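(For reference, a rough sketch of how such a check can be done straight from /proc; this is not necessarily the exact command used above, and it has to run as root to see every process:)

for fd in /proc/[0-9]*/fd/*; do
  tgt=$(readlink "$fd" 2>/dev/null) || continue
  case "$tgt" in
    *" (deleted)")
      pid=${fd#/proc/}; pid=${pid%%/*}          # extract the PID from the /proc path
      comm=$(cat "/proc/$pid/comm" 2>/dev/null)
      size=$(stat -Lc %s "$fd" 2>/dev/null)     # size of the still-open, deleted file
      echo "PID=$pid COMM=$comm SIZE=$size FILE=$tgt"
      ;;
  esac
done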

How to reproduce the problem:

Run a bunch of pods and they will start getting evicted:

ctr -n k8s.io containers list | wc -l
# 92
ctr -n k8s.io images | wc -l
# 25
du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
# 13G     /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
du -sh /var/lib/containerd/io.containerd.content.v1.content/blobs/
# 3.5G    /var/lib/containerd/io.containerd.content.v1.content/blobs/
du -sh /run/containerd/
# 54G     /run/containerd/
df -h /var/lib/docker/
# Filesystem      Size  Used Avail Use% Mounted on
# /dev/nvme1n1p1  300G  255G   46G  85% /var

One more symptom - kubelet can't free up disk space:

usage=85 highThreshold=75 amountToFree=45866138419 lowThreshold=70
 "Failed to garbage collect required amount of images. Attempted to free 45867354931 bytes, but only found 0 bytes eligible to free."

I can cordon and discard the node, but if I can provide any more info to help track down a bug, please let me know.

mlushpenko · Sep 28 '25 02:09

I checked out the specific kernel version to look at how statfs is working, and it seems like maybe this is a feature, not a bug?

What we learned (a short sketch for reproducing these numbers from userspace follows this list):

- Superblock (on-disk) fdblocks: 78,047,871
- In-core per-AG free blocks, summed: 12,929,277
- statfs bfree: 12,372,598 (slightly lower than that sum because statfs subtracts xfs_fdblocks_unavailable(mp): the internal “set aside” plus the current alloc btree blocks)
- dblocks total: 78,642,688
- Actual used space (as statfs computes it) ≈ (dblocks - logblocks) - bfree ≈ (78,642,688 - 38,399) - 12,372,598 ≈ 66,231,691 blocks ≈ 252 GiB, which matches what df/kubelet reported
- The “optimistic” figure of only ~2.3 GiB used came from trusting the stale lazy-count superblock fdblocks, which hadn't been written back yet

Conclusion: there's no leakage in xfs_fs_statfs. The node really has ~250 GiB consumed; the on-disk superblock counters just haven't been updated yet (lazy superblock counters are amortized and only flushed periodically). This is expected behavior with lazy-count=1, so the kubelet eviction is reacting to actual disk usage, not a statfs bug.
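As a rough userspace cross-check of the numbers above, one can read logblocks from the superblock and sum the per-AG free-block counters from the AGF headers, which are written back far more promptly than the lazily-updated primary-superblock fdblocks. The AG count of 4 below is only an example; read agcount first and loop over however many AGs the filesystem actually has:

xfs_db -r -c "sb 0" -c "p dblocks logblocks agcount" /dev/nvme1n1p1
# dblocks = 78642688
# logblocks = 38399
# agcount = 4   (example value)
for ag in 0 1 2 3; do xfs_db -r -c "agf $ag" -c "p freeblks" /dev/nvme1n1p1; done
# freeblks = ...   (the sum of these should land near the statfs bfree figure, not near the stale sb fdblocks)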

Why the superblock looked “wrong”
With lazy-count enabled, XFS keeps fdblocks/icount/ifree updates mainly in per-AG/percpu counters and delays writing aggregated counts to the primary superblock until:

- Certain sync or unmount events
- Log tail pushes or quota operations
- Explicit sync (sync/fsfreeze)

So xfs_db -r reading the raw sb gave you a stale, overly large fdblocks value.
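To confirm this on a live node, one can force a writeback of the aggregated counters and then re-read the on-disk superblock. fsfreeze is part of util-linux; on Bottlerocket it would presumably have to be run from the admin container (an assumption about tool availability), and note that the freeze briefly blocks writes to /var:

fsfreeze -f /var && fsfreeze -u /var   # freeze/thaw forces the counters into the primary superblock
xfs_db -r -c "sb 0" -c "p fdblocks" /dev/nvme1n1p1
# fdblocks should now be close to the statfs bfree value (~12.4M blocks) instead of ~78M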

mlushpenko · Sep 28 '25 06:09