Constantly growing /var/lib/docker/overlay2/
Description
We have a set of job-runners that perform various build- and deployment-related tasks using docker. After some time (weeks), all of them end up with critical disk usage and have to be re-provisioned as there does not appear to be a way to release the used disk space via the docker CLI:
$ du -sh /var/lib/docker/*
72.4M /var/lib/docker/buildkit
71.2M /var/lib/docker/containers
4.0K /var/lib/docker/engine-id
293.9M /var/lib/docker/image
132.0K /var/lib/docker/network
65.7G /var/lib/docker/overlay2
0 /var/lib/docker/plugins
0 /var/lib/docker/runtimes
0 /var/lib/docker/swarm
0 /var/lib/docker/tmp
5.0G /var/lib/docker/volumes
$ ls -l /var/lib/docker/overlay2/ | wc -l
4204
$ docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 2 2 123.7MB 0B (0%)
Containers 2 2 168B 0B (0%)
Local Volumes 62 0 5.292GB 5.292GB (100%)
Build Cache 0 0 0B 0B
$ docker builder ls
NAME/NODE DRIVER/ENDPOINT STATUS BUILDKIT PLATFORMS
default * docker
default default running v0.11.7+d3e6c1360f6e linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386
$ docker builder du
Reclaimable: 0B
Total: 0B
$ docker system prune -a -f
Total reclaimed space: 0B
$ docker builder prune -a -f
Total: 0B
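For context, here is a rough way to check which overlay2 directories are still referenced by image or container layers (a best-effort sketch assuming the standard overlay2/layerdb layout; directories owned by BuildKit are not tracked there, so unreferenced names are only candidates and nothing should be removed manually while dockerd is running):
# Best-effort sketch, assuming the standard overlay2/layerdb layout.
# BuildKit-owned snapshots are NOT tracked here, so unreferenced names are
# candidates only -- never delete anything manually while dockerd is running.
cd /var/lib/docker
# overlay2 directory names referenced by image layers and container layers
for f in image/overlay2/layerdb/sha256/*/cache-id \
         image/overlay2/layerdb/mounts/*/mount-id \
         image/overlay2/layerdb/mounts/*/init-id; do
  [ -f "$f" ] && { cat "$f"; echo; }
done | sort -u > /tmp/referenced
# directory names that actually exist on disk ("l" only holds short symlinks)
ls overlay2 | grep -v '^l$' | sort > /tmp/on-disk
# names on disk that no image or container layer references
comm -23 /tmp/on-disk /tmp/referenced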
Related to #32420, #43586
Reproduce
I'm not sure, as the usage appears to build up slowly over time.
Expected behavior
The disk space used under /var/lib/docker/overlay2/ should show up in some diagnostic docker ... command and should be releasable with some docker ... command.
Concretely, I would expect it to show up under docker system df and/or docker builder du, and to be releasable with docker system prune or docker builder prune.
docker version
Client: Docker Engine - Community
Version: 24.0.7
API version: 1.43
Go version: go1.20.10
Git commit: afdd53b
Built: Thu Oct 26 09:09:13 2023
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 24.0.7
API version: 1.43 (minimum version 1.12)
Go version: go1.20.10
Git commit: 311b9ff
Built: Thu Oct 26 09:07:45 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.18
GitCommit: 2456e983eb9e37e47538f59ea18f2043c9a73640
runc:
Version: 1.1.4
GitCommit: v1.1.4-0-g5fd4c4d
docker-init:
Version: 0.19.0
GitCommit: de40ad0
docker info
Client: Docker Engine - Community
Version: 24.0.7
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.11.2
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.21.0
Path: /usr/libexec/docker/cli-plugins/docker-compose
scan: Docker Scan (Docker Inc.)
Version: v0.23.0
Path: /usr/libexec/docker/cli-plugins/docker-scan
Server:
Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 3
Server Version: 24.0.7
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 2456e983eb9e37e47538f59ea18f2043c9a73640
runc version: v1.1.4-0-g5fd4c4d
init version: de40ad0
Security Options:
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.14.0-70.13.1.el9_0.x86_64
Operating System: Rocky Linux 9.1 (Blue Onyx)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.28GiB
Name: xxx
ID: c603ff4f-a1fb-4a1c-b323-75ef6e588528
Docker Root Dir: /var/lib/docker
Debug Mode: false
Username: xxx
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Additional Info
No response
If a lot of "docker build" is happening on those machines, I wonder if this is related to:
- https://github.com/moby/moby/issues/46136
That should be fixed in BuildKit v0.12 (the version used in the upcoming Docker v25.0 release). If you have a system to test on (and can reproduce the problem there), it may be worth checking whether it still reproduces with the v25.0.0 release candidate. Packages for v25.0 are available in the "test" channel of our package repositories on download.docker.com. Those are pre-releases, so make sure to try them only in a test environment you're comfortable experimenting with, but a "ga" release will be available soon (we're finishing up some final bits, and may be able to do a v25.0.0 release this week).
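For reference, on an RPM-based system like the Rocky Linux 9 machine above, enabling the test channel should look roughly like the sketch below (repo and package names as per the install instructions on download.docker.com; double-check the steps for your distro):
# Rough sketch for RPM-based distros; the docker-ce.repo file from
# download.docker.com ships a disabled "docker-ce-test" repo section.
sudo dnf config-manager --set-enabled docker-ce-test
sudo dnf update docker-ce docker-ce-cli containerd.io docker-buildx-plugin
# switch back to stable-only afterwards
sudo dnf config-manager --set-disabled docker-ce-test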
Thanks for the quick response!
The described reproducer with parallel builds from that issue is indeed analogous to our use-case. I would assume this to be the same issue.
We will see if we can confirm this with the RC release.
Thanks! Yes, happy to hear whether that resolves the issue. From a discussion I had with the BuildKit team, it was "too complicated" to backport the related fixes to the v0.11 release, so I'm hoping it's fixed with v0.12 (used in v25).
There are certainly situations where content could be left behind (unclean shutdown, etc.), but some reports caused us head-scratching and we couldn't place where the content came from (or why it wouldn't be cleaned up), so I'm hoping this is the cause (and that it's now fixed).
I am seeing this too with Docker 26.1.4. What versions do the patches currently apply to? @thaJeztah
I've been having this problem since forever. overlay2 just keeps getting bigger and bigger and has 192 subfolders in it.
I've already run docker system prune -a -f, docker volume prune -a -f, etc., but the folder is still 11GB in size for no apparent reason.
docker system df -v shows I have only 2 local volumes in use totaling just 250 kB.
I'm still seeing this with 27.3.1, starting from a cleared /var/lib/docker and freshly re-installed docker deb packages on the morning of the 16th of every month. I have seen release notes saying that various race conditions were fixed, but for me, 27.3.1 appears to fill the disk just as much as earlier versions of Docker did.
Here's the buildup for the past 14 days on one of the machines:
$ sudo docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 0 0 0B 0B
Containers 0 0 0B 0B
Local Volumes 0 0 0B 0B
Build Cache 0 0 0B 0B
$ sudo du -xhs /var/lib/docker/overlay2
71G /var/lib/docker/overlay2
and here's the buildup on one of the other machines:
$ sudo docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 7 0 4.685GB 4.685GB (100%)
Containers 0 0 0B 0B
Local Volumes 0 0 0B 0B
Build Cache 873 0 42.75GB 42.75GB
$ sudo du -xhs /var/lib/docker/overlay2
164G /var/lib/docker/overlay2
It's probably worth mentioning that the underlying storage is hard drives, since that might affect the timing/likelihood of hitting the window for a race condition.
Docker info from the first machine (but all machines have identical configuration):
$ sudo docker info
Client: Docker Engine - Community
Version: 27.3.1
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.17.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.29.7
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: 27.3.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
runc version: v1.1.14-0-g2c9f560
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.15.0-122-generic
Operating System: Ubuntu 22.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.751GiB
Name: xxxx
ID: e7f25c28-6251-4525-9dce-278cfe2ac3d8
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
https://mirror.com/
The first machine has an uptime of 35 days, so I don't think any "unclean shutdown" has happened in the past 14 days since docker was reinstalled.
@thaJeztah Can this get an updated version tag so it's clear it's still happening?
It's really annoying that this issue has existed forever. This is my workaround (for Debian-based systems at least):
https://github.com/docker/for-linux/issues/1423#issuecomment-1958990802
It's really annoying that this issue has existed forever. This is my workaround (for Debian-based systems at least)
Personally I just stop docker and containerd and run rm -rf /var/lib/docker. No need to reinstall it entirely.
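Roughly like this (assuming a systemd-based install; it throws away all images, containers, volumes and caches, so only suitable for disposable machines):
# Wipes all Docker state -- only for disposable/CI machines.
sudo systemctl stop docker docker.socket containerd
sudo rm -rf /var/lib/docker
sudo systemctl start containerd docker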
This is probably your build cache.
Maybe try docker buildx prune -a?
@cpuguy83 this is on the first machine from https://github.com/moby/moby/issues/47089#issuecomment-2447415923 (which had /var/lib/docker wiped and the docker packages re-installed on 15 November, 8 days ago):
$ sudo docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 27 0 23.09GB 23.09GB (100%)
Containers 0 0 0B 0B
Local Volumes 0 0 0B 0B
Build Cache 770 0 29.49GB 29.49GB
$ sudo du -xhs /var/lib/docker/overlay2
62G /var/lib/docker/overlay2
$ sudo docker buildx prune -af
<many lines of output>
Total: 36GB
$ sudo docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 27 0 23.09GB 23.09GB (100%)
Containers 0 0 0B 0B
Local Volumes 0 0 0B 0B
Build Cache 0 0 0B 0B
$ sudo du -xhs /var/lib/docker/overlay2
29G /var/lib/docker/overlay2
$ sudo du -xs --si /var/lib/docker/overlay2
31G /var/lib/docker/overlay2
Does docker buildx prune -af remove something that is not covered by docker system prune -af --volumes when using the default builder in Docker Engine 27.3.1? If so, are there more prune commands hiding under some other subcommand that we should be aware of?
Even if that changed something, I still don't think the numbers add up: du shows 29G (or 31G if using powers of 1000 instead of 1024) of data, while docker system df reports 23.09GB of images and 0 bytes of everything else.
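For reference, these are the prune entry points I'm aware of from the CLI reference (listing them in case I'm missing one):
docker container prune -f           # stopped containers
docker image prune -af              # unused images
docker volume prune -af             # unused volumes (named ones too with -a)
docker network prune -f             # unused networks
docker builder prune -af            # build cache (classic entry point)
docker buildx prune -af             # build cache (buildx entry point)
docker system prune -af --volumes   # combined prune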
du is not a good tool to use for this as it includes items multiple times due to overlay mounts.
@cpuguy83 the CI machine was idle when I ran the commands, so there was nothing being built and there were no containers running. Something in /var/lib/docker/overlay2 fills my (and other people's) disk(s) over time even when docker system df reports 0 bytes used, so what tool should we use to measure disk usage for /var/lib/docker/overlay2?
@pjonsson It looks like docker system df does not show build cache usage; this is something we should fix.
To see that, you can run docker buildx du.
@cpuguy83 the last line from docker system df is labelled "Build Cache", and that size coincides with the private cache size reported by docker buildx du:
$ sudo docker buildx du
<many lines of output>
Shared: 6.139GB
Private: 16.03GB
Reclaimable: 22.17GB
Total: 22.17GB
$ sudo docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 17 0 7.922GB 7.922GB (100%)
Containers 0 0 0B 0B
Local Volumes 0 0 0B 0B
Build Cache 317 0 16.03GB 16.03GB
so I'm not sure what you mean by "missing". If you think something is missing from that last line, I would be grateful if you could fix it, but I think that is a separate bug from /var/lib/docker/overlay2 filling my 250 GB disk when there is an /etc/docker/daemon.json containing:
{
"builder": {
"gc": {
"enabled": true,
"defaultKeepStorage": "40GB"
}
}
}
and docker system df reports 0 bytes used.
You are right, I somehow looked right past it this morning when I was looking at this... too early + getting kids out the door I guess.
Can you check if there is anything mounted under /var/lib/docker/overlay2, e.g. mount | grep overlay2?
It's been a few hours, so the numbers for docker system df and docker buildx du are different now, but the private du part is still in agreement with the cache size reported by df. There is nothing mounted under overlay2 right now. I also checked that the other day when you mentioned overlay mounts, and there was nothing mounted at that time either. The machine(s) are always idle when I get sizes/etc. from them.
I'm not rejecting the possibility that there is also some kind of accounting error, because from a user perspective it looks like Docker loses track of blobs for some unknown reason. What is the shared part that docker buildx du reports? Is it image layers that are used by the images listed in docker images? If so, is the shared part of docker buildx du (partially) accounted for by the "Images" row in docker system df?
From what I understand, it's possible to do debug builds that include some kind of race detector. Are there .deb packages available somewhere with that kind of debug build, so I could download those packages, install them on the CI machines, and get traces that would help you pinpoint a potential race condition?
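In the meantime, my understanding (an assumption on my part, please correct me) is that sending SIGUSR1 to dockerd makes it dump goroutine stack traces to a file whose location is printed in the daemon log, so I could collect those the next time a machine is in this state:
# Assumption: dockerd writes a goroutine stack dump when it receives SIGUSR1,
# and logs the path of the dump file.
sudo kill -s USR1 "$(pidof dockerd)"
sudo journalctl -u docker --since "-5 min" | grep -i stack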