
Constantly growing /var/lib/docker/overlay2/

arnegroskurth opened this issue 2 years ago

Description

We have a set of job runners that perform various build- and deployment-related tasks using Docker. After some time (weeks), all of them end up with critical disk usage and have to be re-provisioned, as there does not appear to be a way to release the used disk space via the Docker CLI:

$ du -sh /var/lib/docker/*
72.4M	/var/lib/docker/buildkit
71.2M	/var/lib/docker/containers
4.0K	/var/lib/docker/engine-id
293.9M	/var/lib/docker/image
132.0K	/var/lib/docker/network
65.7G	/var/lib/docker/overlay2
0	/var/lib/docker/plugins
0	/var/lib/docker/runtimes
0	/var/lib/docker/swarm
0	/var/lib/docker/tmp
5.0G	/var/lib/docker/volumes
$ ls -l /var/lib/docker/overlay2/ | wc -l
4204
$ docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          2         2         123.7MB   0B (0%)
Containers      2         2         168B      0B (0%)
Local Volumes   62        0         5.292GB   5.292GB (100%)
Build Cache     0         0         0B        0B
$ docker builder ls
NAME/NODE DRIVER/ENDPOINT STATUS  BUILDKIT             PLATFORMS
default * docker                                       
  default default         running v0.11.7+d3e6c1360f6e linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386
$ docker builder du
Reclaimable:	0B
Total:		0B
$ docker system prune -a -f
Total reclaimed space: 0B
$ docker builder prune -a -f
Total:	0B

Related to #32420, #43586

Reproduce

I'm not sure how to reproduce this, as the usage appears to build up slowly over time.
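
One way to track the buildup (a sketch, assuming cron, GNU coreutils, and the default Docker root dir) is to log overlay2 usage alongside Docker's own accounting at regular intervals, e.g. with a small hourly script:

#!/bin/sh
# /etc/cron.hourly/overlay2-usage (hypothetical path)
# Append a timestamped snapshot of actual overlay2 disk usage next to what
# "docker system df" claims, so growth can later be correlated with build jobs.
{
  date -Is
  du -sx /var/lib/docker/overlay2
  docker system df
} >> /var/log/overlay2-usage.log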

Expected behavior

The used disk space under /var/lib/docker/overlay2/ should show up in some diagnostic docker ... command and should be releasable with some (other) docker ... command. I would expect it to show up under docker system df and/or docker builder du, and for it to be releasable with docker system prune or docker builder prune.

docker version

Client: Docker Engine - Community
 Version:           24.0.7
 API version:       1.43
 Go version:        go1.20.10
 Git commit:        afdd53b
 Built:             Thu Oct 26 09:09:13 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.7
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.10
  Git commit:       311b9ff
  Built:            Thu Oct 26 09:07:45 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.18
  GitCommit:        2456e983eb9e37e47538f59ea18f2043c9a73640
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client: Docker Engine - Community
 Version:    24.0.7
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.21.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 24.0.7
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 2456e983eb9e37e47538f59ea18f2043c9a73640
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.14.0-70.13.1.el9_0.x86_64
 Operating System: Rocky Linux 9.1 (Blue Onyx)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.28GiB
 Name: xxx
 ID: c603ff4f-a1fb-4a1c-b323-75ef6e588528
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: xxx
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional Info

No response

arnegroskurth avatar Jan 17 '24 11:01 arnegroskurth

If a lot of "docker build" is happening on those machines, I wonder if this is related to:

  • https://github.com/moby/moby/issues/46136

That should be fixed in BuildKit v0.12 (the version used in the upcoming Docker v25.0 release). If you have a system to test on (and can reproduce the issue on it), it may be worth checking whether it still reproduces with the v25.0.0 release candidate. Packages for v25.0 are available in the "test" channel of our package repositories on download.docker.com. Those are pre-releases, so make sure to try them in a test environment you're comfortable with, but a GA release will be available soon (we're finishing up some final bits, and may be able to do a v25.0.0 release this week).
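
For example, using the convenience script (a sketch; the --channel flag selects the pre-release repository, so verify the script's options before running it on anything important):

$ curl -fsSL https://get.docker.com -o get-docker.sh
$ sudo sh get-docker.sh --channel test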

thaJeztah avatar Jan 17 '24 11:01 thaJeztah

Thanks for the quick response!

The described reproducer with parallel builds from that issue is indeed analogous to our use case, so I would assume this is the same issue.

We will see if we can confirm this with the RC release.

arnegroskurth avatar Jan 17 '24 13:01 arnegroskurth

Thanks! Yes, happy to hear whether that resolves the issue. From discussions I had with the BuildKit team, it was "too complicated" to backport the related fixes to the v0.11 release, so I'm hoping it's fixed with v0.12 (used in v25).

There are certainly situations where content could be left behind (unclean shutdown, etc.), but some reports caused us some head-scratching: we couldn't place where the content came from (or why it wouldn't be cleaned up), so I'm hoping this is the cause (and that it's now fixed).

thaJeztah avatar Jan 17 '24 14:01 thaJeztah

I am seeing this too with Docker 26.1.4. Which versions do the fixes currently apply to? @thaJeztah

panteparak avatar Oct 04 '24 06:10 panteparak

I've been having this problem since forever. overlay2 just keeps getting bigger and bigger and now has 192 subfolders in it.

I've already run docker system prune -a -f, docker volume prune -a -f, etc., but the folder is still 11 GB in size for no apparent reason.

docker system df -v shows I have only 2 local volumes in use, totaling just 250 kB.
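
One way to check whether those subfolders are still referenced at all (a rough sketch that inspects the engine's internal on-disk layout, which is an implementation detail and may change between versions; assumes the overlay2 driver and the default root dir):

# Image layers record their overlay2 directory name in cache-id files and
# containers record theirs in mount-id files; collect all referenced ids.
$ sudo find /var/lib/docker/image/overlay2/layerdb \( -name cache-id -o -name mount-id \) \
    -exec sh -c 'cat "$1"; echo' _ {} \; | sort -u > /tmp/referenced
# List what actually exists on disk ("l" holds symlinks; "-init" dirs pair
# with a container's mount id, so strip the suffix before comparing).
$ sudo ls /var/lib/docker/overlay2 | grep -v '^l$' | sed 's/-init$//' | sort -u > /tmp/present
# Directories on disk that no image layer or container references:
$ comm -13 /tmp/referenced /tmp/present

Note that BuildKit's cache snapshots also live under overlay2/ but are tracked separately (under /var/lib/docker/buildkit), so legitimate build-cache entries will also show up as "unreferenced" by this check.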

CicerBro avatar Oct 10 '24 09:10 CicerBro

I'm still seeing this with 27.3.1 on machines that start with a cleared /var/lib/docker and freshly re-installed Docker .deb packages on the morning of the 16th every month. I have seen release notes saying that various race conditions were fixed, but for me, 27.3.1 appears to fill the disk just as much as earlier versions of Docker did.

Here's the buildup for the past 14 days on one of the machines:

$ sudo docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          0         0         0B        0B
Containers      0         0         0B        0B
Local Volumes   0         0         0B        0B
Build Cache     0         0         0B        0B
$ sudo du -xhs /var/lib/docker/overlay2
71G     /var/lib/docker/overlay2

and here's the buildup on one of the other machines:

$ sudo docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          7         0         4.685GB   4.685GB (100%)
Containers      0         0         0B        0B
Local Volumes   0         0         0B        0B
Build Cache     873       0         42.75GB   42.75GB
$ sudo du -xhs /var/lib/docker/overlay2
164G    /var/lib/docker/overlay2

It's probably worth mentioning that the underlying storage is hard drives, since that might affect the timing/likelihood of hitting the window for a race condition.

Docker info from the first machine (but all machines have identical configuration):

$ sudo docker info
Client: Docker Engine - Community
 Version:    27.3.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.17.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.7
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 27.3.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 runc version: v1.1.14-0-g2c9f560
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.0-122-generic
 Operating System: Ubuntu 22.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.751GiB
 Name: xxxx
 ID: e7f25c28-6251-4525-9dce-278cfe2ac3d8
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  https://mirror.com/

The first machine has an uptime of 35 days, so I don't think any "unclean shutdown" has happened in the past 14 days since docker was reinstalled.

@thaJeztah Can this get an updated version tag so it's clear it's still happening?

pjonsson avatar Oct 30 '24 14:10 pjonsson

It's really annoying that this issue has existed forever. This is my workaround (for Debian-based systems at least):

https://github.com/docker/for-linux/issues/1423#issuecomment-1958990802

nook24 avatar Nov 22 '24 08:11 nook24

> It's really annoying that this issue has existed forever. This is my workaround (for Debian-based systems at least):
>
> docker/for-linux#1423 (comment)

Personally, I just stop docker and containerd and run rm -rf /var/lib/docker. No need to reinstall it entirely.
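
Roughly (a sketch for systemd-based installs; stopping docker.socket prevents socket activation from restarting the daemon mid-wipe, and this destroys all images, containers, volumes, and build cache):

$ sudo systemctl stop docker docker.socket containerd
$ sudo rm -rf /var/lib/docker
$ sudo systemctl start containerd docker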

alpe12 avatar Nov 22 '24 09:11 alpe12

This is probably your build cache. Maybe try docker buildx prune -a?

cpuguy83 avatar Nov 22 '24 20:11 cpuguy83

@cpuguy83 this is on the first machine from https://github.com/moby/moby/issues/47089#issuecomment-2447415923 (which had /var/lib/docker wiped and the Docker packages re-installed on 15 November, 8 days ago):

$ sudo docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          27        0         23.09GB   23.09GB (100%)
Containers      0         0         0B        0B
Local Volumes   0         0         0B        0B
Build Cache     770       0         29.49GB   29.49GB
$ sudo du -xhs /var/lib/docker/overlay2
62G     /var/lib/docker/overlay2
$ sudo docker buildx prune -af
<many lines of output>
Total:  36GB
$ sudo docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          27        0         23.09GB   23.09GB (100%)
Containers      0         0         0B        0B
Local Volumes   0         0         0B        0B
Build Cache     0         0         0B        0B
$ sudo du -xhs /var/lib/docker/overlay2
29G     /var/lib/docker/overlay2
$ sudo du -xs --si /var/lib/docker/overlay2
31G     /var/lib/docker/overlay2

Does docker buildx prune -af remove something that is not covered by docker system prune -af --volumes when using the default builder in Docker Engine 27.3.1? If so, are there more prune commands hiding under some other subcommand that we should be aware of?

Even if that changed something, I still don't think the numbers add up: du shows we have 29G (or 31G if using powers of 1000 instead of 1024) of data, while docker system df reports 23.09GB of images and 0 bytes of everything else.

pjonsson avatar Nov 23 '24 10:11 pjonsson

du is not a good tool to use for this as it includes items multiple times due to overlay mounts.
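
A quick way to rule that out (a sketch) is to check for active overlay mounts before measuring; both of these should come up empty on an idle machine:

$ findmnt -t overlay
$ mount | grep overlay2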

cpuguy83 avatar Nov 23 '24 16:11 cpuguy83

@cpuguy83 the CI machine was idle when I ran the commands, so there was nothing being built and there were no containers running. Something in /var/lib/docker/overlay2 fills my (and other people's) disks over time even when docker system df reports 0 bytes used. What tool should we use to measure disk usage for /var/lib/docker/overlay2?

pjonsson avatar Nov 24 '24 09:11 pjonsson

@pjonsson It looks like docker system df does not show build cache usage; this is something we should fix. To see that, you can run docker buildx du.

cpuguy83 avatar Nov 25 '24 17:11 cpuguy83

@cpuguy83 the last line from docker system df is labelled "Build Cache", and that size coincides with the private cache size reported by docker buildx du:

$ sudo docker buildx du
<many lines of output>
Shared:         6.139GB
Private:        16.03GB
Reclaimable:    22.17GB
Total:          22.17GB
$ sudo docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          17        0         7.922GB   7.922GB (100%)
Containers      0         0         0B        0B
Local Volumes   0         0         0B        0B
Build Cache     317       0         16.03GB   16.03GB

so I'm not sure what you mean by "missing". If you think there is something missing from that last line, I would be grateful if you fixed it, but I think that is a separate bug from /var/lib/docker/overlay2 filling my 250 GB disk when there is an /etc/docker/daemon.json containing:

{
  "builder": {
    "gc": {
      "enabled": true,
      "defaultKeepStorage": "40GB"
    }
  }
}

and docker system df reports 0 bytes used.
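
(Side note, in case anyone copies this config: as far as I know, changes to the builder GC settings in daemon.json require a daemon restart to take effect, e.g. sudo systemctl restart docker.)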

pjonsson avatar Nov 25 '24 20:11 pjonsson

You are right, I somehow looked right past it this morning when I was looking at this... too early + getting kids out the door I guess.

Can you check if there is anything mounted under /var/lib/docker/overlay2, e.g. mount | grep overlay2?

cpuguy83 avatar Nov 25 '24 21:11 cpuguy83

It's been a few hours, so the numbers for docker system df and docker buildx du are different now, but the Private figure from docker buildx du still agrees with the Build Cache size reported by docker system df. There is nothing mounted under overlay2 right now. I also checked that the other day when you mentioned overlay mounts, and there was nothing mounted at that time either. The machine(s) are always idle when I collect sizes etc. from them.

I'm not rejecting the possibility that there is also some kind of accounting error, because from a user's perspective it looks like Docker loses track of blobs for some unknown reason. What is the Shared part that docker buildx du reports? Image layers that are used by the images listed in docker images? If so, is the Shared part of docker buildx du (partially) accounted for by the "Images" row in docker system df?

From what I understand, it's possible to do debug builds that include some kind of race detector. Are there .deb packages available somewhere with that kind of debug build, so I could install them on the CI machines to get traces that would help you pinpoint a potential race condition?

pjonsson avatar Nov 25 '24 23:11 pjonsson