load sometimes doesn't load
Behaviour
Trying to run a command with a just-built image sometimes fails to find the image:
$ docker run --rm -t -v "${GITHUB_WORKSPACE}:/src/android/apolloui/build/outputs/" muun_android:latest
Unable to find image 'muun_android:latest' locally
docker: Error response from daemon: pull access denied for muun_android, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
The build step completes successfully, with no notable differences in output between successful and failed runs.
Expected behaviour
The muun_android image to be found and run. In https://github.com/muun/apollo/runs/2203961523?check_suite_focus=true it succeeded (see the Inspect step; that build failed for an unrelated reason).
Configuration
- Repository URL (if public): https://github.com/muun/apollo
- Build URL (if public): https://github.com/muun/apollo/runs/2204358021?check_suite_focus=true
name: pr
on: pull_request
jobs:
  pr:
    runs-on: ubuntu-20.04
    steps:
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@154c24e1f33dbb5865a021c99f1318cfebf27b32
        with:
          buildkitd-flags: --debug
      - name: Checkout
        uses: actions/checkout@5a4ac9002d0be2fb38bd78e4b4dbde5606d7042f
      - name: Build
        uses: docker/build-push-action@9379083e426e2e84abb80c8c091f5cdeb7d3fd7a
        with:
          load: true
          tags: muun_android:latest
          file: android/Dockerfile
          context: .
      - name: Inspect
        run: |
          docker images
      - name: Build apollo
        run: |
          docker run --rm -t -v "${GITHUB_WORKSPACE}:/src/android/apolloui/build/outputs/" muun_android:latest
      - name: Upload APK
        uses: actions/upload-artifact@e448a9b857ee2131e752b06002bf0e093c65e571
        with:
          name: apk
          path: apk/prod/release/apolloui-prod-release-unsigned.apk
Logs
This is happening to us too. It's super weird because we have three identical workflows set up (with different image names) - two of them succeed but one of them is constantly failing with the above error.
The workflow file:
name: Docker
on:
  push:
    # Publish `staging` as Docker `latest` image.
    branches:
      - staging
    # Publish `v1.2.3` tags as releases.
    tags:
      - v*
env:
  IMAGE_NAME: ml-intents
jobs:
  # Push image to GitHub Packages.
  # See also https://docs.docker.com/docker-hub/builds/
  push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # This is a separate action that sets up the buildx runner
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1
      # So now we can use GitHub Actions' own caching for Docker layers!
      - name: Cache Docker layers
        uses: actions/cache@v2
        with:
          path: /tmp/.buildx-cache
          key: ${{ runner.os }}-buildx-${{ env.IMAGE_NAME }}-${{ github.sha }}
          restore-keys: |
            ${{ runner.os }}-buildx-${{ env.IMAGE_NAME }}-
      - name: Build image
        uses: docker/build-push-action@v2
        with:
          builder: ${{ steps.buildx.outputs.name }}
          context: .
          file: intents/Dockerfile
          load: true
          tags: ${{ env.IMAGE_NAME }}:latest
          cache-from: type=local,src=/tmp/.buildx-cache
          cache-to: type=local,dest=/tmp/.buildx-cache-new
      - name: Login to GitHub Container Registry
        uses: docker/login-action@v1
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Push image to GitHub Container Registry
        run: |
          IMAGE_ID=ghcr.io/${{ github.repository_owner }}/$IMAGE_NAME
          # Change all uppercase to lowercase
          IMAGE_ID=$(echo $IMAGE_ID | tr '[A-Z]' '[a-z]')
          # Strip git ref prefix from version
          VERSION=$(echo "${{ github.ref }}" | sed -e 's,.*/\(.*\),\1,')
          # Strip "v" prefix from tag name
          [[ "${{ github.ref }}" == "refs/tags/"* ]] && VERSION=$(echo $VERSION | sed -e 's/^v//')
          # Use Docker `latest` tag convention
          [ "$VERSION" == "staging" ] && VERSION=latest
          echo IMAGE_ID=$IMAGE_ID
          echo VERSION=$VERSION
          echo Listing docker images...
          docker image ls
          echo Tagging image...
          docker tag $IMAGE_NAME:latest $IMAGE_ID:$VERSION
          echo Tagged image successfully!
          echo Pushing image...
          docker push $IMAGE_ID:$VERSION
          echo Pushed image successfully!
      -
        # Temp fix
        # https://github.com/docker/build-push-action/issues/252
        # https://github.com/moby/buildkit/issues/1896
        name: Move cache
        run: |
          rm -rf /tmp/.buildx-cache
          mv /tmp/.buildx-cache-new /tmp/.buildx-cache
The runner gets to the docker tag $IMAGE_NAME:latest $IMAGE_ID:$VERSION line and errors out with Error response from daemon: No such image: ml-intents:latest as above. docker image ls does not list the built image either.
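For reference, the VERSION derivation in that push step is plain shell, so it can be checked in isolation. A minimal sketch of that logic (the ref_to_version wrapper is my own naming; the workflow inlines these commands):

```shell
#!/bin/sh
# Sketch of the VERSION derivation used in the push step above.
# (The ref_to_version wrapper is illustrative; the workflow inlines this.)
ref_to_version() {
  ref="$1"
  # Strip the git ref prefix (refs/heads/ or refs/tags/)
  version=$(echo "$ref" | sed -e 's,.*/\(.*\),\1,')
  # Strip a leading "v" from tag names only
  case "$ref" in
    refs/tags/*) version=$(echo "$version" | sed -e 's/^v//') ;;
  esac
  # Map the staging branch to the Docker `latest` tag convention
  [ "$version" = "staging" ] && version=latest
  echo "$version"
}

ref_to_version "refs/tags/v1.2.3"    # prints 1.2.3
ref_to_version "refs/heads/staging"  # prints latest
```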
The two successful workflows have much smaller images (500MB and 2GB), whereas the failing image is a lot bigger (5GB). Could that be an influencing factor here?
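In the meantime, one way to make this failure loud instead of silent is to verify that the image actually landed in the daemon before tagging it. A sketch of such a guard (the ensure_image_loaded helper is my own invention, not part of any action):

```shell
#!/bin/sh
# Hypothetical guard to run between the Build and Push steps: fail the job
# loudly if `load: true` silently dropped the image, instead of waiting for
# `docker tag` to error out.
ensure_image_loaded() {
  image="$1"
  if docker image inspect "$image" >/dev/null 2>&1; then
    echo "image $image is present in the Docker daemon"
  else
    echo "image $image was not loaded; check disk space (df -h, docker buildx du)" >&2
    return 1
  fi
}
```

In the workflow this would be a small `run:` step, e.g. `ensure_image_loaded "$IMAGE_NAME:latest"`.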
@champo @benhjames Cannot repro locally or with GHA. Maybe it fails silently because of insufficient disk space:
Each virtual machine has the same hardware resources available.
- 2-core CPU
- 7 GB of RAM memory
- 14 GB of SSD disk space
You have 14GB at your disposal on the runner (in practice about 9GB, given the pre-installed software):
/dev/sdb1 14G 4.1G 9.0G 32% /mnt
Can you add this step at the end of your workflow (before Move cache for you @benhjames) and give me the output:
- name: Disk
  if: always()
  run: |
    df -h
    docker buildx du
Thanks for investigating @crazy-max! I first added that step and a separate step to list the Docker images, but it still didn't appear to be exported into Docker. The disk space on that run seemed to match yours:
/dev/sdb1 14G 4.1G 9.0G 32% /mnt
I then modified the workflow file to exactly match yours, and the same issue occurred.
Then I re-ran the same job, and this time it exported correctly. This was the first run where Docker had cache available (the previous builds never got a chance to save their cache, since they errored when pushing to GHCR).
I then went back to look at your first run (i.e. without build cache) and noticed that in that particular run it doesn't list the Docker images either. So I have a feeling that if there is no build cache, the export to Docker fails, but if there is build cache, like in your subsequent builds and my last build linked above, it succeeds. Really weird. Hope that helps...?
@benhjames Thanks for your feedback. Yes, actually /var/lib/docker uses the /dev/root filesystem, which is 99% full on your runner, so I presume that's the issue here:
/dev/root 84G 82G 1.2G 99% /
Can you add docker buildx du in the Disk step and give me the output please?
Thanks @crazy-max, I added that command to both Disk steps (and removed the cache action) and the results can be viewed here. Looks indeed like it runs out of disk space and then silently fails loading into Docker.
Is there anything that you think could be done about this to shrink the disk usage after the build step? I notice that docker buildx du without cache lists Reclaimable: 17.71GB which seems like a lot? How come building with the cache takes up much less space?
Sorry for the questions - it would be great to find a solution to this somehow (without reverting to the plain docker build without caching, as I was doing before!)
@benhjames
I notice that
docker buildx du without cache lists Reclaimable: 17.71GB which seems like a lot? How come building with the cache takes up much less space?
These are the subsequent instructions cached by buildx for the current builder. You can get more info by using docker buildx du --verbose. If you use an external cache, only the last stage will be cached, so it takes less space and the image can be loaded.
Is there anything that you think could be done about this to shrink the disk usage after the build step?
You could use a self-hosted runner, and in the near future you will also be able to configure CPU cores, RAM, and disk space for hosted runners (see github/roadmap#161).
Or, more drastically, remove some components pre-installed on the runner in your workflow, like dotnet (~23GB):
- name: Remove dotnet
  run: sudo rm -rf /usr/share/dotnet
Thanks a lot @crazy-max for the help, that's really useful, much appreciated. 🙌
Hi,
Thank you for this thread! I was running into the same issue. I would expect an error log of some kind when disk issues happen and the image cannot be correctly loaded with --load. I couldn't find a buildx issue for this. Is there an issue tracking this somewhere else, or is the error logged and I'm just not finding it?
Thanks!
Hey @cep21, the issue to track in buildx is https://github.com/docker/buildx/issues/593!
❤️ Thanks for the deep look into this! I ended up changing the build approach for other reasons, which I guess accidentally reduced the image size, making the issue disappear.
Hey folks! I believe I'm also hitting this issue - is there currently any workaround other than trying to shrink your image size? I tried sudo rm -rf /usr/share/dotnet, but to no avail. I'm only at 86% disk usage, but load still isn't loading my image into Docker.
@master-bob As discussed in https://github.com/docker/build-push-action/issues/841, I made some tests using the docker driver and the docker-container driver:
FROM alpine
RUN dd if=/dev/zero of=/tmp/output.dat bs=2048M count=1
RUN dd if=/dev/zero of=/tmp/output2.dat bs=2048M count=1
RUN dd if=/dev/zero of=/tmp/output3.dat bs=2048M count=1
RUN uname -a
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        driver:
          - docker
          - docker-container
    steps:
      -
        name: Checkout
        uses: actions/checkout@v3
      -
        name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
        with:
          driver: ${{ matrix.driver }}
          buildkitd-flags: --debug
      -
        name: Disk
        run: |
          df -h
      -
        name: Build and push
        uses: docker/build-push-action@master
        with:
          context: .
          file: ./fat.Dockerfile
          load: true
          tags: |
            foo
      -
        name: List images
        run: |
          docker image ls
      -
        name: Disk
        if: always()
        run: |
          df -h
          docker buildx du
docker driver
fs before build:
Filesystem Size Used Avail Use% Mounted on
/dev/root 84G 55G 29G 66% /
docker image ls:
Run docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
foo latest 1636a6843a99 20 seconds ago 6.45GB
node 18 37b4077cbd8a 11 days ago 997MB
...
fs after build:
Filesystem Size Used Avail Use% Mounted on
/dev/root 84G 61G 23G 73% /
docker-container driver
fs before build:
Filesystem Size Used Avail Use% Mounted on
/dev/root 84G 55G 29G 66% /
docker image ls:
Run docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
foo latest 50f49c8d6cd9 About a minute ago 6.45GB
node 18 37b4077cbd8a 11 days ago 997MB
...
fs after build:
Filesystem Size Used Avail Use% Mounted on
/dev/root 84G 67G 17G 80% /
As you can see, when building with a container builder, Buildx first creates an intermediate tarball and then loads the image into Docker. That would explain the issue, as it requires roughly twice the space (~30GB) in your case.
I suggest using the docker driver in your workflow:
-
  name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v2
  with:
    driver: docker
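To put numbers on the "twice the space" point: with the fat test image above (6.45GB per docker image ls), a container-driver build with load: true briefly needs room for both the exported tarball and the loaded image. A rough back-of-the-envelope sketch:

```shell
# Back-of-the-envelope disk budget for `load: true` with the docker-container
# driver: the exported tarball and the loaded image coexist briefly, so
# budget roughly twice the image size (6.45 is the size in GB reported by
# `docker image ls` for the test image above).
image_gb=6.45
needed_gb=$(awk -v s="$image_gb" 'BEGIN { printf "%.1f", 2 * s }')
echo "budget at least ${needed_gb}GB of free space before the build"
```

This matches the observed df output: the docker driver consumed ~6GB, the docker-container driver ~12GB for the same image.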
~~@tonistiigi @jedevc, I wonder if we could remove the intermediate tarball when the image is loaded to Docker. WDYT?~~
As you can see when building with a container builder, Buildx will first create an intermediate tarball and load the image to Docker so that would explain the issue as it would require twice the space (~30GB) in your case.
I suggest to use the docker driver in your workflow:
Thank you for the in-depth analysis.
~~I do have a question. Without using that driver, my understanding is that when using subsequent build-push-actions it will use the cached version if it is available. By changing the driver would this functionality remain the same?~~ Edit: yes, it appears functionality remains the same.
Edit: I think the dotnet location changed on ubuntu-22, as I didn't see any significant change in space usage when attempting to remove it. So I opted to remove /usr/local/lib/android/sdk (~14GB) and /opt/hostedtoolcache (~9GB).
Abbreviated listing of /opt/hostedtoolcache on ubuntu:latest (22):
489M /opt/hostedtoolcache/PyPy
1.6G /opt/hostedtoolcache/go
5.4G /opt/hostedtoolcache/CodeQL
16K /opt/hostedtoolcache/Java_Temurin-Hotspot_jdk
378M /opt/hostedtoolcache/node
62M /opt/hostedtoolcache/Ruby
1.2G /opt/hostedtoolcache/Python
9.1G /opt/hostedtoolcache
Before removing android and the hostedtoolcache:
Filesystem Size Used Avail Use% Mounted on
/dev/root 84G 54G 30G 65% /
and after
Filesystem Size Used Avail Use% Mounted on
/dev/root 84G 31G 53G 37% /
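Putting the cleanup suggestions from this thread together, here is a sketch of a reusable helper (my own naming; the paths are from the ubuntu-22.04 hosted runner image and may drift between image versions, and the optional prefix argument exists only so the function can be exercised against a scratch directory instead of the real root):

```shell
#!/bin/sh
# free_runner_space: remove large pre-installed toolchains to reclaim disk.
# Paths are from the ubuntu-22.04 hosted runner image and may change between
# image versions. The optional prefix argument makes the function testable
# against a scratch directory.
free_runner_space() {
  root="${1:-}"
  for p in /usr/local/lib/android/sdk /opt/hostedtoolcache /usr/share/dotnet; do
    rm -rf "${root}${p}"
  done
}
```

In a real workflow step you would call free_runner_space with no argument (under sudo), before the build step.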
Just wanted to drop a note that I began experiencing this exact same issue today.
In my workflow I build 3 separate Docker images, all using the load: true parameter. I was also using caching for all the image builds, like so:
with:
  context: ./nginx
  load: true
  tags: ibp_nginx:latest
  cache-from: type=gha
  cache-to: type=gha,mode=max
Today, one of the images randomly stopped working: the build step succeeded, but a docker images -a inspection step showed the image was never added to Docker. I stumbled upon this thread while looking for a solution. We're also using a custom GHA runner with plenty of disk space available, but I tried some of the disk space proposals in this thread to no avail. I also tried deleting my entire GHA repository cache and starting the cache from scratch. No dice.
In the end I noticed this from @crazy-max up above:
As you can see when building with a container builder, Buildx will first create an intermediate tarball and load the image to Docker so that would explain the issue as it would require twice the space (~30GB) in your case.
I suggest to use the docker driver in your workflow:
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v2
  with:
    driver: docker
Using setup-buildx-action@v2 with driver: docker resolved my issue, and all the images are finally being built and available again via load: true. The downside, of course, is that this driver does not support caching, as far as I can tell.