Generated image is missing files generated via RUN
Actual behavior
Files generated via a RUN command should be included in the final image, regardless of their file timestamps. This seems not to be the case.
I have written a minimal Dockerfile to demonstrate this:
```dockerfile
# syntax=docker/dockerfile:1
FROM amd64/ubuntu as test
RUN \
    apt update && \
    apt install \
        --no-install-recommends \
        --assume-yes \
        python3.11 && \
    ls -l `which python3.11` && \
    python3.11 --version
```
When building the image, python3.11 is not properly installed in the generated image, although it is clearly present while building.
My build command:
```shell
/kaniko/executor --context /workspace --dockerfile ./Dockerfile --destination <my-repo>:test-tag --snapshot-mode=full --cache=true
```
The output of the last two commands can be seen in the build log:

```
-rwxr-xr-x 1 root root 6890080 Aug 12 2022 /usr/bin/python3.11
Python 3.11.0rc1
```
When the generated image is then run, the file is not found: python3.11 simply does not exist.
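One plausible mechanism (my assumption, not confirmed against kaniko's code): dpkg/apt preserves the upstream mtime of installed files (note the "Aug 12 2022" timestamp above), so a freshly installed file can be *older* than the snapshot baseline, and any "changed since baseline" scan keyed on mtime will miss it. A minimal local sketch of that effect (though note that the report above uses `--snapshot-mode=full`, which should not rely on mtime alone):

```shell
# Sketch (assumption: change detection keyed on mtime, as in a
# time-based snapshot mode). A file carrying a preserved old mtime is
# invisible to a newer-than-baseline scan; a genuinely new file is not.
demo=$(mktemp -d) && cd "$demo"
touch -d '2022-08-12' old-binary   # "installed" file with preserved upstream mtime
touch baseline                     # snapshot baseline taken "now"
sleep 1
echo data > new-file               # file created after the baseline
find . -type f -newer baseline     # prints only ./new-file
```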
To test if this has to do with file timestamps, I have done the following modification:
```dockerfile
# syntax=docker/dockerfile:1
FROM amd64/ubuntu as test
RUN \
    apt update && \
    apt install \
        --no-install-recommends \
        --assume-yes \
        python3.11 && \
    ls -l `which python3.11` && \
    python3.11 --version && \
    touch `which python3.11`
```
In this case, the python3.11 binary is present in the generated image, but since it is not only this binary that goes missing (essentially most files installed via apt do), the image is completely non-functional:
```
docker run --rm -ti <my-repo>:test-tag
root@92196457ce8a:/# python3.11 --version
python3.11: error while loading shared libraries: libexpat.so.1: cannot open shared object file: No such file or directory
```
Note that I have tried various alternatives, with and without `--cache` and with different `--snapshot-mode` values.
Expected behavior
All files are stored in the generated image.
If I build the image using the Dockerfile above via `docker buildx build`, the image works as expected:
```
docker run --rm -ti <my-repo>:test-tag
root@93076a150249:/# which python3.11
/usr/bin/python3.11
root@93076a150249:/# python3.11 --version
Python 3.11.0rc1
```
To Reproduce
Steps to reproduce the behavior:
- Use the Dockerfile above
- Build with kaniko using the command above
- Launch the image, launch python, see the failure (python missing or incorrectly installed)
Additional Information
Kaniko version: v1.22.0

| Description | Yes/No |
|---|---|
| Please check if this is a new feature you are proposing | No |
| Please check if the build works in docker but not in kaniko | Yes |
| Please check if this error is seen when you use --cache flag | Yes |
| Please check if your dockerfile is a multistage dockerfile | No |
FYI, I looked at other tickets with a similar problem (e.g., https://github.com/GoogleContainerTools/kaniko/issues/2336), but either the root cause described in those tickets is different or the proposed workaround did not work for me.
I have tried many different workarounds, and none worked for me (aside from touching every file in the filesystem, which is not an option for me).
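For reference, the brute-force workaround mentioned above (touching every file) would look roughly like this sketch; it assumes change detection is keyed on mtime, and it is clearly not an option for large images:

```dockerfile
# Sketch only: bump the mtime of every file after installing, so a
# timestamp-based snapshot cannot miss them. Slow and fragile.
RUN apt update && \
    apt install --no-install-recommends --assume-yes python3.11 && \
    find / -xdev -type f -exec touch {} + 2>/dev/null || true
```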
Another observation: if I change this to a multi-stage build AND do more than just RUN commands, then it sometimes works. Executing the same build twice in a row, I seem to have about a 50% chance of getting a working container image.
Dockerfile:
```dockerfile
# syntax=docker/dockerfile:1
FROM amd64/ubuntu as stage
RUN \
    apt update && \
    apt install \
        --no-install-recommends \
        --assume-yes \
        python3.11

FROM stage as final
ADD .ignore /.ignore
```
Build command (note that I'm not pushing to a remote registry just to save round-trip time; pushing remotely has the same outcome):
```shell
/kaniko/executor \
    --context /workspace \
    --dockerfile Dockerfile \
    --destination kamiko-test:3 \
    --no-push \
    --tar-path output.tar \
    --target final \
    -v debug
```
Test command:
```shell
docker image rm kamiko-test:3
docker load -i output.tar
docker run --rm -ti kamiko-test:3 bash -c '/usr/bin/python3.11 --version'
```
In some cases we get this result (installation worked):
```
Untagged: kamiko-test:3
Deleted: sha256:c42801f9c6b74e0dd7002f9439d0e2675fddc2070665f5646b0303e5e9277a01
Deleted: sha256:58ee2628caa0ebb2dd0b9ee2893bb7f6a3996ed8b41177a209154b270e2952f5
Deleted: sha256:c6a78351595ae2bb76e7284ec47f720e5b7d7e9f66ffab997d24436d143c491d
e2b5084e6f6a: Loading layer [==================================================>] 49.89MB/49.89MB
e74d10928493: Loading layer [==================================================>] 259B/259B
Loaded image: kamiko-test:3
Python 3.11.0rc1
```
In other cases we get this result (the installed files were not committed to the snapshot/image):
```
Untagged: kamiko-test:3
Deleted: sha256:ac778b382fa91f37cfb3d35e2d56d0a52531fb42082b7e2226e44858b0167f29
Deleted: sha256:a1a681b7fa20e5528304dfe34897ebac67a8f4ff3ecceaf6774445c6fd37fe18
Deleted: sha256:6262b815a55b0dc3bb6679ac18aa94d9aa3fa1074357640627318925a53d05af
e9f9bcb2687e: Loading layer [==================================================>] 6.344MB/6.344MB
f81778963cd0: Loading layer [==================================================>] 252B/252B
Loaded image: kamiko-test:3
bash: line 1: /usr/bin/python3.11: No such file or directory
```
As said, it's random, with roughly a 50% chance of either outcome. Even more weirdly, it seems to alternate between working and failing, as if a cache would corrupt and then un-corrupt itself (note that in these experiments the cache is off).
I have captured the stdout (build command output) and stderr (kaniko debug-verbosity logs) from a successful and a failing build.
The stdout output is essentially identical (aside from the download/timing info from apt).
The stderr debug output is very different, however: the expected binary shows up in one of the logs but not in the other.
Hi @clemenskol , did you find any workaround?
> Hi @clemenskol , did you find any workaround?
unfortunately no. We had to move away from kaniko - it was the only "solution" that worked
Same issue here and I'm pretty sure that we are not the only one having it.....
@anoop142 , anybody able to reproduce on your side ?
For me, the basic case that fails is:

```dockerfile
# Fails
RUN <<EOF
echo "foo" > /home/foo
EOF
RUN grep foo /home/foo
```

The result:

```
grep: /home/foo: No such file or directory
```
While this works:

```dockerfile
# Works
RUN echo "foo" > /home/foo
RUN grep foo /home/foo
```
It seems like kaniko is skipping layers when heredoc (`<<EOF`) syntax is used for RUN.
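Until that is confirmed and fixed, a possible workaround is to avoid the heredoc form and chain the commands in a single RUN, mirroring the "works" case above:

```dockerfile
# Workaround sketch: same commands without the <<EOF heredoc
RUN echo "foo" > /home/foo && \
    grep foo /home/foo
```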
OK; at least that seems not to be the case for @clemenskol.
As far as I'm concerned, I don't use EOF either but, in case this could be a problem, I'm running the build inside a GitLab CI job.
Best, Jérôme
@jrevillard you are right, the skipping of heredoc (EOF) RUN commands is indeed a different issue, #1713.
I seem to be encountering the same problem, only in my case one out of dozens of images is broken. A couple of files from the base image are not available in the final image. It looks as if the last layer is not properly snapshotted (2 MB instead of 150 MB; worth mentioning it is also created by a RUN). All images use the same base image and are built on different machines. The same Dockerfile and the same source files can produce the broken image (a replayed GitLab pipeline from the same source). The images are built with the kaniko-project/executor:v1.21.1-debug Docker image in a GitLab pipeline. When an image is broken, its log is missing the part about ignoring sockets (the rest stays the same):
```
INFO[1181] Taking snapshot of full filesystem...
INFO[1199] Ignoring socket signalapp.00, not adding to tar
INFO[1199] Ignoring socket signalapp.01, not adding to tar
INFO[1199] Ignoring socket signalapp.02, not adding to tar
...
INFO[1574] Pushing image to ....
```
I upgraded to the newest kaniko, 1.23.2-debug, and will observe the results. I don't know what the cause could be, maybe something with the cache, but I don't use any additional flags.
I can't share my Dockerfile and base image, but it's not multistage. This is very difficult to debug, as it happens quite rarely.
Unfortunately, the missing files have appeared again in the latest version of kaniko. I will try to add an image test as the next (pipeline) stage.
> Unfortunately, the missing files have appeared again in the latest version of kaniko.
The biggest issue I see is the lost trust in kaniko. If there is no guarantee that the filesystem is identical (at least semantically) to the one produced by buildx or buildah, I simply can't use kaniko. In production, it is almost impossible to check whether all the needed files are there or not.
I tried to use the `--single-snapshot` flag, because sometimes the error `error building image: error building stage: failed to take snapshot: archive/tar: write too long` appeared, as described here. Adding the flag didn't help, and it still built an image with missing files. I added `RUN ls /file/location` (for the files that were sometimes missing) at the end of the Dockerfile, and out of 300 builds they all look fine (except that there has sometimes been a problem with `archive/tar: write too long`). I will keep observing.
I'm trying to reproduce this issue; does anybody have a very simple reproduction setup? The fewer files that are written, the easier it will be to diagnose the root cause. The reported one, which installs Python, changes/adds so many files that it is harder to diagnose.
`--single-snapshot` didn't solve the issue. I even added `RUN ls /file/location`: it lists the files during the build, but they are not available in the built image. The final broken image is lighter, 1.4 GB instead of 1.6 GB. The files that are missing are present in the source image (`FROM source_image`). This problem occurs once every few dozen builds, so it is difficult to say why it occurs when it does. Just as randomly, the `archive/tar: write too long` problem occurs.
I have changed the pipeline so that I now build the image with a "candidate" tag; the next (GitLab) stage opens this image and runs the `ls /file/location` command, and if everything is OK, I use `crane cp source destination` to push it to the main tag (crane doesn't change the digest). Unfortunately, I can't share my Dockerfile.
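The promote-only-if-verified gate described above can be sketched as follows; everything here is a placeholder (the image names, the file list, and the `verify_files` helper are illustrative, not from the actual pipeline):

```shell
# Sketch of a candidate-then-promote gate. CANDIDATE, FINAL and the
# required file names are placeholders; verify_files is a hypothetical
# helper that fails if any required file is absent from a listing.
verify_files() {
  listing="$1"; shift
  for f in "$@"; do
    case "$listing" in
      *"$f"*) ;;          # required file present in the listing
      *) return 1 ;;      # any missing file fails the gate
    esac
  done
}

# In the real pipeline the listing would come from the candidate image,
# e.g.:  listing=$(docker run --rm "$CANDIDATE" ls /file/location)
# and promotion would be digest-preserving:
#   verify_files "$listing" file-1 file-2 && crane cp "$CANDIDATE" "$FINAL"
```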
Since I cannot reproduce this issue, I am providing my tooling for trying to reproduce it. It might help you, but you will probably need to adapt it for your system.
I am using NetBSD mtree to get a "snapshot" of the built rootfs from "inside" Kaniko build.
This is the Dockerfile of the image to be built (the one used to report this issue + mtree):
```dockerfile
# syntax=docker/dockerfile:1
FROM amd64/ubuntu as test
RUN \
    apt update && \
    apt install \
        --no-install-recommends \
        --assume-yes \
        python3.12 \
        mtree-netbsd && \
    ls -l `which python3.12` && \
    python3.12 --version && \
    md5sum `which python3.12` > python3.12.md5
COPY ./mtree-excludes /tmp/
RUN \
    mtree \
        -c \
        -x \
        -K md5 > rootfs.built.mtree
```
This is a script that builds the image using Kaniko and then instantiates a container from the built image. It then tries to find out whether any files have "disappeared":
```shell
#!/usr/bin/env bash
set -eu

TOOL="finch"
IMG="reproduce-kaniko-3123.tar"
CONT_IMG="/workspace/${IMG}"
LOCAL_IMG="${IMG}"

echo ; echo "*********************"
echo "Building the image..." ; echo
"${TOOL}" run \
    -v $PWD:/workspace \
    gcr.io/kaniko-project/executor:latest \
    --dockerfile /workspace/Dockerfile \
    --no-push \
    --context dir:///workspace/ \
    --tar-path "${CONT_IMG}"

echo ; echo "********************"
echo "Loading the image..." ; echo
"${TOOL}" load \
    -i "${LOCAL_IMG}"

echo ; echo "***********************"
echo "Comparing the rootfs..." ; echo
echo "> python3.12 binary checksum as reported by md5sum from Kaniko"
"${TOOL}" run \
    --rm \
    unset-repo/unset-image-name:latest \
    cat /python3.12.md5
"${TOOL}" run \
    --rm \
    unset-repo/unset-image-name:latest \
    cat /rootfs.built.mtree > rootfs.built.mtree

echo ; echo "> python3.12 binary checksum as reported by mtree from Kaniko"
grep -A 1 "^ python3.12 " rootfs.built.mtree | head -n 2

"${TOOL}" run \
    --rm \
    unset-repo/unset-image-name:latest \
    mtree -f /rootfs.built.mtree > rootfs-changes.mtree \
    || true

echo ; echo "****************"
echo "Missing files..." ; echo
grep "^missing: " rootfs-changes.mtree
```
It runs mtree in the container to find out which changes have happened at the filesystem level (including timestamps, permissions, MD5 checksums, ...). On my system, after over 10 runs, I haven't been able to detect any unexpected changes (apart from Kaniko files being removed).
Let's see if, with this help, someone can provide some more insight into what is going on...
Assuming we ever manage to properly diagnose this issue, find the root cause, and even write a patch to fix it... will we ever see the fix integrated into Kaniko?