Generated image is missing files generated via RUN
Actual behavior
Files generated via a RUN command should be included in the final image, regardless of their file timestamps. This seems not to be the case.
I have written a minimal Dockerfile to demonstrate this:
```dockerfile
# syntax=docker/dockerfile:1
FROM amd64/ubuntu as test
RUN \
    apt update && \
    apt install \
        --no-install-recommends \
        --assume-yes \
        python3.11 && \
    ls -l `which python3.11` && \
    python3.11 --version
```
When building the image, python3.11 is not properly installed in the generated image, although it is clearly present while building.
My build command:
```shell
/kaniko/executor --context /workspace --dockerfile ./Dockerfile --destination <my-repo>:test-tag --snapshot-mode=full --cache=true
```
The output of the last two commands can be seen in the build log:

```
-rwxr-xr-x 1 root root 6890080 Aug 12 2022 /usr/bin/python3.11
Python 3.11.0rc1
```
When the generated image is then run, the file is not found: python3.11 simply does not exist.
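One plausible mechanism (my assumption, not confirmed against kaniko's code): dpkg/apt preserves the upstream mtime of installed files (note the "Aug 12 2022" timestamp above), so a freshly installed file can be *older* than the snapshot baseline, and any "changed since baseline" scan keyed on mtime will miss it. A minimal local sketch of that effect (though note that the report above uses `--snapshot-mode=full`, which should not rely on mtime alone):

```shell
# Sketch (assumption: change detection keyed on mtime, as in a
# time-based snapshot mode). A file carrying a preserved old mtime is
# invisible to a newer-than-baseline scan; a genuinely new file is not.
demo=$(mktemp -d) && cd "$demo"
touch -d '2022-08-12' old-binary   # "installed" file with preserved upstream mtime
touch baseline                     # snapshot baseline taken "now"
sleep 1
echo data > new-file               # file created after the baseline
find . -type f -newer baseline     # prints only ./new-file
```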
To test if this has to do with file timestamps, I have done the following modification:
```dockerfile
# syntax=docker/dockerfile:1
FROM amd64/ubuntu as test
RUN \
    apt update && \
    apt install \
        --no-install-recommends \
        --assume-yes \
        python3.11 && \
    ls -l `which python3.11` && \
    python3.11 --version && \
    touch `which python3.11`
```
In this case, the python3.11 binary is present in the generated image, but since it is not only this binary that goes missing (essentially most files installed via apt do), the image is completely non-functional:
```
docker run --rm -ti <my-repo>:test-tag
root@92196457ce8a:/# python3.11 --version
python3.11: error while loading shared libraries: libexpat.so.1: cannot open shared object file: No such file or directory
```
Note that I have tried various alternatives, with and without `--cache` and with different `--snapshot-mode` values.
Expected behavior
All files are stored in the generated image.
If I build the image using the Dockerfile above via `docker buildx build`, the image works as expected:
```
docker run --rm -ti <my-repo>:test-tag
root@93076a150249:/# which python3.11
/usr/bin/python3.11
root@93076a150249:/# python3.11 --version
Python 3.11.0rc1
```
To Reproduce
Steps to reproduce the behavior:
- Use the Dockerfile above
- Build with kaniko using the command above
- Launch the image, launch python, see the failure (python missing or incorrectly installed)
Additional Information
Kaniko version: v1.22.0

| Description | Yes/No |
|---|---|
| Please check if this is a new feature you are proposing | No |
| Please check if the build works in docker but not in kaniko | Yes |
| Please check if this error is seen when you use --cache flag | Yes |
| Please check if your dockerfile is a multistage dockerfile | No |
FYI, I looked at other tickets with a similar problem (e.g., https://github.com/GoogleContainerTools/kaniko/issues/2336), but either the root cause described in those tickets is different or the proposed workaround did not work for me.
I have tried many different workarounds, and none worked for me (aside from touching every file in the filesystem, which is not an option for me).
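For reference, the brute-force workaround mentioned above (touching every file) would look roughly like this sketch; it assumes change detection is keyed on mtime, and it is clearly not an option for large images:

```dockerfile
# Sketch only: bump the mtime of every file after installing, so a
# timestamp-based snapshot cannot miss them. Slow and fragile.
RUN apt update && \
    apt install --no-install-recommends --assume-yes python3.11 && \
    find / -xdev -type f -exec touch {} + 2>/dev/null || true
```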
Another observation: if I change this to a multi-stage build AND do more than just RUN commands, then it sometimes works. Executing the same build twice in a row, I seem to have about a 50% chance of getting a working container image.
Dockerfile:
```dockerfile
# syntax=docker/dockerfile:1
FROM amd64/ubuntu as stage
RUN \
    apt update && \
    apt install \
        --no-install-recommends \
        --assume-yes \
        python3.11

FROM stage as final
ADD .ignore /.ignore
```
Build command (note that I'm not pushing to a remote registry just to save round-trip time; pushing remotely has the same outcome):
```shell
/kaniko/executor \
    --context /workspace \
    --dockerfile Dockerfile \
    --destination kamiko-test:3 \
    --no-push \
    --tar-path output.tar \
    --target final \
    -v debug
```
Test command:
```shell
docker image rm kamiko-test:3
docker load -i output.tar
docker run --rm -ti kamiko-test:3 bash -c '/usr/bin/python3.11 --version'
```
In some cases we get this result (installation worked):
```
Untagged: kamiko-test:3
Deleted: sha256:c42801f9c6b74e0dd7002f9439d0e2675fddc2070665f5646b0303e5e9277a01
Deleted: sha256:58ee2628caa0ebb2dd0b9ee2893bb7f6a3996ed8b41177a209154b270e2952f5
Deleted: sha256:c6a78351595ae2bb76e7284ec47f720e5b7d7e9f66ffab997d24436d143c491d
e2b5084e6f6a: Loading layer [==================================================>] 49.89MB/49.89MB
e74d10928493: Loading layer [==================================================>] 259B/259B
Loaded image: kamiko-test:3
Python 3.11.0rc1
```
In other cases we get this result (the installed files were not committed to the snapshot/image):
```
Untagged: kamiko-test:3
Deleted: sha256:ac778b382fa91f37cfb3d35e2d56d0a52531fb42082b7e2226e44858b0167f29
Deleted: sha256:a1a681b7fa20e5528304dfe34897ebac67a8f4ff3ecceaf6774445c6fd37fe18
Deleted: sha256:6262b815a55b0dc3bb6679ac18aa94d9aa3fa1074357640627318925a53d05af
e9f9bcb2687e: Loading layer [==================================================>] 6.344MB/6.344MB
f81778963cd0: Loading layer [==================================================>] 252B/252B
Loaded image: kamiko-test:3
bash: line 1: /usr/bin/python3.11: No such file or directory
```
As said, it's random, with roughly a 50% chance of either outcome. Even more weirdly, it seems to alternate between working and failing, as if a cache would corrupt and then un-corrupt itself (note that in these experiments the cache is off).
I have captured the stdout (build command output) and stderr (kaniko debug-verbosity logs) from a successful and a failing build.
The stdout output is essentially identical (aside from the download/timing info from apt).
The stderr debug output is very different, however: the expected binary shows up in one of the logs but not in the other.
Hi @clemenskol , did you find any workaround?
> Hi @clemenskol , did you find any workaround?
unfortunately no. We had to move away from kaniko - it was the only "solution" that worked
Same issue here and I'm pretty sure that we are not the only one having it.....
@anoop142 , anybody able to reproduce on your side ?
For me, the basic case that fails is:

```dockerfile
# Fails
RUN <<EOF
echo "foo" > /home/foo
EOF
RUN grep foo /home/foo
```

The result:

```
grep: /home/foo: No such file or directory
```
While this works:

```dockerfile
# Works
RUN echo "foo" > /home/foo
RUN grep foo /home/foo
```
It seems like kaniko is skipping layers when heredoc (`<<EOF`) syntax is used for RUN.
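Until that is confirmed and fixed, a possible workaround is to avoid the heredoc form and chain the commands in a single RUN, mirroring the "works" case above:

```dockerfile
# Workaround sketch: same commands without the <<EOF heredoc
RUN echo "foo" > /home/foo && \
    grep foo /home/foo
```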
OK; at least that seems not to be the case for @clemenskol.
As far as I'm concerned, I don't use EOF either but, in case this could be a problem, I'm running the build inside a GitLab CI job.
Best, Jérôme
@jrevillard you are right, the skipping of heredoc (EOF) RUN commands is indeed a different issue, #1713.
I seem to be encountering the same problem, only in my case one out of dozens of images is broken. A couple of files from the base image are not available in the final image. It looks as if the last layer is not properly snapshotted (2 MB instead of 150 MB; worth mentioning it is also created by a RUN). All images use the same base image and are built on different machines. The same Dockerfile and the same source files can produce the broken image (a replayed GitLab pipeline from the same source). The images are built with the kaniko-project/executor:v1.21.1-debug Docker image in a GitLab pipeline. When an image is broken, its log is missing the part about ignoring sockets (the rest stays the same):
```
INFO[1181] Taking snapshot of full filesystem...
INFO[1199] Ignoring socket signalapp.00, not adding to tar
INFO[1199] Ignoring socket signalapp.01, not adding to tar
INFO[1199] Ignoring socket signalapp.02, not adding to tar
...
INFO[1574] Pushing image to ....
```
I upgraded to the newest kaniko, 1.23.2-debug, and will observe the results. I don't know what the cause could be, maybe something with the cache, but I don't use any additional flags.
I can't share my Dockerfile and base image, but it's not multistage. This is very difficult to debug, as it happens quite rarely.
Unfortunately, the missing files have appeared again in the latest version of kaniko. I will try to add an image test as the next (pipeline) stage.
> Unfortunately, the missing files have appeared again in the latest version of kaniko.
The biggest issue I see is the lost trust in kaniko. If there is no guarantee that the filesystem is identical (at least semantically) to the one produced by buildx or buildah, I simply can't use kaniko. In production, it is almost impossible to check whether all the needed files are there or not.
I tried to use the `--single-snapshot` flag, because sometimes the error `error building image: error building stage: failed to take snapshot: archive/tar: write too long` appeared, as described here. Adding the flag didn't help, and it still built an image with missing files. I added `RUN ls /file/location` (for the files that were sometimes missing) at the end of the Dockerfile, and out of 300 builds they all look fine (except that there has sometimes been a problem with `archive/tar: write too long`). I will keep observing.
I'm trying to reproduce this issue; does anybody have a very simple reproduction setup? The fewer files that are written, the easier it will be to diagnose the root cause. The reported one, which installs Python, changes/adds so many files that it is harder to diagnose.
`--single-snapshot` didn't solve the issue. I even added `RUN ls /file/location`: it lists the files during the build, but they are not available in the built image. The final broken image is lighter, 1.4 GB instead of 1.6 GB. The files that are missing are present in the source image (`FROM source_image`). This problem occurs once every few dozen builds, so it is difficult to say why it occurs when it does. Just as randomly, the `archive/tar: write too long` problem occurs.
I have changed the pipeline so that I now build the image with a "candidate" tag; the next (GitLab) stage opens this image and runs the `ls /file/location` command, and if everything is OK, I use `crane cp source destination` to push it to the main tag (crane doesn't change the digest). Unfortunately, I can't share my Dockerfile.
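The promote-only-if-verified gate described above can be sketched as follows; everything here is a placeholder (the image names, the file list, and the `verify_files` helper are illustrative, not from the actual pipeline):

```shell
# Sketch of a candidate-then-promote gate. CANDIDATE, FINAL and the
# required file names are placeholders; verify_files is a hypothetical
# helper that fails if any required file is absent from a listing.
verify_files() {
  listing="$1"; shift
  for f in "$@"; do
    case "$listing" in
      *"$f"*) ;;          # required file present in the listing
      *) return 1 ;;      # any missing file fails the gate
    esac
  done
}

# In the real pipeline the listing would come from the candidate image,
# e.g.:  listing=$(docker run --rm "$CANDIDATE" ls /file/location)
# and promotion would be digest-preserving:
#   verify_files "$listing" file-1 file-2 && crane cp "$CANDIDATE" "$FINAL"
```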
Since I cannot reproduce this issue, I am providing my tooling for trying to reproduce it. It might help you, but you will probably need to adapt it for your system.
I am using NetBSD mtree to get a "snapshot" of the built rootfs from "inside" Kaniko build.
This is the Dockerfile of the image to be built (the one used to report this issue + mtree):
```dockerfile
# syntax=docker/dockerfile:1
FROM amd64/ubuntu as test
RUN \
    apt update && \
    apt install \
        --no-install-recommends \
        --assume-yes \
        python3.12 \
        mtree-netbsd && \
    ls -l `which python3.12` && \
    python3.12 --version && \
    md5sum `which python3.12` > python3.12.md5
COPY ./mtree-excludes /tmp/
RUN \
    mtree \
        -c \
        -x \
        -K md5 > rootfs.built.mtree
```
This is a script that builds the image using Kaniko and then instantiates a container from the built image. It then tries to find out whether any files have "disappeared":
```shell
#!/usr/bin/env bash
set -eu

TOOL="finch"
IMG="reproduce-kaniko-3123.tar"
CONT_IMG="/workspace/${IMG}"
LOCAL_IMG="${IMG}"

echo ; echo "*********************"
echo "Building the image..." ; echo
"${TOOL}" run \
    -v $PWD:/workspace \
    gcr.io/kaniko-project/executor:latest \
    --dockerfile /workspace/Dockerfile \
    --no-push \
    --context dir:///workspace/ \
    --tar-path "${CONT_IMG}"

echo ; echo "********************"
echo "Loading the image..." ; echo
"${TOOL}" load \
    -i "${LOCAL_IMG}"

echo ; echo "***********************"
echo "Comparing the rootfs..." ; echo
echo "> python3.12 binary checksum as reported by md5sum from Kaniko"
"${TOOL}" run \
    --rm \
    unset-repo/unset-image-name:latest \
    cat /python3.12.md5
"${TOOL}" run \
    --rm \
    unset-repo/unset-image-name:latest \
    cat /rootfs.built.mtree > rootfs.built.mtree

echo ; echo "> python3.12 binary checksum as reported by mtree from Kaniko"
grep -A 1 "^ python3.12 " rootfs.built.mtree | head -n 2

"${TOOL}" run \
    --rm \
    unset-repo/unset-image-name:latest \
    mtree -f /rootfs.built.mtree > rootfs-changes.mtree \
    || true

echo ; echo "****************"
echo "Missing files..." ; echo
grep "^missing: " rootfs-changes.mtree
```
It runs mtree in the container to find out which changes have happened at the filesystem level (including timestamps, permissions, MD5 checksums, ...). On my system, after over 10 runs, I haven't been able to detect any unexpected changes (apart from Kaniko files being removed).
Let's see if, with this help, someone can provide some more insight into what is going on...
Assuming we ever manage to properly diagnose this issue, find the root cause, and even write a patch to fix it... will we ever see the fix integrated into Kaniko?