Step should not succeed when tar fails with error
Using a simple step to restore the build x cache we encountered an error (below). I believe the issue is that the GitHub cache became corrupt. However actions/cache did not catch the error.
- name: Cache layers
uses: actions/cache@v2
with:
path: /tmp/.buildx-cache
key: ${{ runner.os }}-buildx-${{ github.sha }}
restore-keys: |
${{ runner.os }}-buildx-
So only a warning was printed instead of failing the step:
....
/*stdin*\ : Decoding error (36) : Restored data doesn't match checksum
/usr/bin/tar: Child returned status 1
/usr/bin/tar: Error is not recoverable: exiting now
Warning: Tar failed with error: The process '/usr/bin/tar' failed with exit code 2
Maybe such error shouldn't fail the workflow because the result is technically equivalent to not finding a cache, but I believe the cache should at least be invalidated/expired to avoid recurring errors.
I've been facing the same issue with v3 and my only option was to change the cache key in order to create a fresh cache.
cache should at least be invalidated/expired to avoid recurring errors.
In this case the cache was corrupt and only able to be partially restored. buildx blew up in a later step due to this. So as long as the action detected this issue and removed all the cache then that would be an acceptable fix.
Oh, I didn't realize that the cache was partially restored. It didn't cause any issue in my case, but I'd agree that the step should rather fail than leaving inconsistent data behind.
@macropin We currently don't fail the workflow even if cache gets error as not being able to cache or restore doesn't usually make the workflow misbehave. For corrupted cache, only solution currently is to change the cache key and re-run the workflow.
even if cache gets error as not being able to cache or restore doesn't usually make the workflow misbehave
The Docker builder (buildx) does not handle a corrupted cache. It fails with very unclear error messages.
I guess this follows same pattern as where the cache action does not fail the workflow even if no cache found or failed in downloading cache.
Regarding invalidating a cache, today the invalidation would happen with eviction, but yes the eviction timeline can be as long as 7 days. We are looking into a possibility of a deletion REST API would allow someone to do quicker manual fix. Will update if and once this gets done.
I guess this follows same pattern as where the cache action does not fail the workflow even if no cache found or failed in downloading cache.
If actions/cache is not going to raise an error when the cache is corrupt, then saner way to handle this would be to ensure that a corrupt / truncated cache is never restored. The action should restore the cache completely or not at all.
Or the action could report whether the cache was successfully restored or not via an output, that way users could decide for themselves whether to proceed or not.
Or the action could report whether the cache was successfully restored or not via an output, that way users could decide for themselves whether to proceed or not.
That is already possible with cache-hit output. It is set if a cache has been restored with exact matching key. If it is not set, then it is a good idea to run the install/build steps.
The action should restore the cache completely or not at all.
Agree. Any failure in unarchiving should be handled more cleanly.
Btw, there are REST APIs available now to be able to delete a corrupted cache. This can help come out of a situation where a cache is corrupted the only alternative is to update the cache key forcibly.
This issue is stale because it has been open for 200 days with no activity. Leave a comment to avoid closing this issue in 5 days.