Malformed cache entries when running multiple replicas
Hello 👋
We get the following error during cache load in our jobs:
```
/*stdin*\ : Read error (39) : premature end
/usr/bin/tar: Unexpected EOF in archive
/usr/bin/tar: Unexpected EOF in archive
/usr/bin/tar: Error is not recoverable: exiting now
Warning: Failed to restore: "/usr/bin/tar" failed with error: The process '/usr/bin/tar' failed with exit code 2
```
The error looks somewhat similar to the one reported in https://github.com/falcondev-oss/github-actions-cache-server/issues/54, but it still differs, and the cause seems to be different (at least from my testing).
Some information on our setup:
- we run the cache server in a Kubernetes cluster
- we run 3 replicas of the cache server
- we use a Postgres database (in-cluster, via https://github.com/zalando/postgres-operator)
- we use MinIO S3 for storage (in-cluster, via https://github.com/minio/operator)
As we only face these problems in some of our projects, I experimented a bit to narrow down the cause. These are my findings:
It seems like the stored cache archive is actually incomplete
- For the affected cache entries, the object size reported by MinIO does not match the cache size reported in the job logs on cache upload (see the size-check sketch below)
- The cache size reported in the job logs on cache download is less than the size reported on upload
- When the size reported by MinIO matches the size in the upload logs, the error does not occur
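To make that comparison easy to repeat, here is a minimal sketch using the MinIO JS client (endpoint, credentials, bucket and object key are placeholders for our setup) that checks a stored cache object against the byte count printed in the upload log:

```ts
import * as Minio from 'minio'

// Placeholder in-cluster endpoint and credentials.
const client = new Minio.Client({
  endPoint: 'minio.minio-tenant.svc.cluster.local',
  port: 9000,
  useSSL: false,
  accessKey: process.env.MINIO_ACCESS_KEY ?? '',
  secretKey: process.env.MINIO_SECRET_KEY ?? '',
})

// Compare the object size in MinIO with the size the upload job reported.
async function checkCacheObject(bucket: string, key: string, expectedBytes: number) {
  const stat = await client.statObject(bucket, key)
  console.log(`stored=${stat.size} expected=${expectedBytes}`)
  if (stat.size < expectedBytes)
    console.warn('stored object is smaller than the uploaded archive -> truncated cache entry')
}

checkCacheObject('gha-cache', 'some-cache-key', 123456789).catch(console.error)
```

For the affected entries this reports a stored size smaller than the uploaded one; for unaffected entries the two values match.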
It seems to be related to the replicated setup
- when we scale the replicas down to 1 (and delete & recreate the existing cache entries) the error does not occur (although with 2 replicas it occurs nearly every time)
It seems to be related to cache size / multipart uploads
- all jobs on which we are facing these problems are (trying to) cache archives larger than 64 MB, which is also the default part size for multipart uploads in the MinIO client (see the partSize option on https://min.io/docs/minio/linux/developers/javascript/API.html#constructor and the sketch below)
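For context, here is a sketch of how that part size can be changed on the client side for testing (same placeholder endpoint and credentials as above; whether and how the cache server passes this option through is an assumption I have not verified):

```ts
import * as Minio from 'minio'

// `partSize` controls the part size for multipart uploads; the documented default is 64 MiB.
// Raising it is one way to test whether the 64 MB boundary is what triggers the corruption.
const client = new Minio.Client({
  endPoint: 'minio.minio-tenant.svc.cluster.local',
  port: 9000,
  useSSL: false,
  accessKey: process.env.MINIO_ACCESS_KEY ?? '',
  secretKey: process.env.MINIO_SECRET_KEY ?? '',
  partSize: 128 * 1024 * 1024, // 128 MiB instead of the 64 MiB default
})
```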
My suspicion is that this could be caused by the in-memory uploadFileBuffers (https://github.com/falcondev-oss/github-actions-cache-server/blob/dev/lib/storage/index.ts#L38), which might lead to problems when a cache upload is chunked and the chunks are sent to different replicas of the cache server.
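To illustrate the suspected failure mode (this is only a sketch of my guess, not the actual code behind that link): if each replica keeps its own in-memory buffer map, chunks routed to different pods never end up in the same buffer.

```ts
// Per-process buffer map: every replica holds its own copy of this state.
const uploadFileBuffers = new Map<string, Buffer>()

// A chunk of the cache upload arrives at whichever replica the Service routes it to.
function appendChunk(uploadId: string, chunk: Buffer, offset: number) {
  const existing = uploadFileBuffers.get(uploadId) ?? Buffer.alloc(0)
  const grown = Buffer.alloc(Math.max(existing.length, offset + chunk.length))
  existing.copy(grown)
  chunk.copy(grown, offset)
  uploadFileBuffers.set(uploadId, grown)
}

// On finalize, only the chunks this replica happened to receive get flushed to S3;
// everything buffered on the other replicas is silently missing from the archive.
function finalizeUpload(uploadId: string): Buffer | undefined {
  return uploadFileBuffers.get(uploadId)
}
```

With a single replica all chunks land in the same map, which would also explain why scaling down to 1 makes the problem disappear.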
I understand that this might be somewhat of a niche problem because there is no documented way or example of running the cache server in a replicated setup, and we have scaled down to a single instance as a workaround for now. But I expect that with the addition of a Helm chart to this repo (https://github.com/falcondev-oss/github-actions-cache-server/pull/58), which enables autoscaling and running multiple replicas, more people will face the same problem.
hey @Kleinkind, I'm trying to set up the cache server and failing at it. I tried to contact you but couldn't find any medium for it, so I'm commenting here; not a usual way to communicate with another engineer. I'm kind of in a hurry, so if you have some time, could you please help me with the setup?
@Kleinkind I found several issues with the proposed helm chart on that PR https://github.com/falcondev-oss/github-actions-cache-server/pull/58, but I think it could be a good starting point.
Anyway, currently I'm running the cache server with 2 replicas and persistentVolume disabled. This is because I noticed that the app only uses the ephemeral volume mounted under /tmp and not the one mounted in /app/.data.
Even with just 1 replica, I still encounter the same error as in https://github.com/falcondev-oss/github-actions-cache-server/issues/54. I believe the differing hashes are partially to blame, but I don't think it's solely a multi-replica issue.
hey @matteovivona, I need a little help setting up the cache server. My runners are not picking up the server URL; I followed all the steps from the docs and tried to debug a lot but couldn't find what's wrong.
Could you please help me with it? How can I contact you?
I am not a maintainer of this project, and your comment has nothing to do with this issue. If you are facing problems, I think the best way is to open an issue of your own and ask for help there.
okay, creating an issue then
Thanks for reporting and debugging this! I'm a bit short on time right now but I'll take a look this weekend. @Kleinkind
pls open a new issue 🙏
@Kleinkind pls take a look at #67 and let me know what would work best for you or whether you have any other ideas 🙏
Clustering support added in https://github.com/falcondev-oss/github-actions-cache-server/releases/tag/v4.0.0