Malformed cache entries when running multiple replicas
Hello 👋
We get the following error during cache load in our jobs:
```
/*stdin*\ : Read error (39) : premature end
/usr/bin/tar: Unexpected EOF in archive
/usr/bin/tar: Unexpected EOF in archive
/usr/bin/tar: Error is not recoverable: exiting now
Warning: Failed to restore: "/usr/bin/tar" failed with error: The process '/usr/bin/tar' failed with exit code 2
```
The error looks somewhat similar to the one reported in https://github.com/falcondev-oss/github-actions-cache-server/issues/54, but it still differs, and the cause seems to be different (at least from my testing).
Some information on our setup:
- we run the cache server in a Kubernetes cluster
- we run 3 replicas of the cache server
- we use a Postgres database (in-cluster, via https://github.com/zalando/postgres-operator)
- we use MinIO S3 for storage (in-cluster, via https://github.com/minio/operator)
As we only face these problems in some of our projects, I experimented a bit to narrow down the cause. These are my findings:
It seems like the stored cache archive is actually incomplete
- For the affected cache entries, the object size reported by MinIO does not match the cache size reported in the job logs on cache upload (see the size-check sketch below)
- The cache size reported in the job logs on cache download is less than the size reported on upload
- When the size reported by MinIO matches the size in the upload logs, the error does not occur
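To make that comparison easy to repeat, here is a minimal sketch using the MinIO JS client (endpoint, credentials, bucket and object key are placeholders for our setup) that checks a stored cache object against the byte count printed in the upload log:

```ts
import * as Minio from 'minio'

// Placeholder in-cluster endpoint and credentials.
const client = new Minio.Client({
  endPoint: 'minio.minio-tenant.svc.cluster.local',
  port: 9000,
  useSSL: false,
  accessKey: process.env.MINIO_ACCESS_KEY ?? '',
  secretKey: process.env.MINIO_SECRET_KEY ?? '',
})

// Compare the object size in MinIO with the size the upload job reported.
async function checkCacheObject(bucket: string, key: string, expectedBytes: number) {
  const stat = await client.statObject(bucket, key)
  console.log(`stored=${stat.size} expected=${expectedBytes}`)
  if (stat.size < expectedBytes)
    console.warn('stored object is smaller than the uploaded archive -> truncated cache entry')
}

checkCacheObject('gha-cache', 'some-cache-key', 123456789).catch(console.error)
```

For the affected entries this reports a stored size smaller than the uploaded one; for unaffected entries the two values match.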
It seems to be related to the replicated setup
- when we scale the replicas down to 1 (and delete & recreate the existing cache entries) the error does not occur (although with 2 replicas it occurs nearly every time)
It seems to be related to cache size / multipart uploads
- all jobs on which we are facing these problems are (trying to) cache archives larger than 64 MB, which is also the default part size for multipart uploads in the MinIO client (see the partSize option on https://min.io/docs/minio/linux/developers/javascript/API.html#constructor and the sketch below)
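For context, here is a sketch of how that part size can be changed on the client side for testing (same placeholder endpoint and credentials as above; whether and how the cache server passes this option through is an assumption I have not verified):

```ts
import * as Minio from 'minio'

// `partSize` controls the part size for multipart uploads; the documented default is 64 MiB.
// Raising it is one way to test whether the 64 MB boundary is what triggers the corruption.
const client = new Minio.Client({
  endPoint: 'minio.minio-tenant.svc.cluster.local',
  port: 9000,
  useSSL: false,
  accessKey: process.env.MINIO_ACCESS_KEY ?? '',
  secretKey: process.env.MINIO_SECRET_KEY ?? '',
  partSize: 128 * 1024 * 1024, // 128 MiB instead of the 64 MiB default
})
```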
My suspicion is that this could be caused by the in-memory uploadFileBuffers (https://github.com/falcondev-oss/github-actions-cache-server/blob/dev/lib/storage/index.ts#L38), which might lead to problems when a cache upload is chunked and the chunks are sent to different replicas of the cache server.
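To illustrate the suspected failure mode (this is only a sketch of my guess, not the actual code behind that link): if each replica keeps its own in-memory buffer map, chunks routed to different pods never end up in the same buffer.

```ts
// Per-process buffer map: every replica holds its own copy of this state.
const uploadFileBuffers = new Map<string, Buffer>()

// A chunk of the cache upload arrives at whichever replica the Service routes it to.
function appendChunk(uploadId: string, chunk: Buffer, offset: number) {
  const existing = uploadFileBuffers.get(uploadId) ?? Buffer.alloc(0)
  const grown = Buffer.alloc(Math.max(existing.length, offset + chunk.length))
  existing.copy(grown)
  chunk.copy(grown, offset)
  uploadFileBuffers.set(uploadId, grown)
}

// On finalize, only the chunks this replica happened to receive get flushed to S3;
// everything buffered on the other replicas is silently missing from the archive.
function finalizeUpload(uploadId: string): Buffer | undefined {
  return uploadFileBuffers.get(uploadId)
}
```

With a single replica all chunks land in the same map, which would also explain why scaling down to 1 makes the problem disappear.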
I understand that this might be somewhat of a niche problem because there is no documented way or example of running the cache server in a replicated setup, and we have scaled down to a single instance as a workaround for now. But I expect that with the addition of a Helm chart to this repo (https://github.com/falcondev-oss/github-actions-cache-server/pull/58), which enables autoscaling and running multiple replicas, more people will face the same problem.
hey @Kleinkind, I'm trying to set up the cache server and failing at it. I tried to contact you but couldn't find any medium for it, so I'm commenting here; not a usual way to communicate with another engineer. I'm kind of in a hurry, so if you have some time, could you please help me with the setup?
@Kleinkind I found several issues with the proposed helm chart on that PR https://github.com/falcondev-oss/github-actions-cache-server/pull/58, but I think it could be a good starting point.
Anyway, currently I'm running the cache server with 2 replicas and persistentVolume disabled. This is because I noticed that the app only uses the ephemeral volume mounted under /tmp and not the one mounted in /app/.data.
Even with just 1 replica, I still encounter the same error as in https://github.com/falcondev-oss/github-actions-cache-server/issues/54. I believe the differing hashes are partially to blame, but I don't think it's solely a multi-replica issue.
hey @matteovivona, I need a little help setting up the cache server. My runners are not picking up the server URL; I followed all the steps from the docs and tried to debug a lot but couldn't find what's wrong.
Could you please help me with it? How can I contact you?
I am not a maintainer of this project, and your comment has nothing to do with this issue. If you are facing problems, I think the best way is to open an issue of your own and ask for help there.
okay, creating an issue then
Thanks for reporting and debugging this! I'm a bit short on time right now but I'll take a look this weekend. @Kleinkind
pls open a new issue 🙏
@Kleinkind pls take a look at #67 and let me know what would work best for you or whether you have any other ideas 🙏
Clustering support added in https://github.com/falcondev-oss/github-actions-cache-server/releases/tag/v4.0.0