Describe multiple SBOM scan targets
What would you like to be added: Allow specifying multiple scan targets from which one or more SBOMs are created. Take the following example for illustrative purposes:
```yaml
# syft.yaml
inputs:
  - type: image
    id: my-image-sbom
    value: docker.io/me/my-image:latest
    format: spdxjson
  - type: directory
    id: my-source-sbom
    value: ./src
    format: spdx
```
This would allow scanning both an artifact and its source to produce two different SBOMs, so that the CI invocation would simply be:
```shell
# syft.yaml is automatically assumed...
syft
# ...outputs "my-image-sbom.json" and "my-source-sbom.spdx" files
```
You could combine the output from multiple cataloging efforts into the same SBOM by using the same id for each input:
```yaml
# syft.yaml
inputs:
  - type: image
    id: my-sbom
    root-package: container
    value: docker.io/me/my-image:latest
    format: spdxjson
  - type: directory
    id: my-sbom
    root-package: source
    value: ./src
```
The result would be a single `my-sbom.json` in spdx-json output. Additionally, anything found in the container would have a relationship tied to a phantom "container" package, and anything found in the source scan would have a relationship to a phantom "source" package.
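The phantom root packages would yield a relationship graph roughly like the following sketch (illustrative only; the package names and triple structure here are hypothetical, not real Syft output):

```python
# Illustrative sketch of the merged SBOM's relationships when two inputs
# share the same id and declare phantom root packages. All names are made up.

# (parent, relationship, child) triples: each cataloged package hangs off
# the phantom root package of the input that discovered it.
relationships = [
    ("container", "CONTAINS", "libssl"),    # found in the image scan
    ("container", "CONTAINS", "busybox"),   # found in the image scan
    ("source", "CONTAINS", "left-pad"),     # found in the ./src scan
]

# Grouping by root recovers which scan each package came from.
by_root = {}
for parent, _, child in relationships:
    by_root.setdefault(parent, []).append(child)

print(by_root)  # {'container': ['libssl', 'busybox'], 'source': ['left-pad']}
```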
I'm not 100% in love with the proposed format above, as it would be easy to abuse when it comes to combining incompatible formats, but it suffices for illustrative purposes.
We could surface a small set of this functionality via the CLI by allowing for multiple scan targets:
```shell
syft dir:./ image:docker.io/me/my-image:latest -o spdxjson
```
Why is this needed: For more complicated workflows it would be ideal to encode what needs to be cataloged into a description instead of relying on the consumer to orchestrate multiple syft calls with bash.
Additionally there is no way to deal with "multiple" SBOMs with syft, or grouping related items with relationships, which could be a powerful pattern.
Another approach to the output here would be to allow syft to take multiple images (or a multi-arch image) as input and stream multiple SBOM documents to the file in question. How this could work for each format:
- table: simply output multiple tables, with an additional header listing which image is being processed
- json, spdx-json, cyclonedx-json: emit each document on a single line, treating the output as JSON Lines (JSONL)
- spdx-tag-value: not supported
- cyclonedx-xml: XML already supports multiple embedded documents in a single file
This dodges the problem of needing to solve how multiple sources are handled in a single SBOM; instead, that can be handled by something that intentionally takes multiple SBOMs for merging (for example, `syft merge sbom1.json sbom2.json`).
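If documents were streamed as JSON Lines, a downstream consumer could split them back apart without a streaming JSON parser. A minimal sketch, assuming one complete document per line (the document contents below are placeholders, not real Syft output):

```python
import io
import json

# Hypothetical JSONL stream: one complete SBOM document per line.
stream = io.StringIO(
    '{"source": {"type": "image", "target": "docker.io/me/my-image:latest"}, "artifacts": []}\n'
    '{"source": {"type": "directory", "target": "./src"}, "artifacts": []}\n'
)

# Each line parses independently into its own document.
documents = [json.loads(line) for line in stream if line.strip()]

for doc in documents:
    print(doc["source"]["type"])  # image, then directory
```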
This impacts #617 #3562 #562
Another possibility is to do the following:
- Make the "source" part of the SBOM be the source of the many images, e.g. a kubernetes manifest or a multi-image OCI manifest
- Make artifacts in the SBOM with purl type
pkg:ocifor each image we found - Use relationships and nesting to show which packages come from which image.
The advantage of this is that it can be done in all the SBOM specs.
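For illustration, a `pkg:oci` purl for each discovered image could be built like this (hand-rolled string construction and a made-up digest; a real implementation would use a purl library):

```python
from urllib.parse import quote

def oci_purl(name: str, digest: str, repository_url: str) -> str:
    """Build a pkg:oci purl; the ':' in the digest must be percent-encoded."""
    return f"pkg:oci/{name}@{quote(digest, safe='')}?repository_url={repository_url}"

# Hypothetical digest, for illustration only.
purl = oci_purl("my-image", "sha256:deadbeef", "docker.io/me/my-image")
print(purl)
# pkg:oci/my-image@sha256%3Adeadbeef?repository_url=docker.io/me/my-image
```

Relationships in the SBOM would then tie each cataloged package back to the `pkg:oci` artifact for the image it was found in.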
To move this forward I think the easiest path is to first add the CLI approach (as args, e.g. `syft dir:./ image:docker.io/me/my-image:latest -o spdxjson`), then in the future optionally add a more complicated input configuration that allows for selection, additional relationships, and other operations.
I was talking to @popey at KubeCon about a multi-language project I need SBOMs produced for. The project has a Go backend and a Vue.js frontend (with npm). Currently the backend's SBOMs are produced with ko.build's generator. I'm having issues producing an SBOM with Syft against a container image.
I tried with a `package.json` shipped inside the image via this PR and built with `./hack/publish.sh --local`.
The package shows up but not the dependencies.
This came up on a community discussion with @joonas due to a desire to use Syft to catalog Zarf packages. It was my understanding that a Zarf package is basically an archive that has subdirectories containing multiple OCI directories, maybe some other types of files. This is a problem today because Syft only has one location for a single Source.
However, in postulating about solutions, and in light of a number of different open issues related to this, one of the biggest challenges that keeps coming up is: where do we put all the data? In other words: how can we describe multiple sources in a meaningful way?
I think a solution could be to allow sources-as-packages (or possibly SBOMs-as-packages): when we find a nested source, we create a new source object to populate the package metadata and send it back through the same cataloging procedures. This could be done by a dedicated SourceCataloger that identifies nested sources, surfaces SourcePackages with the source information, and adds CONTAINS relationships to the packages found by those sub-scans.
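The recursive sources-as-packages idea can be sketched as follows (entirely hypothetical types and names; Syft's real Go internals look nothing like this):

```python
from dataclasses import dataclass, field

# Hypothetical, simplified model: none of these types exist in Syft.
@dataclass
class Source:
    location: str
    packages: list = field(default_factory=list)        # packages found directly here
    nested_sources: list = field(default_factory=list)  # sources discovered inside

def catalog(source, relationships=None):
    """Recursively catalog a source, surfacing nested sources as packages."""
    relationships = [] if relationships is None else relationships
    # Normal cataloging: the source CONTAINS each of its own packages.
    for pkg in source.packages:
        relationships.append((source.location, "CONTAINS", pkg))
    # Surface each nested source as a "source package" and recurse, so that
    # nested packages hang off the nested source rather than the outer one.
    for nested in source.nested_sources:
        relationships.append((source.location, "CONTAINS", nested.location))
        catalog(nested, relationships)
    return relationships

# A Zarf-like archive containing two OCI directories.
pkg = Source("zarf.tar", nested_sources=[
    Source("images/app", packages=["libssl"]),
    Source("images/db", packages=["postgres"]),
])
rels = catalog(pkg)
print(rels)
```

Note that `libssl` ends up contained by `images/app`, not by `zarf.tar` directly, which is exactly the containment distinction the data model would need to preserve.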
Just to reiterate some of the known challenges:
- we probably need an enhancement to be able to identify nested locations; this could also provide a simple way to identify root packages
- we need a way to prevent nested packages from appearing as directly contained by outer sources in the data model; this may change the assumption that all packages are associated with the source. Format outputs will need their relationships adjusted, and CycloneDX isn't outputting useful source-level relationships today anyway, which still needs to be implemented
... assuredly there are other things I'm missing, but that's the gist of our conversation.