Feature/allow remote href blobs
This stems from a question I asked in the Cloud Foundry BOSH Slack channel. I've been wondering why the BOSH CLI for development must use an S3 (or similar) bucket for storing blobs. Why not just allow URL references and download from the source? Many BOSH-related artifacts are hosted on GitHub. Why duplicate them somewhere else?
This was just a quick spike to prove the idea.
Note that this only affects use of the CLI for development.
Since this was more of a spike, the HREF handling calls Go's http.Get (etc.) directly. Judging from the rest of the code base, this should probably sit behind an interface so it can be faked in unit tests.
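To make the "behind some interface" idea concrete, here is a minimal sketch, not code from this PR. The names Fetcher, HTTPFetcher, FakeFetcher and readAll are hypothetical and chosen only for illustration; the existing code base may already have an HTTP abstraction that should be reused instead.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// Fetcher is a hypothetical seam for downloading a blob from its HREF.
type Fetcher interface {
	Fetch(url string) (io.ReadCloser, error)
}

// HTTPFetcher is a possible production implementation backed by net/http.
type HTTPFetcher struct {
	Client *http.Client
}

func (f HTTPFetcher) Fetch(url string) (io.ReadCloser, error) {
	resp, err := f.Client.Get(url)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		resp.Body.Close()
		return nil, fmt.Errorf("fetching %q: unexpected status %s", url, resp.Status)
	}
	return resp.Body, nil
}

// FakeFetcher serves canned content so unit tests never touch the network.
type FakeFetcher struct {
	Bodies map[string]string
}

func (f FakeFetcher) Fetch(url string) (io.ReadCloser, error) {
	body, ok := f.Bodies[url]
	if !ok {
		return nil, fmt.Errorf("fetching %q: not found", url)
	}
	return io.NopCloser(strings.NewReader(body)), nil
}

// readAll shows calling code that depends only on the Fetcher interface.
func readAll(f Fetcher, url string) (string, error) {
	rc, err := f.Fetch(url)
	if err != nil {
		return "", err
	}
	defer rc.Close()
	data, err := io.ReadAll(rc)
	return string(data), err
}

func main() {
	fake := FakeFetcher{Bodies: map[string]string{"https://example.com/blob.tgz": "blob-bytes"}}
	content, err := readAll(fake, "https://example.com/blob.tgz")
	fmt.Println(content, err)
}
```

With a seam like this, the sync and validation paths can be exercised in tests by passing a FakeFetcher, while the real CLI would pass an HTTPFetcher.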
Is this of interest? What needs to be changed to meet standards? Is there any existing code that should be used? (For example, around the URL handling.)
The following commands were modified:
- bosh add-blob now accepts a URL, stored as HREF in the blob structures and YAML file.
- bosh blobs lists the HREF. (This ended up a bit too wide, and I could easily be convinced this is superfluous.)
- bosh sync-blobs defers to the blobstore configuration; if there is none and the HREF exists, the blob is simply downloaded from the source.
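The sync-blobs fallback described in the last item can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the Blob struct, the Source type and chooseSource are hypothetical names, and the real CLI's blob structures carry more fields.

```go
package main

import "fmt"

// Blob holds only the fields relevant here; the struct is hypothetical.
type Blob struct {
	Path        string
	BlobstoreID string
	HREF        string
}

// Source says where a missing blob would be fetched from.
type Source int

const (
	SourceNone Source = iota
	SourceBlobstore
	SourceHREF
)

func (s Source) String() string {
	switch s {
	case SourceBlobstore:
		return "blobstore"
	case SourceHREF:
		return "href"
	default:
		return "none"
	}
}

// chooseSource defers to the blobstore when one is configured and only
// falls back to the HREF when there is no blobstore and an HREF exists.
func chooseSource(b Blob, blobstoreConfigured bool) Source {
	switch {
	case blobstoreConfigured:
		return SourceBlobstore
	case b.HREF != "":
		return SourceHREF
	default:
		return SourceNone
	}
}

func main() {
	b := Blob{Path: "nodejs/node.tar.xz", HREF: "https://example.com/node.tar.xz"}
	fmt.Println(chooseSource(b, false))
	fmt.Println(chooseSource(b, true))
}
```

Keeping the decision in one small function makes the precedence (blobstore first, HREF second) easy to test without any network access.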
Example repository is here: https://github.com/a2geek/test-release
(feel free to compile this BOSH CLI, clone that test release, and run the bosh sync-blobs...)
$ bosh add-blob https://nodejs.org/dist/v20.18.3/node-v20.18.3-linux-x64.tar.xz node/node-v20.18.3-linux-x64.tar.xz
Added blob 'node/node-v20.18.3-linux-x64.tar.xz'
Succeeded
$ bosh add-blob https://github.com/adoptium/temurin21-binaries/releases/download/jdk-21.0.6%2B7/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz
Added blob 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz'
Succeeded
$ bosh blobs
Path Size Blobstore ID Digest HREF
java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz 197 MiB (local) sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae https://github.com/adoptium/temurin21-binaries/releases/download/jdk-21.0.6%2B7/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz
nodejs/node-v22.14.0-linux-x64.tar.xz 28 MiB (local) sha256:69b09dba5c8dcb05c4e4273a4340db1005abeafe3927efda2bc5b249e80437ec https://nodejs.org/dist/v22.14.0/node-v22.14.0-linux-x64.tar.xz
2 blobs
Succeeded
$ tree .
.
├── blobs
│ ├── java
│ │ └── OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz
│ └── nodejs
│ └── node-v22.14.0-linux-x64.tar.xz
├── config
│ ├── blobs.yml
│ └── final.yml
├── jobs
├── packages
└── src
8 directories, 4 files
$ rm -rf blobs/*
$ tree .
.
├── blobs
├── config
│ ├── blobs.yml
│ └── final.yml
├── jobs
├── packages
└── src
6 directories, 2 files
$ bosh sync-blobs
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (207 MB) (id: - sha1: sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae) started
Blob download 'nodejs/node-v22.14.0-linux-x64.tar.xz' (30 MB) (id: - sha1: sha256:69b09dba5c8dcb05c4e4273a4340db1005abeafe3927efda2bc5b249e80437ec) started
Blob download 'nodejs/node-v22.14.0-linux-x64.tar.xz' (id: -) finished
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (id: -) finished
Succeeded
$ tree .
.
├── blobs
│ ├── java
│ │ └── OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz
│ └── nodejs
│ └── node-v22.14.0-linux-x64.tar.xz
├── config
│ ├── blobs.yml
│ └── final.yml
├── jobs
├── packages
└── src
8 directories, 4 files
Question: why not just use the local provider and add a dev script to download the blobs, given that this feature won't work with final releases anyway?
If you're using local blobs, you need to create some system to identify where those blobs are stored or sourced: either a custom script for every repository, or something generic, with metadata contributed to each repository, that can be reused. It seems that 'bosh' itself would be the correct place for that code, instead of every developer having to make up some mechanism.
I see what you mean for the final release. You must declare a provider. So I set up the local provider, and it's good. My goal is simply to change the way we retrieve blobs when developing. Maybe an enhancement to the local provider would be better?
The final release blobs will still contain the packages and all their blob dependencies.
I do really like the idea of first class support for references to the original upstream artifacts. But then more from a provenance point of view. Ideally this could be used to verify that blobs match with upstream.
But then more from a provenance point of view. Ideally this could be used to verify that blobs match with upstream.
node/node-v20.18.3-linux-x64.tar.xz:
  size: 25810368
  sha: sha256:595bcc9a28e6d1ee5fc7277b5c3cb029275b98ec0524e162a0c566c992a7ee5c
Wouldn't this be one of the purposes of the embedded SHA256? Just trying to clarify.
Poking through the code, switching the local provider over to being the rebuild looks to be a larger change than I'd expect. Everything works off the blob id and not the Blob structure. (Meaning the URL gets dropped somewhere along the way.)
Correct, the SHA is there to verify that, but sometimes it's difficult to know where a blob actually came from. Basically, it would be great if a build system could independently verify that a blob, when downloaded from the original source, produces the same SHA.
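As a rough illustration of that check, the sketch below hashes a stream and compares it with a digest written in the "sha256:<hex>" form used in blobs.yml. The function names verifyDigest and verifyBytes are hypothetical, only SHA-256 is handled, and this is not how the CLI implements its digest checks.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// verifyDigest hashes everything read from r and compares the result with
// an expected digest of the form "sha256:<hex>".
func verifyDigest(r io.Reader, expected string) error {
	algo, want, ok := strings.Cut(expected, ":")
	if !ok || algo != "sha256" {
		return fmt.Errorf("unsupported digest %q", expected)
	}
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return err
	}
	got := hex.EncodeToString(h.Sum(nil))
	if got != want {
		return fmt.Errorf("expected digest 'sha256:%s' but was 'sha256:%s'", want, got)
	}
	return nil
}

// verifyBytes is a convenience wrapper for in-memory content.
func verifyBytes(data []byte, expected string) error {
	return verifyDigest(bytes.NewReader(data), expected)
}

func main() {
	// In a real check the reader would be the response body fetched from
	// the blob's HREF; a fixed string stands in for it here.
	good := "sha256:2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"
	fmt.Println(verifyBytes([]byte("hello"), good))
	fmt.Println(verifyBytes([]byte("tampered"), good))
}
```

Because the hash is computed while streaming, the origin download does not need to be written to disk just to confirm that it matches the recorded SHA.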
Hoping you can clarify what you're thinking of from the CLI. I assume bosh add-blob http://... blobref to identify the URL. How would the validation work? (Obviously not for every sync-blobs command... maybe a flag of some sort?)
Yeah, some flag such as bosh create-release --final --validate-blob-origin, which would validate that the blobs match their origins.
First stab:
$ bosh-cli create-release --validate-blob-origin
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (207 MB) (id: - sha1: sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae) started
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (id: -) finished
Blob download 'node/node-v20.18.3-linux-x64.tar.xz' (26 MB) (id: - sha1: sha256:595bcc9a28e6d1ee5fc7277b5c3cb029275b98ec0524e162a0c566c992a7ee5c) started
Blob download 'node/node-v20.18.3-linux-x64.tar.xz' (id: -) finished
Added dev release 'test/v2+dev.1'
Name test
Version v2+dev.1
Commit Hash 2c76fc1
Job Digest Packages
does-nothing/1ca7516f74d7a497f78785b924bc3691c14a9e63a5905134ffb6d0b6158f4687 sha256:c1019226ae0a52a442c05eed96d3f90b8b2942d20e67eacb031d892c136c360a java
node
1 jobs
Package Digest Dependencies
java/889c392818ee6efc9e38f3db86b55d757d068d70a9087f95ee628eefc3751fec sha256:6b0c68cb8ea090112af7b2948ef9843cd23a9c85c98710758ecd1e4595e04d35 -
node/1dc0f3b044375b6865258b9207d6a12695baac376834090e60d86e1f3a5ee231 sha256:7e66dc70d7c9001346bc7375273c586455301c4f40407c79d315e04dddd994d1 -
2 packages
Succeeded
... and manually botched one of the SHA codes:
$ bosh-cli create-release --validate-blob-origin
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (207 MB) (id: - sha1: sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbaf) started
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (id: -) failed
Validating SHA for 'https://github.com/adoptium/temurin21-binaries/releases/download/jdk-21.0.6%!B(MISSING)7/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz':
Expected stream to have digest 'sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbaf' but was 'sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae'
Exit code 1
I'm just using the current reporting code. I removed an HREF from one blob and it's currently reported as an error. I'll try to make that a warning instead. If you want to validate the source but the source isn't configured, I'm thinking that's not an error, just a warning that the blob can't be validated.
Great progress!! Yeah, it makes sense to start with a warning; once this feature has been adopted more broadly, we can always add another flag (like --expect-all-origins).
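The warning-versus-error behaviour discussed here could look something like the sketch below. Everything in it is hypothetical: Blob, Report and checkOrigins are illustrative names, and the strict parameter merely stands in for the suggested --expect-all-origins flag, which does not exist.

```go
package main

import (
	"errors"
	"fmt"
)

// Blob holds only the fields needed for this sketch.
type Blob struct {
	Path string
	HREF string
}

// Report collects the outcome of an origin check.
type Report struct {
	Warnings []string
	Errors   []string
}

// checkOrigins validates every blob that has an HREF. A blob without an
// HREF yields a warning, or an error when strict is true.
func checkOrigins(blobs []Blob, strict bool, validate func(Blob) error) Report {
	var r Report
	for _, b := range blobs {
		if b.HREF == "" {
			msg := fmt.Sprintf("blob '%s' has no HREF; origin cannot be validated", b.Path)
			if strict {
				r.Errors = append(r.Errors, msg)
			} else {
				r.Warnings = append(r.Warnings, msg)
			}
			continue
		}
		if err := validate(b); err != nil {
			r.Errors = append(r.Errors, err.Error())
		}
	}
	return r
}

func main() {
	blobs := []Blob{
		{Path: "java/jdk.tar.gz", HREF: "https://example.com/jdk.tar.gz"},
		{Path: "node/node.tar.xz"}, // no HREF recorded
	}
	// A stand-in validator that rejects every origin, for demonstration.
	alwaysFail := func(b Blob) error { return errors.New("digest mismatch for " + b.Path) }
	r := checkOrigins(blobs, false, alwaysFail)
	fmt.Println("warnings:", r.Warnings)
	fmt.Println("errors:", r.Errors)
}
```

Passing the validator as a function keeps the policy (warn or fail) separate from the download and digest comparison.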
At some point, it would also be good to think about how to express the fact that blob origin was checked during release creation in the final release metadata.
This should be ready for review.
@a2geek would be great if you could resolve the merge conflicts.
I think I got it resolved!
moved to review section
My original intent with this change was inspired by what I thought was the fact that the only blobs that go into the blob store are the binaries we attach while developing. I have found that this is not true. In shifting between computers, I realized that all the "internal" packages and jobs are also stored as blobs, and that the BOSH CLI cannot recover (at least not easily) if those blobs don't exist, even for prior versions of those blobs. Given these realizations, I don't think this change could ever have worked. We all need to use a blob store, and a commercial one if we are doing public work.
Sorry for the chase!
Cancelling the PR.
@a2geek - no worries, thank you for the reply and for digging into this.
If you ever feel like working on the "metadata pointing to the upstream source" aspect that is part of this PR, that would definitely be valuable.
In any case welcome to the bosh world, and thanks for raising this!