Feature/allow remote href blobs

Open a2geek opened this issue 11 months ago • 9 comments

This stems from a question I asked in the Cloud Foundry BOSH Slack channel. I've been wondering why the BOSH CLI for development must use some S3 (or similar) buckets for storing blobs. Why not just allow URL references and download from the source? Many BOSH-related artifacts are hosted on GitHub. Why duplicate them somewhere else?

This was just a quick spike to prove the idea.

Note that this only affects CLI usage during development.

Since this was more of a spike, the HREF handling calls Go's http.Get (etc.) directly. Judging from the rest of the code base, this should probably sit behind some interface so it can be faked in unit tests.

Is this of interest? What needs to be changed to meet standards? Is there any existing code that should be used? (For example, around the URL handling.)

The following commands were modified:

  • bosh add-blob now accepts a URL, stored as HREF in the blob structures and yaml file.
  • bosh blobs lists the HREF. (This ended up a bit too wide, and I could easily be convinced this is superfluous).
  • bosh sync-blobs defers to the blobstore configuration -- if there is none and the HREF exists, the blob is simply downloaded from the source.
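
The sync-blobs fallback described in the last bullet can be sketched roughly as follows. This is an illustration only: the Blob struct, the Source type, and chooseSource are hypothetical names and do not reflect how the CLI is actually structured.

```go
package main

import "fmt"

// Blob is a hypothetical, trimmed-down stand-in for a blobs.yml entry.
type Blob struct {
	Path string
	HREF string // empty when the blob was not added from a URL
}

// Source names where sync-blobs would fetch a blob from.
type Source string

const (
	SourceBlobstore Source = "blobstore"
	SourceHREF      Source = "href"
	SourceNone      Source = "none"
)

// chooseSource mirrors the rule from the description: a configured blobstore
// always wins; only without one, and with an HREF present, is the blob
// downloaded from its original URL.
func chooseSource(b Blob, blobstoreConfigured bool) Source {
	switch {
	case blobstoreConfigured:
		return SourceBlobstore
	case b.HREF != "":
		return SourceHREF
	default:
		return SourceNone
	}
}

func main() {
	b := Blob{Path: "some/blob.tgz", HREF: "https://example.invalid/blob.tgz"}
	fmt.Println(chooseSource(b, true))
	fmt.Println(chooseSource(b, false))
	fmt.Println(chooseSource(Blob{Path: "other/blob.tgz"}, false))
}
```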

Example repository is here: https://github.com/a2geek/test-release (feel free to compile this BOSH CLI, clone that test release, and run the bosh sync-blobs...)

$ bosh add-blob https://nodejs.org/dist/v20.18.3/node-v20.18.3-linux-x64.tar.xz node/node-v20.18.3-linux-x64.tar.xz
Added blob 'node/node-v20.18.3-linux-x64.tar.xz'

Succeeded
$ bosh add-blob https://github.com/adoptium/temurin21-binaries/releases/download/jdk-21.0.6%2B7/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz
Added blob 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz'

Succeeded
$ bosh blobs
Path                                                   Size     Blobstore ID  Digest                                                                   HREF  
java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz  197 MiB  (local)       sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae  https://github.com/adoptium/temurin21-binaries/releases/download/jdk-21.0.6%2B7/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz  
nodejs/node-v22.14.0-linux-x64.tar.xz                  28 MiB   (local)       sha256:69b09dba5c8dcb05c4e4273a4340db1005abeafe3927efda2bc5b249e80437ec  https://nodejs.org/dist/v22.14.0/node-v22.14.0-linux-x64.tar.xz  

2 blobs

Succeeded
$ tree .
.
├── blobs
│   ├── java
│   │   └── OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz
│   └── nodejs
│       └── node-v22.14.0-linux-x64.tar.xz
├── config
│   ├── blobs.yml
│   └── final.yml
├── jobs
├── packages
└── src

8 directories, 4 files
$ rm -rf blobs/*
$ tree .
.
├── blobs
├── config
│   ├── blobs.yml
│   └── final.yml
├── jobs
├── packages
└── src

6 directories, 2 files
$ bosh sync-blobs 
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (207 MB) (id: - sha1: sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae) started
Blob download 'nodejs/node-v22.14.0-linux-x64.tar.xz' (30 MB) (id: - sha1: sha256:69b09dba5c8dcb05c4e4273a4340db1005abeafe3927efda2bc5b249e80437ec) started
Blob download 'nodejs/node-v22.14.0-linux-x64.tar.xz' (id: -) finished
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (id: -) finished

Succeeded
$ tree .
.
├── blobs
│   ├── java
│   │   └── OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz
│   └── nodejs
│       └── node-v22.14.0-linux-x64.tar.xz
├── config
│   ├── blobs.yml
│   └── final.yml
├── jobs
├── packages
└── src

8 directories, 4 files

a2geek avatar Feb 27 '25 17:02 a2geek

Question: why not just use the local provider and add a dev script to download the blobs, given that this feature won't work with final releases anyway?

rkoster avatar Mar 06 '25 16:03 rkoster

If you're using local blobs, you need to create some system to identify where those blobs are stored/sourced -- either a custom script for every repository, or something generic with the metadata that is contributed to each repository and can be reused. It seems that 'bosh' itself would be the correct place for that reusable code, instead of every developer having to make up some mechanism.

I see what you mean for the final release. You must declare a provider. So I set up the local provider, and it's good. My goal is simply to change the way we retrieve blobs when developing. Maybe an enhancement to the local provider would be better?

a2geek avatar Mar 06 '25 18:03 a2geek

The final release blobs will still contain the packages and all their blob dependencies.

I do really like the idea of first class support for references to the original upstream artifacts. But then more from a provenance point of view. Ideally this could be used to verify that blobs match with upstream.

rkoster avatar Mar 06 '25 18:03 rkoster

But then more from a provenance point of view. Ideally this could be used to verify that blobs match with upstream.

node/node-v20.18.3-linux-x64.tar.xz:
  size: 25810368
  sha: sha256:595bcc9a28e6d1ee5fc7277b5c3cb029275b98ec0524e162a0c566c992a7ee5c

Wouldn't this be one of the purposes of the embedded SHA256? Just trying to clarify.

Poking through the code, switching the local provider over to being the rebuild looks to be a larger change than I'd expect. Everything works off the blob id and not the Blob structure. (Meaning the URL gets dropped somewhere along the way.)

a2geek avatar Mar 06 '25 19:03 a2geek

Correct, the SHA is there to verify that, but sometimes it's difficult to know where a blob actually came from. Basically, it would be great if a build system could independently verify that a blob, when downloaded from the original source, produces the same SHA.

rkoster avatar Mar 07 '25 08:03 rkoster

Hoping you can clarify what you're thinking of from the CLI. I assume bosh add-blob http://... blobref to identify the URL. How would the validation work? (Obviously not for every sync-blobs command... maybe a flag of some sort?)

a2geek avatar Mar 14 '25 14:03 a2geek

Yeah, some flag on create-release, e.g. bosh create-release --final --validate-blob-origin, which would validate that the blobs match their origins.

rkoster avatar Mar 25 '25 08:03 rkoster

First stab:

$ bosh-cli create-release --validate-blob-origin
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (207 MB) (id: - sha1: sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae) started
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (id: -) finished
Blob download 'node/node-v20.18.3-linux-x64.tar.xz' (26 MB) (id: - sha1: sha256:595bcc9a28e6d1ee5fc7277b5c3cb029275b98ec0524e162a0c566c992a7ee5c) started
Blob download 'node/node-v20.18.3-linux-x64.tar.xz' (id: -) finished

Added dev release 'test/v2+dev.1'

Name         test  
Version      v2+dev.1  
Commit Hash  2c76fc1  

Job                                                                            Digest                                                                   Packages  
does-nothing/1ca7516f74d7a497f78785b924bc3691c14a9e63a5905134ffb6d0b6158f4687  sha256:c1019226ae0a52a442c05eed96d3f90b8b2942d20e67eacb031d892c136c360a  java  
                                                                                                                                                        node  

1 jobs

Package                                                                Digest                                                                   Dependencies  
java/889c392818ee6efc9e38f3db86b55d757d068d70a9087f95ee628eefc3751fec  sha256:6b0c68cb8ea090112af7b2948ef9843cd23a9c85c98710758ecd1e4595e04d35  -  
node/1dc0f3b044375b6865258b9207d6a12695baac376834090e60d86e1f3a5ee231  sha256:7e66dc70d7c9001346bc7375273c586455301c4f40407c79d315e04dddd994d1  -  

2 packages

Succeeded

... and manually botched one of the SHA codes:

$ bosh-cli create-release --validate-blob-origin
Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (207 MB) (id: - sha1: sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbaf) started

Blob download 'java/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz' (id: -) failed

Validating SHA for 'https://github.com/adoptium/temurin21-binaries/releases/download/jdk-21.0.6%!B(MISSING)7/OpenJDK21U-jdk_x64_linux_hotspot_21.0.6_7.tar.gz':
  Expected stream to have digest 'sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbaf' but was 'sha256:a2650fba422283fbed20d936ce5d2a52906a5414ec17b2f7676dddb87201dbae'

Exit code 1

I'm just using the current reporting code. I removed an HREF from one blob, and it's currently reported as an error. I'll try to make that a warning instead. If you want to validate the source but no source is configured, I'm thinking that's not an error, just a warning that the blob can't be validated.

a2geek avatar Mar 26 '25 15:03 a2geek

Great progress!! Yeah, it makes sense to start with a warning; once this feature has been adopted more broadly, we can always add another flag (like --expect-all-origins).

At some point, it would also be good to think about how to record in the final release metadata that the blob origins were checked during release creation.

rkoster avatar Mar 27 '25 08:03 rkoster

This should be ready for review.

@a2geek, it would be great if you could resolve the merge conflicts.

beyhan avatar Apr 24 '25 14:04 beyhan

I think I got it resolved!

a2geek avatar Apr 28 '25 21:04 a2geek

Moved to the review section.

aramprice avatar Jun 13 '25 19:06 aramprice

My original intent with this change was inspired by what I thought was a fact: that the only blobs that go in the blob store are the binaries we attach while developing. I have found that this is not true. In shifting between computers, I realized that all the "internal" packages and jobs are also stored as blobs, and that the BOSH CLI cannot recover (at least not easily) if those blobs don't exist, even for prior versions of those blobs. Given these realizations, I don't think this change could ever have worked. We all need to use a blob store, and a commercial one if we are doing public work.

Sorry for the chase!

Cancelling the PR.

a2geek avatar Jun 26 '25 16:06 a2geek

@a2geek - no worries, thank you for the reply and for digging into this.

If you ever feel like working on the "metadata pointing to the upstream source" aspect that is part of this PR, that would definitely be valuable.

In any case, welcome to the BOSH world, and thanks for raising this!

aramprice avatar Jun 26 '25 16:06 aramprice