bob icon indicating copy to clipboard operation
bob copied to clipboard

bundle

Open rhubert opened this issue 1 year ago • 19 comments

Bundle the results of all checkoutSteps needed to build a package and provide a bundle-config file to be able to rebuild the package only with the bundle.

Such a bundle can be used to build on a air-gapped system or to archive all sources of a build. When building from the bundle only the bundle is extracted but no other checkoutScripts are run.

There is one issue with the actual URL-Scm extraction. The source workspace contains both, the original downloaded file and the extracted sources. This unnecessarily doubles the size of the bundle and - since the bundle-extraction uses the UrlScm as well - produces a different workspace hash when the bundle is extracted. That's why I changed to download the original file into workspace/../_download where also the .extracted file is placed. This change makes the unittest-failing ATM as the ../_download folder is always the same when using a tempDir.

I'm not sure how to proceed here:

  • fix the unit-tests
  • use a workspace folder (e.g. workspace/.bob-download/) for the downloaded files and exclude this directory when hashing / bundling?
  • leave the download location as is and ignore the downloaded + .extracted file?
  • ...?

rhubert avatar Sep 23 '24 12:09 rhubert

I would argue that the tarball download optimization is some welcome but unrelated optimization. I would move it to some separate PR that should probably be merged first.

Reusing the URL SCM for the bundles is IMHO not the right approach. It should instead work like the archive stuff. Right now binary artifacts are used for saving or restoring package steps. What we need here is to save and restore checkout steps. In the best case we can build on the archive module and reuse most code from there.

From a more general angle: should this work for indeterministic checkouts too? I would argue against this and only bundle deterministic checkouts. But if there is a good reason to decide otherwise I'm open to it. It's just that my gut feeling is that it will get nasty to get all corner cases correct...

jkloetzke avatar Sep 26 '24 19:09 jkloetzke

Codecov Report

:x: Patch coverage is 96.51163% with 6 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 89.11%. Comparing base (6d75dfb) to head (4cf0a21).

Files with missing lines Patch % Lines
pym/bob/archive.py 96.69% 4 Missing :warning:
pym/bob/builder.py 96.55% 1 Missing :warning:
pym/bob/cmds/build/build.py 95.45% 1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #586      +/-   ##
==========================================
+ Coverage   89.03%   89.11%   +0.08%     
==========================================
  Files          50       50              
  Lines       16137    16290     +153     
==========================================
+ Hits        14367    14517     +150     
- Misses       1770     1773       +3     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Sep 30 '24 09:09 codecov[bot]

I added a new implementation using the archive stuff. Since adding files to the final bundle using tarfile can not be done in parallel this finalization step is necessary. Maybe this could be optimized somehow but it seams to be impossible to synchronize this asyncio stuff. It's either unable to pickle asyncio.future objects, or the lock is generated in the wrong loop. :shrug: With this finalization method it's somehow simple and works - I don't think time is that much relevant when a bundle is packed.

To get the blackbox test working #606 is required. As of today I haven't tested bundling / unbundling a larger, real world project. I'll do this in the next days.

rhubert avatar Jan 04 '25 17:01 rhubert

I thought long about this and I have no final idea yet. So far, just some random thoughts:

  • We should probably use a zip-file. This has at least the benefit that we don't have to extract the whole bundle up-front when reading from it.
  • Your problems with asyncio come from the fact that all up- and downloads are executed in separate processes. They truly run in parallel and are forks of the original Bob pyhton process.
  • Do we really have to fail on indeterministic checkouts? What if we just ignore them?

IMHO, bundling the sources is nothing special compared to up- or downloading dist workspaces. In fact, I could imagine it would make sense to have this feature for all other transports too. Maybe not as default but as option. As such, what about defining new src-upload and src-download archive flags? If you set them explicitly for a backend, the sources are up- and downloaded.

In the end, the bundle option is just another way of defining a local source upload backend. It's just special in the way that it will create a big zip file in the end. That zip file could also hold "normal" binary artifacts. Not sure if this orthogonal design yields something useful but in principle it looks feasible.

I'm not yet sure what we should do about possible git or svn directories in source workspaces. As they are ignored when hashing the workspace we could skip them when packing the sources too...

jkloetzke avatar Jan 29 '25 20:01 jkloetzke

  • Do we really have to fail on indeterministic checkouts? What if we just ignore them?

:shrug: I don't use indeterministic checkouts so I don't have a strong opinion here. I see valid arguments for all 3 options, so maybe we should add a bundle-indeterministic=[fail,ignore,add] option to move this decision to the user?

IMHO, bundling the sources is nothing special compared to up- or downloading dist workspaces. In fact, I could imagine it would make sense to have this feature for all other transports too. Maybe not as default but as option. As such, what about defining new src-upload and src-download archive flags? If you set them explicitly for a backend, the sources are up- and downloaded.

In the end, the bundle option is just another way of defining a local source upload backend. It's just special in the way that it will create a big zip file in the end. That zip file could also hold "normal" binary artifacts. Not sure if this orthogonal design yields something useful but in principle it looks feasible.

Holding it like this we don't even need the --bundle option, right? The user only has to define a archive backend with src-upload and build with --always-checkout to get the backend populated with all the sources. The bundle-indeterminstic option would than become a src-upload-indeterministic flag?

I'm not yet sure what we should do about possible git or svn directories in source workspaces. As they are ignored when hashing the workspace we could skip them when packing the sources too...

Same as for the indeterministic packages. I think there are use cases where it could be useful to have these directories on the air-gaped machine, e.g. when development takes place there. If it's bundled just for review we can skip them, so I'd add another option here to have this configurable.

rhubert avatar Jan 30 '25 06:01 rhubert

Holding it like this we don't even need the --bundle option, right? The user only has to define a archive backend with src-upload and build with --always-checkout to get the backend populated with all the sources.

I guess the --bundle option might still make sense. It will shorten the steps (define temporary backend with right flags, enable always-checkout, enable source upload).

The bundle-indeterminstic option would than become a src-upload-indeterministic flag?

Sounds reasonable. Also the source up-/download should probably be separate switches.

I'm not yet sure what we should do about possible git or svn directories in source workspaces. As they are ignored when hashing the workspace we could skip them when packing the sources too...

Same as for the indeterministic packages. I think there are use cases where it could be useful to have these directories on the air-gaped machine, e.g. when development takes place there. If it's bundled just for review we can skip them, so I'd add another option here to have this configurable.

Sounds reasonable. I think that we should throw it away by default because it will bloat the archive. But if somebody wants to retain the git-directories, why not...

jkloetzke avatar Feb 01 '25 20:02 jkloetzke

I've created a 1.0 milestone for mid of April because a new release is overdue IMHO. Do you think this PR should be part of it? I'm not entirely sure because it feels the design is not really settled yet. What's your opinion?

jkloetzke avatar Mar 23 '25 10:03 jkloetzke

I did not continue working on this to avoid conflicts with the remote archive clean PR. Mid of April (which year?) might be possible with some priority shifting on my side... but it wouldn't be a big issue if this lands after 1.0.

rhubert avatar Mar 24 '25 05:03 rhubert

Mid of April (which year?)

Obviously this year. Otherwise I wouldn't have asked. :laughing: :wink:

might be possible with some priority shifting on my side... but it wouldn't be a big issue if this lands after 1.0.

Then I would tentatively shift this feature to the next release if there is no need to hurry...

jkloetzke avatar Mar 24 '25 19:03 jkloetzke

All in all I'm not sure how the interaction between bundles- and non-bundles-builds should work. That is, a previous build used one method and the next the other approach. There are 4 possibilities:

1. Bundles build -> bundles build: The bundle was extracted in the first build. The next build will skip the checkout step.

2. Regular build -> regular build: as always

3. Regular build -> bundles build: should this do nothing? Obviously the bundle should not be extracted again. But what if the bundle was updated for an indeterministic checkout? We cannot overwrite source code.

I'd say deterministic packages will stay as they are as long as they are not changed. For indeterministic packages or if the checkout was updated there might be some attic logic needed...

4. Bundles build -> regular build: I think we must prevent the SCMs from running. Otherwise they will immediately collide with the existing files in the workspace.

IMO we also need to move the workspace to attic and run the scm-checkout.

My gut feeling is that these corner cases need more thought...

I don't have a use-case for anything but 1 and 2. Maybe we can just forbid 3 and 4 somehow until there is a need for it?

BTW - I'm doing some testing with a real project ATM. Besides two packages triggering the "Hash doesn't match after unbundle" error (I'm look into it) it was necessary to disable the ci.bobbuildtool.dev archive in basement on my air gapped machine. Otherwise I got BuildError: http://ci.bobbuildtool.dev/artifacts/: Cannot download file: [Errno -3] Temporary failure in name resolution raised on the first QUERY step. Do we really need to raise a build error on OsErrors in downloadPackage?

Also it was necessary to change the layers from managed, ...

rhubert avatar May 19 '25 19:05 rhubert

The failure on name resolution errors should be fixed with the current master. Does it still happen with the latest version? They should be treated like the artifact was not found by now.

jkloetzke avatar May 20 '25 14:05 jkloetzke

Yes, this happens on master + bundle - basically the current state of this PR.

-> https://github.com/BobBuildTool/bob/blob/master/pym/bob/archive.py#L459

I did a quick test on my system and it seams the handling of socket.gaierror is missing there...

rhubert avatar May 20 '25 16:05 rhubert

Yes, this happens on master + bundle - basically the current state of this PR.

-> https://github.com/BobBuildTool/bob/blob/master/pym/bob/archive.py#L459

Grmpf. You looked (like me) at the wrong spot because the error is handled 5 lines above. But there is another path for live build-id. Hopefully fixed with https://github.com/BobBuildTool/bob/commit/5c7e356d1d2d798c328fda1a2f6c177f059774dd.

jkloetzke avatar May 20 '25 18:05 jkloetzke

I'd say deterministic packages will stay as they are as long as they are not changed. For indeterministic packages or if the checkout was updated there might be some attic logic needed...

Sounds good. So whenever we change between bundle-/non-bundle mode the sources will be moved to the attic.

I'm unsure what to do of the indeterministic bundle packages. Is this a use case on your side? If not, I would forbid them.

jkloetzke avatar May 23 '25 19:05 jkloetzke

All in all I'm not sure how the interaction between bundles- and non-bundles-builds should work.

I think we have the same question/issue for regular src-upload / src-download vs. scm-checkouts? If the sources are downloaded in first place but couldn't be downloaded after a recipe change we can not simply run the regular checkout in this workspace. (Also extracting the new sources in the same workspace could lead to a different result.)

So every time a checkout should run we need to move modified existing sources to attic and run the checkout in a fresh workspace. Unmodified sources can be deleted? Or do I miss something?

rhubert avatar Jun 10 '25 08:06 rhubert

So every time a checkout should run we need to move modified existing sources to attic and run the checkout in a fresh workspace.

Right. Sounds like the safest and easiest option. We should prevent re-runs for downloaded indeterministic checkouts though. Unless the bundle is changed, it will yield the same result and just create churn.

BTW, are indeterministic bundled checkouts needed?

Unmodified sources can be deleted? Or do I miss something?

In principle yes. Currently we don't make a checksum before the update. That would be a new case for downloaded source workspaces. I wouldn't mind if we just consistently move the workspace to the attic instead if that turns out to be hard to implement.

jkloetzke avatar Jun 11 '25 20:06 jkloetzke

BTW, are indeterministic bundled checkouts needed?

I don't have a usecase for them

rhubert avatar Jun 12 '25 05:06 rhubert

I found some time to continue on this, but now I'm stuck. For the tests I'd like to test that we only bundle sources as they are referenced by the recipes without any local modifications. But for url-scms the clean-checkout has no effect? (https://github.com/BobBuildTool/bob/pull/586/files#diff-35d1d446d0498b0359b9232bd9c7ec4cbe21a206834fd893aa26b28fdeab20fdR125)

rhubert avatar Sep 25 '25 16:09 rhubert

But for url-scms the clean-checkout has no effect?

No, it doesn't. Bob relies on git and svn to tell us when something was changed. For the URL SCM, no state is tracked internally. :/

jkloetzke avatar Sep 25 '25 19:09 jkloetzke