git-buildpackage icon indicating copy to clipboard operation
git-buildpackage copied to clipboard

Add pristine-lfs support

Open andrewshadura opened this issue 4 years ago • 10 comments

This pull request adds pristine-lfs support in a similar way to the existing pristine-tar support. Since pristine-lfs has a Python interface, I use that instead of the command-line interface, but in a fail-safe way, providing a plug raising an exception when pristine-lfs cannot be imported.

Unlike pristine-tar, pristine-lfs supports looking up tarballs by a version, so this implementation uses that instead of picking tarballs manually.

If both pristine-lfs and pristine-tar are enabled, tarballs are only imported with pristine-lfs; when checking out, pristine-lfs is preferred, but pristine-tar processed anyway if pristine-lfs is missing the necessary tarball.

This PR is still work in progress, it’s missing documentation and tests. Should I maybe add a subcommand for pristine-lfs too?

I’m also unsure what exceptions to raise depending on the error situation.

andrewshadura avatar Apr 05 '21 20:04 andrewshadura

@andrewshadura thanks! This looks like a useful addition. Thinks i'm unsure about:

  • prefer lfs over tar or the other way around (and why)? (popcon seems to prefer pristine-tar) https://qa.debian.org/popcon-graph.php?packages=pristine-tar+pristine-lfs&show_installed=on&want_legend=on&want_ticks=on&from_date=&to_date=&hlght_date=&date_fmt=%25Y-%25m&beenhere=1
  • fallback mechanism: what if pristine-lfs is better recreating a tarball than pristine-tar (or the other way around) when e.g. there's new compressors etc.
  • should there be a simple way to do both on import (pristine-tar and pristine-lfs)

This PR is still work in progress, it’s missing documentation and tests. Should I maybe add a subcommand for pristine-lfs too?

Hmm...would it make sense to use gbp pristine-tar for that instead of having the user worry about which tool to use?

I’m also unsure what exceptions to raise depending on the error situation.

That does't look too bad from a quick glance.

agx avatar Apr 24 '21 11:04 agx

* prefer lfs over tar or the other way around (and why)? (popcon seems to prefer `pristine-tar`)
  https://qa.debian.org/popcon-graph.php?packages=pristine-tar+pristine-lfs&show_installed=on&want_legend=on&want_ticks=on&from_date=&to_date=&hlght_date=&date_fmt=%25Y-%25m&beenhere=1
* fallback mechanism: what if `pristine-lfs` is better recreating a tarball than `pristine-tar` (or the other way around) when e.g. there's new compressors etc.

So, pristine-lfs and pristine-tar are built on different principles. What is stored:

  • pristine-tar records the hash of the Git tree the tarball corresponds to, the hash of the tarball itself, and a binary delta between its Git export and the actual tarball
  • pristine-lfs records the hash of the tarball only; the actual tarball is stored and transferred out-of-band using Git LFS mechanisms. How tarballs are recreated:
  • pristine-tar only needs the Git repository itself: it uses the Git tree referenced in its metadata and the binary delta to recreate the tarball. It needs to explicitly support each compressor and its quirks to be able to efficiently store the tarball and precisely recreate it
  • pristine-lfs needs the Git repository itself, but it also needs to download binary blobs for each tarball separately. They’re stored inside .git but not as Git objects. When a tarball is checked out, it’s basically downloaded if it’s missing and then copied from the internal storage to the desired location. This means that for off-line operation all tarballs need to be cached beforehand. Conclusions:
  • for a lot of tarballs especially with tricky compressors, when you need just one, the bandwidth to recreate it will be less for pristine-lfs than pristine-tar since pristine-lfs branch only contains text metadata but no tarball data — with pristine-tar the branch will contain all binary deltas for all tarballs, whereas pristine-lfs only downloads tarballs on-demand. (On the other hand, the tarball data is already on the upstream code branch.)
  • when you need all tarball ever imported, pristine-tar will probably save you a lot of bandwidth
  • pristine-lfs doesn’t care about compressors or the tarball format
  • pristine-tar doesn’t require any server-side support, pristine-lfs needs a Git LFS server.

Given the above I think the first try should be pristine-lfs (if available) and then pristine-tar as a fallback.

* should there be a simple way to do both on import (pristine-tar and pristine-lfs)

I agree: --pristine-lfs --pristine-tar on import should do both, not just one.

andrewshadura avatar Dec 17 '21 09:12 andrewshadura

Thanks ore spelling out the details. If we make import-orig smart enough a separate gbp prstine-lfs sounds more useful than somehow folding this into gbp pristine-tar. The only bit missing then is enough config to tell import-orig what to do on import (pristine-tar, pristine-lfs, both) and for export-orig what to prefer/how to fall back. Does that make sense?

agx avatar Dec 31 '21 12:12 agx

@agx, I finally found time for this pull request.

Given how long it’s been around — and I’ve been using it all that time! — I think we should merge the minimal dumb version of it without extra smartness, as described in the last message: import to either pristine-tar or pristine-lfs only when told to, when both are enabled, import to both. Export checks both, prefers pristine-lfs. I haven’t added any configuration for that, but I can’t really think about many situations when this might be necessary — unless someone commits different tarballs to both pristine-lfs and pristine-tar.

I have described the approach and the rationale in the commit message for 531971cb7.

andrewshadura avatar Dec 14 '22 18:12 andrewshadura

By the way, I’ve just noticed the RPM test data repo is gone — is that intentional?

andrewshadura avatar Dec 14 '22 18:12 andrewshadura

@agx, is there any chance to get this merged before the freeze?

andrewshadura avatar Feb 02 '23 21:02 andrewshadura

Seems github ate my reply: Can we have tests for the new options (enabled via GBP_NETWORK_TESTS) that e.g. take an URL and have a small python-gitlab helper that sets up such a repo on salsa. We'd rather give people an easy way to check if things are still working after making changes when introducing a new way of reproducing tarballs.

agx avatar Feb 03 '23 10:02 agx