
git lfs fetch a range of commits

Open • richard-fine opened this issue 1 year ago • 2 comments

Describe the issue

I'm trying to copy a single branch between two remotes - pulling from one and pushing to the other - and this means I also need to pull the LFS files associated with the branch. Without them I get errors when I try to push, as you'd expect.

I cannot use --all because there are terabytes of LFS files in the source repo associated with other branches that I'm not interested in.

I'd hoped git lfs fetch might support a revision range, so that I could run git lfs fetch origin <branch_start>..<branch_tip>, but this doesn't seem to work.

The error messages I get when I push tell me which OIDs I'm missing, but I cannot see any easy way to figure out which commit they are being referenced by, so I can't easily turn them into revs to pass to git lfs fetch.

I've tried a simple loop:

  for rev in $(git rev-list <branch_start> <branch_tip>); do git lfs fetch $rev; done

It seems to work, but it's horrendously slow. I assume that's because fetch scans the entire tree on each revision and cannot exploit the fact that most tree hashes do not change between revisions.

It looks like my best path forward might be to write my own tool that

  • Uses git rev-list to get all revisions in my branch
  • Walks the tree of each revision, using git check-attr to find all LFS pointers at that revision, and parses them to extract the OIDs
  • Filters out OIDs which are already in my LFS storage
  • Passes the remaining revisions that have 1+ missing OIDs to git lfs fetch

which of course I can do, but it feels like I'm re-implementing a nontrivial chunk of what LFS fetch actually does. So I'm wondering if I missed something - is there an easier way to do this?
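The pointer-parsing and filtering steps from the list above can be sketched in Python roughly as follows. This is a minimal illustration, not what git lfs fetch does internally: the function names parse_lfs_pointer and missing_oids are hypothetical, the pointer layout follows the published LFS pointer format, and the local object layout (objects/<first two hex chars>/<next two>/<oid>, normally under .git/lfs) is assumed to be the default one.

```python
import re
from pathlib import Path
from typing import Iterable, List, Optional

# Pointer layout: a "version" line, then "oid sha256:<64 hex chars>",
# then "size <bytes>".
_OID_RE = re.compile(rb"^oid sha256:([0-9a-f]{64})$", re.MULTILINE)
_VERSION_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def parse_lfs_pointer(blob: bytes) -> Optional[str]:
    """Return the SHA-256 OID if `blob` looks like an LFS pointer, else None."""
    if len(blob) >= 1024 or not blob.startswith(_VERSION_PREFIX):
        return None
    match = _OID_RE.search(blob)
    return match.group(1).decode("ascii") if match else None


def missing_oids(oids: Iterable[str], lfs_objects_dir: Path) -> List[str]:
    """Keep only OIDs with no file at <dir>/<oid[0:2]>/<oid[2:4]>/<oid>."""
    return [
        oid
        for oid in dict.fromkeys(oids)  # de-duplicate while keeping order
        if not (lfs_objects_dir / oid[:2] / oid[2:4] / oid).is_file()
    ]
```

Calling missing_oids(found, Path(".git/lfs/objects")) would then leave only the objects that still need downloading, assuming the repository uses the default storage location.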

System environment

Problem exists on both Windows and macOS.

Output of git lfs env

git-lfs/3.4.0 (GitHub; windows amd64; go 1.20.6; git d06d6e9e)
git version 2.43.0.windows.1

richard-fine, May 18 '24 18:05

Hey,

I'm not sure there's a good way to do this efficiently. You probably want to use git rev-list A..B --not --remotes=dest (where dest is the destination remote), which will make this a lot more efficient and avoid traversing all objects that are already on the destination, but it's still not going to be screamingly performant.
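As a rough sketch, the rev-list invocation above could be wrapped in Python like this. The --objects flag is an addition not mentioned in the reply; it is assumed here so that blob IDs are listed alongside commits. The function names are hypothetical.

```python
import subprocess
from typing import List, Tuple


def rev_list_cmd(start: str, tip: str, dest_remote: str) -> List[str]:
    """Build: git rev-list --objects <start>..<tip> --not --remotes=<dest>."""
    return ["git", "rev-list", "--objects", f"{start}..{tip}",
            "--not", f"--remotes={dest_remote}"]


def parse_objects(output: str) -> List[Tuple[str, str]]:
    """Split each 'sha[ path]' line into (sha, path); commits have no path."""
    pairs = []
    for line in output.splitlines():
        if not line:
            continue
        # Split on the first space only, so paths containing spaces survive.
        sha, _, path = line.partition(" ")
        pairs.append((sha, path))
    return pairs


def list_new_objects(start: str, tip: str, dest_remote: str) -> List[Tuple[str, str]]:
    """Run the command in the current repository and parse its output."""
    result = subprocess.run(rev_list_cmd(start, tip, dest_remote),
                            capture_output=True, text=True, check=True)
    return parse_objects(result.stdout)
```

The output mixes commits, trees and blobs, so the IDs still need to be classified before any of them can be treated as pointer files.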

We internally use git cat-file --batch to make it more efficient to find the objects without spawning a large number of Git processes, which you can do, too. You can also use git cat-file --batch-check first to find those items which are pointer files (which must be less than 1024 bytes), since sometimes people mark a file as an LFS file and then push the large object anyway. However, this will likely require more work than a simple shell one-liner, so you might want to write something like a Ruby script to handle this.
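A minimal sketch of the --batch-check pre-filter described above, assuming the default output format of one "<object id> <type> <size>" line per input object. The function names and the MAX_POINTER_SIZE constant are hypothetical; the 1024-byte limit is the one quoted in the reply.

```python
import subprocess
from typing import Iterable, List

MAX_POINTER_SIZE = 1024  # pointer files must be smaller than this


def small_blobs(batch_check_output: str, limit: int = MAX_POINTER_SIZE) -> List[str]:
    """Pick object IDs whose line reads '<sha> blob <size>' with size < limit."""
    keep = []
    for line in batch_check_output.splitlines():
        fields = line.split()
        # Missing objects print '<name> missing', which has only two fields.
        if len(fields) == 3 and fields[1] == "blob" and int(fields[2]) < limit:
            keep.append(fields[0])
    return keep


def candidate_pointer_blobs(shas: Iterable[str]) -> List[str]:
    """Feed all IDs to a single `git cat-file --batch-check` process."""
    result = subprocess.run(["git", "cat-file", "--batch-check"],
                            input="".join(f"{s}\n" for s in shas),
                            capture_output=True, text=True, check=True)
    return small_blobs(result.stdout)
```

Only the surviving IDs then need their contents read with git cat-file --batch and checked for the pointer header, which keeps large blobs that were pushed despite being marked as LFS files out of the pipeline.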

I think what you want here for scripting is an equivalent to git lfs push's --object-id flag, which unfortunately doesn't exist yet. It shouldn't be too hard to add if you're interested, but it's ultimately going to be rather difficult to handle as part of scripting without adding that functionality.
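For reference, the existing push-side flag takes a remote followed by object IDs, so a script could drive it along these lines. This is only a sketch: the helper name and the batch size are hypothetical, chosen to keep each command line reasonably short.

```python
from typing import Iterator, List, Sequence


def push_object_id_cmds(remote: str, oids: Sequence[str],
                        batch: int = 100) -> Iterator[List[str]]:
    """Yield `git lfs push --object-id <remote> <oid>...` commands in batches."""
    for i in range(0, len(oids), batch):
        yield ["git", "lfs", "push", "--object-id", remote, *oids[i:i + batch]]
```

Each yielded list can be passed to subprocess.run(cmd, check=True) from inside the repository. The objects must already be in the local LFS store, which is exactly the part that has no fetch-side equivalent.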

bk2204, May 22 '24 13:05

Yeah, I ended up solving this using a Python script which did something along the lines of my original post, except that rather than passing revisions to git lfs fetch, it was easier to directly pipe LFS pointer file content to git lfs smudge (discarding the output, but taking advantage of the fact that it caused the object to be downloaded).
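The smudge trick described above might look something like the following. This is a guess at the shape of such a script, not the actual one: download_via_smudge is a hypothetical name, and the run parameter exists only so the call can be exercised without a real repository.

```python
import subprocess
from typing import Callable


def download_via_smudge(pointer: bytes, run: Callable = subprocess.run) -> None:
    """Pipe pointer text to `git lfs smudge` and throw away the file content.

    smudge reads a pointer on stdin and writes the large file to stdout,
    downloading it into local LFS storage first if it is not already there.
    """
    run(["git", "lfs", "smudge"], input=pointer,
        stdout=subprocess.DEVNULL, check=True)
```

Because smudge downloads from whichever LFS endpoint the repository is configured to use, the source remote has to be the one Git LFS resolves by default when this runs.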

And yes, supporting --object-id for git lfs fetch would be nice and would avoid needing to do the smudge shenanigans. Ideally LFS would expose facilities for synchronizing remote/local LFS object stores without needing to touch commits - push --object-id is half of that story, but we're missing a corresponding fetch piece.

richard-fine, May 22 '24 14:05