
Keep single global object cache; Automatically repack all objects into the cloned repo

alsuren opened this issue Jan 30 '13 · 3 comments

This project looks like it's almost exactly what I'm looking for in an object caching solution, but I have RTFS'd and found two problems (correct me if I'm wrong):

  1. You seem to use a separate repo per domain (so if I clone the linux kernel from kernel.org, and then clone android's repo, I still end up downloading duplicated blobs)
  2. Damaged caches will screw you up, as described in: http://randyfay.com/node/119

Assuming that I don't care about disk usage (only network), I would like to propose the following modifications (I will call the locally checked out repo LOCAL_REPO):

  1. Switch to using a single global ~/.git_cached/global_cache.git/
  2. Automatically repack all objects into the locally checked out repo (as suggested by randyfay) so that it works as a standalone repo.
  3. After cloning, in ~/.git_cached/global_cache.git/, run git remote add -f local_$(hash $LOCAL_REPO) $LOCAL_REPO. This makes updating the global cache simply a case of running git remote update in ~/.git_cached/global_cache.git (a rough sketch of the whole flow is below).
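Concretely, the whole flow would look something like this (an untested sketch: the example URL and the md5-based stand-in for "hash" are placeholders, not anything git-cached does today):

```sh
# Untested sketch of the proposed flow; CACHE layout, the URL, and the
# md5-based "hash" are illustrative placeholders.
CACHE=~/.git_cached/global_cache.git
URL=git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
LOCAL_REPO=$PWD/linux

[ -d "$CACHE" ] || git init --bare "$CACHE"   # make sure the cache exists

# 1. Clone, borrowing whatever objects the global cache already has.
git clone --reference "$CACHE" "$URL" "$LOCAL_REPO"

# 2. Repack everything into the clone and drop the alternates link, so it
#    works as a standalone repo even if the cache is later damaged.
(cd "$LOCAL_REPO" && git repack -a -d && rm -f .git/objects/info/alternates)

# 3. Register the clone as a remote of the cache; "git remote update" run
#    in the cache then picks up any objects the clone acquires later.
name="local_$(echo "$LOCAL_REPO" | md5sum | cut -d' ' -f1)"
(cd "$CACHE" && git remote add -f "$name" "$LOCAL_REPO")
```

Note that --reference only saves network traffic when $CACHE already holds relevant objects, which is exactly what step 3 sets up for the next clone.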

Do these improvements sound sane, or am I off my rocker?

alsuren · Jan 30 '13 18:01

Sorry for hijacking your issues page as a work log. I submitted:

https://github.com/alsuren/git/commit/d49a04e49255c70951fb01c343931f44c4e69566

to #git on freenode, and got the following feedback:

[21:20] ~/.cache/git/objects, kthx

[21:20] alsuren_: what's the proposed mechanism for managing the size of this cache?
[21:22] <alsuren_> ojacobson: I was thinking of adding the date to the "remote" name whenever you clone a repo and add its objects to the cache
[21:23] <alsuren_> ojacobson: so you could just loop over all remotes in the cache repo oldest-first and delete it then garbage collect until you're at the desired size (just an idea)
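In other words, something like this (illustrative only: the date-prefixed remote naming scheme and the size budget are assumptions layered on the idea above):

```sh
# Assumes remotes were added with names like local_20130130_<hash>, so a
# plain sort lists them oldest-first; the 10 GiB budget is arbitrary.
CACHE=~/.git_cached/global_cache.git
MAX_KB=$((10 * 1024 * 1024))

cd "$CACHE"
for remote in $(git remote | sort); do
    [ "$(du -sk . | cut -f1)" -le "$MAX_KB" ] && break
    git remote rm "$remote"    # also deletes the remote's tracking refs
    git gc --prune=now         # collect objects nothing references any more
done
```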

[21:19] <alsuren_> cmn: the eventual idea of the patch is to help you create a global object cache in ~/.git_object_cache or similar, so that you can clone the linux kernel from upstream, and then clone it from android and not have to think about whether you're downloading it too many times
[21:19] that involves too much magic
[21:20] how do you know what objects are unreachable?
[21:20] why isn't fixing repo so it knows where you keep related repos a simpler option?
[21:30] and why is this a simpler option than fixing whatever tooling to know that they can get an object db from elsewhere?
[21:33] <alsuren_> cmn: because "fixing whatever tooling to know that they can get an object db from elsewhere" is an ill-defined problem, but any idiot can understand what a cache is
[21:33] caches are one of the hard problems
[21:33] you can't store random objects and expect a clone to be faster
[21:34] you can have as many objects as you want, but if you don't have the right commits, you don't get anything
[21:34] and you won't know beforehand what commits you do need
[21:35] and talking about every commit that every other repo in your computer has is going to be very expensive
[21:35] alsuren_: also, you're doing this to fix some problem you're having, and fixing whatever it is that's causing problems is bound to be easier
[21:36] this isn't some distributed database
[21:36] you need to have the right commits
[21:36] do you know how git push/fetch works?
[21:39] <alsuren_> grawity: cmn: which docs should I be reading to find out about the fetch/pull protocol?
[21:39] Documentation/technical/ in git source?
[21:39] Documentation/technical/pack-protocol.txt

[22:04] <alsuren_> grawity: cmn: so if I specify --object-cache or --reference, my client will send up to 256 * "have ${obj_id}\n" (in 32 line blocks) until it receives "ACK ${obj_id} ready", and then the server will send it a pack with all of the objects that the client doesn't have
[22:05] alsuren_: and if you send random commits from the cache, you probably won't match
[22:10] <alsuren_> cmn: right, so the algorithm for picking these commits needs to start by sending the root commit for all branches in the repo, and then if any of them return with a hit, send ids according to an intelligent search algorithm
[22:11] isn't the point that you don't have a repo?
[22:12] but if you have that, why not use the db that you have where you know things are?
[22:15] <alsuren_> cmn: because that requires you remembering where you put the damned thing
[22:15] that sounds like a very minor problem
[22:15] <alsuren_> or indeed that the two projects are forks of each other in the first place
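(For anyone following along: you can watch this have/ACK negotiation happen on a real fetch with packet tracing, assuming git 1.8.1 or newer for GIT_TRACE_PACKET:)

```sh
# Dump the raw pack-protocol packets for a fetch; the "have"/"ACK" lines
# are exactly the negotiation discussed above.
GIT_TRACE_PACKET=1 git fetch origin 2>&1 | grep -E 'have |ACK|NAK'
```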

I'm thinking that if we added an extension to protocol-capabilities.txt that asks the server for a list of root commits in the initial handshake (as well as the tip refs), we could use it as the basis of an efficient object cache/lookup/negotiation mechanism.
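The client's side of that handshake would be cheap, since root commits are easy to enumerate locally:

```sh
# List a repo's root (parentless) commits; two repos that share a root
# commit share at least part of their history, which is what the cache
# needs to know before offering its objects.
git rev-list --max-parents=0 --all
```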

Will sleep on it and tell you if I have any further brainwaves.

alsuren · Jan 30 '13 23:01

Hey, thanks for the interest, but keep in mind that I did this to scratch my own itch, which is this:

Check out multiple branches of a project at once without having to pull all the objects x number of times. My main concern was speed. Size is somewhat of a concern, but not as much as speed.

Your conversation is way over my head. This is the first bash script I created and I looked up just enough to be able to accomplish the above points.

If you have ideas to make it even better, I'm all for it, as long as I can keep using it like I'm doing now.

I'd love to look into your suggestion, but I'll be very busy for a while. Keep the ideas coming, and maybe some code? :) Thanks!

dvessel · Jan 31 '13 20:01

> You seem to use a separate repo per domain (so if I clone the linux kernel from kernel.org, and then clone android's repo, I still end up downloading duplicated blobs)

I just did it to be safe. It's probably not necessary.

> Damaged caches will screw you up, as described in: http://randyfay.com/node/119

It can, but there are repair and no-cache commands. The latter simply repacks.

dvessel · Jan 31 '13 20:01