"Attribute 'path' unset" when accessing the blobs/trees in a tree object returned by git.repo.fun.name_to_object
Example code:
from collections import defaultdict

import git
from git.repo.fun import name_to_object

# repo is an existing git.Repo instance
object_parents = defaultdict(set)
for o in repo.git.rev_list('--objects', '-g', '--no-walk', '--all').split('\n'):
    name = o.split()[0]
    obj = name_to_object(repo, name)
    print(obj.hexsha)
    if type(obj) == git.objects.tree.Tree:
        for b in obj.blobs:
            object_parents[b.binsha].add(obj.binsha)
        for t in obj.trees:
            object_parents[t.binsha].add(obj.binsha)
    elif type(obj) == git.objects.commit.Commit:
        object_parents[obj.tree.binsha].add(obj.binsha)
Output:
  File "/home/al/Source/gitxref/gitxref/__main__.py", line 37, in main
    for b in obj.blobs:
  File "/usr/lib/python3/dist-packages/gitdb/util.py", line 258, in __getattr__
    return object.__getattribute__(self, attr)
  File "/usr/lib/python3/dist-packages/git/objects/tree.py", line 263, in blobs
    return [i for i in self if i.type == "blob"]
  File "/usr/lib/python3/dist-packages/git/objects/tree.py", line 263, in <listcomp>
    return [i for i in self if i.type == "blob"]
  File "/usr/lib/python3/dist-packages/git/objects/tree.py", line 207, in _iter_convert_to_object
    path = join_path(self.path, name)
  File "/usr/lib/python3/dist-packages/gitdb/util.py", line 256, in __getattr__
    self._set_cache_(attr)
  File "/usr/lib/python3/dist-packages/git/objects/tree.py", line 200, in _set_cache_
    super(Tree, self)._set_cache_(attr)
  File "/usr/lib/python3/dist-packages/git/objects/base.py", line 164, in _set_cache_
    % (attr, type(self).__name__))
AttributeError: Attribute 'path' unset: path and mode attributes must have been set during Tree object creation
The same thing happens with obj.trees.
version: 2.1.8-1/python3.6
Thanks for letting us know!
name_to_object is used internally, and seeing the very specific error message, this case is anticipated.
I wonder if it works if name_to_object is avoided, and rev_parse is used instead? The name_to_object function is never used directly, but only from rev_parse.
It does not work with rev_parse either. It raises the same exception.
As a workaround, I simply set a fake path on the object like this:
obj = name_to_object(...)
obj.path = 'unknown'
for b in obj.blobs:
    ...
This has no effect on the result. The actual code is like this: https://github.com/ali1234/gitxref/blob/8bced542d6d60493d9fdd5a8e4b27402c79741eb/gitxref/backrefs.py#L96
If it's not clear from the above code, what I am trying to do is iterate over every commit and tree in the repo and for each, return a list of binsha of subtrees, and for trees, also a list of binsha of blobs. The order does not matter and the paths do not matter. Speed is very important - this operation takes 15 minutes on a kernel tree when using 8 processes. Perhaps there is a way to do this without having to look up each hexsha? Maybe by using gitdb directly?
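Since neither order nor paths matter here, one option might be to skip object construction entirely and read the child SHAs straight out of the raw tree payload. The sketch below is not GitPython code; it only relies on git's tree object layout (a mode, a space, a name, a NUL byte, then a 20-byte binary SHA-1 per entry). The helper names iter_tree_entries and split_children are made up for illustration, and it assumes you already have the raw bytes of a tree, for example from something like repo.odb.stream(binsha).read().

```python
from typing import Iterator, List, Tuple

TREE_MODE = b"40000"      # git stores tree modes without a leading zero
GITLINK_MODE = b"160000"  # submodule commit, not an object of this repo
SHA1_LEN = 20             # assumption: SHA-1 repository (20-byte binsha)


def iter_tree_entries(data: bytes) -> Iterator[Tuple[bytes, bytes, bytes]]:
    """Yield (mode, name, binsha) for each entry in a raw tree payload."""
    pos = 0
    end = len(data)
    while pos < end:
        space = data.index(b" ", pos)        # end of the mode field
        nul = data.index(b"\0", space + 1)   # end of the entry name
        binsha = data[nul + 1:nul + 1 + SHA1_LEN]
        if len(binsha) != SHA1_LEN:
            raise ValueError("truncated tree entry")
        yield data[pos:space], data[space + 1:nul], binsha
        pos = nul + 1 + SHA1_LEN


def split_children(data: bytes) -> Tuple[List[bytes], List[bytes]]:
    """Return (subtree binshas, blob binshas) for one raw tree payload."""
    trees: List[bytes] = []
    blobs: List[bytes] = []
    for mode, _name, binsha in iter_tree_entries(data):
        if mode == TREE_MODE:
            trees.append(binsha)
        elif mode != GITLINK_MODE:
            # regular files, executables and symlinks are all blobs
            blobs.append(binsha)
    return trees, blobs
```

Because no Tree or Blob objects are created, the unset path attribute never comes into play. Whether this is actually faster on a kernel-sized repository would need to be measured.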
I see, so it does appear this bug is inherent to all of GitPython, unless of course the code path GitPython takes right after calling rev_parse applies a similar fix.
For the fastest possible access, you could try using a GitCmdObjectDB. Under the hood, when accessing objects, it uses a persistent git cat-file process that is fed the SHAs you want information about. I would assume this is the fastest way possible, as most of the work is offloaded to cgit (the C implementation of git).
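To illustrate what talking to a persistent cat-file process involves, here is a rough sketch of the git cat-file --batch reply format: each reply is a header line of the form "<sha> <type> <size>", followed by the contents and a trailing newline, or "<name> missing" for unknown objects. This is not how GitPython itself is written; start_cat_file and read_batch_reply are hypothetical names used only for this example.

```python
import subprocess
from typing import BinaryIO, Optional, Tuple


def start_cat_file(repo_path: str) -> subprocess.Popen:
    """Start one long-lived `git cat-file --batch` process for a repo."""
    return subprocess.Popen(
        ["git", "-C", repo_path, "cat-file", "--batch"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )


def read_batch_reply(out: BinaryIO) -> Optional[Tuple[str, str, bytes]]:
    """Read one reply; return (hexsha, type, contents) or None at EOF."""
    header = out.readline()
    if not header:
        return None  # the process closed its output
    fields = header.rstrip(b"\n").split(b" ")
    if fields[-1] == b"missing":
        raise KeyError(fields[0].decode())
    hexsha, objtype, size = fields
    body = out.read(int(size))
    out.read(1)  # consume the newline that follows the contents
    return hexsha.decode(), objtype.decode(), body
```

In use, you would write a hexsha plus a newline to the process's stdin, flush it, and then call read_batch_reply on its stdout, repeating for every object without ever restarting git.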
My version of GitPython appears to use GitCmdObjectDB by default. In fact my program does not work correctly if I tell it to use GitDB - it crashes with "unexpected delta opcode 0", possibly related to the other bug I reported.
It looks like by now GitPython is falling apart :D!
The GitDB implementation is written in pure Python but hasn't been updated in a long time. By now it probably doesn't understand modern repositories anymore, which is another reason GitCmdObjectDB has been the default for some time.
Besides that, I wouldn't know how to make the aforementioned process faster, except maybe by writing the hot portions in Rust (with https://docs.rs/git2/0.7.1/git2/) or using the respective Python bindings.
For the fun of it, I have created a small program which for now only effectively counts commits: https://github.com/Byron/git-count . It now uses the odb for iteration, and seems to produce acceptable results.
You can run it with cargo run.