"Attribute 'path' unset" when accessing the blobs/trees in a tree object returned by git.repo.fun.name_to_object

Open ali1234 opened this issue 7 years ago • 8 comments

Example code:

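from collections import defaultdict

import git
from git.repo.fun import name_to_object

repo = git.Repo('.')  # assumption: the script runs inside the repository
object_parents = defaultdict(set)  # child binsha -> binshas of its parents

for o in repo.git.rev_list('--objects', '-g', '--no-walk', '--all').split('\n'):
    name = o.split()[0]  # each line is "<hexsha> [<path>]"
    obj = name_to_object(repo, name)
    print(obj.hexsha)
    if isinstance(obj, git.objects.tree.Tree):
        for b in obj.blobs:  # raises the AttributeError shown below
            object_parents[b.binsha].add(obj.binsha)
        for t in obj.trees:
            object_parents[t.binsha].add(obj.binsha)
    elif isinstance(obj, git.objects.commit.Commit):
        object_parents[obj.tree.binsha].add(obj.binsha)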

Output:

  File "/home/al/Source/gitxref/gitxref/__main__.py", line 37, in main
    for b in obj.blobs:
  File "/usr/lib/python3/dist-packages/gitdb/util.py", line 258, in __getattr__
    return object.__getattribute__(self, attr)
  File "/usr/lib/python3/dist-packages/git/objects/tree.py", line 263, in blobs
    return [i for i in self if i.type == "blob"]
  File "/usr/lib/python3/dist-packages/git/objects/tree.py", line 263, in <listcomp>
    return [i for i in self if i.type == "blob"]
  File "/usr/lib/python3/dist-packages/git/objects/tree.py", line 207, in _iter_convert_to_object
    path = join_path(self.path, name)
  File "/usr/lib/python3/dist-packages/gitdb/util.py", line 256, in __getattr__
    self._set_cache_(attr)
  File "/usr/lib/python3/dist-packages/git/objects/tree.py", line 200, in _set_cache_
    super(Tree, self)._set_cache_(attr)
  File "/usr/lib/python3/dist-packages/git/objects/base.py", line 164, in _set_cache_
    % (attr, type(self).__name__))
AttributeError: Attribute 'path' unset: path and mode attributes must have been set during Tree object creation

The same thing happens with obj.trees.

ali1234 avatar May 27 '18 05:05 ali1234

Version: GitPython 2.1.8-1 on Python 3.6

ali1234 avatar May 27 '18 05:05 ali1234

Thanks for letting us know!

name_to_object is used internally, and judging by the very specific error message, this case is anticipated. I wonder whether it works if name_to_object is avoided and rev_parse is used instead? The name_to_object function is never called directly, only from rev_parse.

Byron avatar Jun 05 '18 11:06 Byron

It does not work with rev_parse either. It raises the same exception.

As a workaround, I simply set a fake path on the object like this:

obj = name_to_object(...)
obj.path = 'unknown'
for b in obj.blobs:
    ...

This has no effect on the result. The actual code is like this: https://github.com/ali1234/gitxref/blob/8bced542d6d60493d9fdd5a8e4b27402c79741eb/gitxref/backrefs.py#L96

ali1234 avatar Jun 05 '18 16:06 ali1234

If it's not clear from the above code, what I am trying to do is iterate over every commit and tree in the repo and, for each one, return a list of the binshas of its subtrees and, for trees, also a list of the binshas of its blobs. The order does not matter and the paths do not matter. Speed is very important: this operation takes 15 minutes on a kernel tree when using 8 processes. Perhaps there is a way to do this without having to look up each hexsha? Maybe by using gitdb directly?
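Since neither order nor paths matter here, one possible approach is to parse the raw tree body directly instead of building Tree objects. The following is a minimal sketch, not GitPython's own parser. It assumes the standard on-disk tree layout (`<octal mode> <name>\0<raw SHA>` repeated) and 20-byte SHA-1 object IDs. `iter_tree_entries` and `child_shas` are hypothetical helper names.

```python
TREE_MODE = 0o040000     # entry is a subtree
GITLINK_MODE = 0o160000  # entry is a submodule commit, not stored in this repo
SHA_LEN = 20             # assumption: SHA-1 repository


def iter_tree_entries(data):
    """Yield (mode, name, binsha) for each entry in a raw tree object body."""
    pos = 0
    while pos < len(data):
        space = data.index(b" ", pos)      # end of the octal mode
        nul = data.index(b"\0", space)     # end of the entry name
        mode = int(data[pos:space], 8)
        name = data[space + 1:nul]
        binsha = data[nul + 1:nul + 1 + SHA_LEN]
        pos = nul + 1 + SHA_LEN
        yield mode, name, binsha


def child_shas(data):
    """Return (subtree binshas, blob binshas) for one raw tree body."""
    trees, blobs = [], []
    for mode, _name, binsha in iter_tree_entries(data):
        if mode == TREE_MODE:
            trees.append(binsha)
        elif mode != GITLINK_MODE:         # regular files and symlinks are blobs
            blobs.append(binsha)
    return trees, blobs
```

With GitPython, the raw body of a tree can probably be obtained with something like repo.odb.stream(binsha).read(), but that call is not exercised in this sketch.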

ali1234 avatar Jun 05 '18 16:06 ali1234

I see, so it does appear this bug is inherent to all of GitPython, unless of course the code path GitPython takes right after calling rev_parse applies a similar fix.

For the fastest possible access, you could try using a GitCmdObjectDB. Under the hood, when accessing objects, it uses a persistent instance of git cat-file, which is fed the SHAs you want information about. I would assume this is the fastest way possible, as most of the work is offloaded to cgit.
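To illustrate what talking to a persistent git cat-file process involves, here is a rough sketch of a reader for the `git cat-file --batch` output format: a header line `<sha> <type> <size>`, then `<size>` bytes of content and a trailing newline, or `<name> missing` for unknown objects. This is not GitPython's implementation. `read_batch_objects` is a hypothetical helper, and it reads from any binary stream so that it can be tried without a running git process.

```python
import io


def read_batch_objects(stream):
    """Yield (name, type, data) tuples parsed from `git cat-file --batch` output.

    Missing objects are reported as (name, None, None).
    """
    while True:
        header = stream.readline()
        if not header:                     # EOF: the process closed its stdout
            return
        fields = header.split()
        if fields[-1] == b"missing":
            yield fields[0].decode(), None, None
            continue
        hexsha, objtype, size = fields
        data = stream.read(int(size))      # object body, exactly `size` bytes
        stream.read(1)                     # consume the newline after the body
        yield hexsha.decode(), objtype.decode(), data


# Example with canned output instead of a live process.
canned = io.BytesIO(b"abc123 blob 5\nhello\n" b"deadbeef missing\n")
for name, objtype, data in read_batch_objects(canned):
    print(name, objtype, data)
```

In real use, the stream would be the stdout of a long-lived `git cat-file --batch` subprocess, with one SHA per line written to its stdin and flushed before each read.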

Byron avatar Jun 06 '18 05:06 Byron

My version of GitPython appears to use GitCmdObjectDB by default. In fact my program does not work correctly if I tell it to use GitDB - it crashes with "unexpected delta opcode 0", possibly related to the other bug I reported.

ali1234 avatar Jun 06 '18 05:06 ali1234

It looks like GitPython is falling apart by now :D! The GitDB implementation is written in pure Python but hasn't been updated in a long time, so it probably no longer understands modern repositories. That is another reason why GitCmdObjectDB has been the default for some time now. Besides that, I wouldn't know how to make the aforementioned process faster, except maybe by writing the hot portions in Rust (with https://docs.rs/git2/0.7.1/git2/) or by using the respective Python bindings.

Byron avatar Jun 06 '18 06:06 Byron

For the fun of it, I have created a small program which for now only effectively counts commits: https://github.com/Byron/git-count . It now uses the odb for iteration, and seems to produce acceptable results. You can run it with cargo run.

Byron avatar Jun 06 '18 06:06 Byron