
Fix UnicodeDecodeError when reading packed-refs with non-UTF8 characters

Status: Open • MirrorDNA-Reflection-Protocol opened this issue 1 month ago • 0 comments

Summary

Fixes #2064

The packed-refs file can contain ref names that are not valid UTF-8 (e.g., Latin-1 encoded tag names created by older Git versions or on systems with different locale settings). Previously, GitPython raised UnicodeDecodeError when reading such a file, so anything that enumerates refs, such as listing tags, failed.

Reproduction

As described in #2064:

git clone https://github.com/ACRA/acra
cd acra
python -c 'import git; print(git.Repo(".").tags)'

Before fix:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 6216: invalid continuation byte

After fix: Successfully reads all 101 tags.

Changes

  • Add errors='surrogateescape' to the open() call in _iter_packed_refs()
  • This allows reading files with arbitrary byte sequences while preserving valid UTF-8 as text
  • Add a test that verifies a packed-refs file containing non-UTF-8 bytes can be read successfully
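
To illustrate the change, here is a minimal sketch of reading a packed-refs style file with the error handler enabled. It is not GitPython's actual code: `read_packed_ref_lines` and `demo` are hypothetical names used only for this example, and the parsing is deliberately simplified.

```python
import os
import tempfile


# Hypothetical helper for illustration; not GitPython's _iter_packed_refs().
def read_packed_ref_lines(path):
    """Return (sha, ref_name) pairs from a packed-refs style file."""
    pairs = []
    # errors="surrogateescape" keeps undecodable bytes instead of raising.
    with open(path, "rt", encoding="UTF-8", errors="surrogateescape") as fp:
        for line in fp:
            line = line.strip()
            # Skip blank lines, the header comment, and peeled-tag lines ("^<sha>").
            if not line or line.startswith("#") or line.startswith("^"):
                continue
            sha, _, name = line.partition(" ")
            pairs.append((sha, name))
    return pairs


def demo():
    """Write a file with one Latin-1 tag name, then read it both ways."""
    sha = "a" * 40
    # b"caf\xe9" is "café" in Latin-1; a 0xE9 byte followed by "\n" is invalid UTF-8.
    data = (
        b"# pack-refs with: peeled fully-peeled sorted \n"
        + sha.encode("ascii")
        + b" refs/tags/caf\xe9\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "packed-refs")
        with open(path, "wb") as fp:
            fp.write(data)
        # A strict UTF-8 read fails, which is the behaviour before the fix.
        try:
            with open(path, "rt", encoding="UTF-8") as fp:
                fp.read()
            strict_failed = False
        except UnicodeDecodeError:
            strict_failed = True
        return strict_failed, read_packed_ref_lines(path)


if __name__ == "__main__":
    strict_failed, pairs = demo()
    print(strict_failed)
    print(ascii(pairs[0][1]))
```

In this sketch the undecodable byte ends up in the ref name as a lone surrogate, so callers that later print or strictly encode the name still need to handle it.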

Technical Details

The surrogateescape error handler is Python's standard approach for handling potentially non-UTF8 data in filesystem operations. It:

  • Passes through valid UTF-8 unchanged
  • Maps each undecodable byte (0x80-0xFF) to a lone Unicode surrogate character (\uDC80-\uDCFF)
  • Preserves the original bytes reversibly: re-encoding with errors='surrogateescape' reproduces them exactly, whereas a strict encode of such a string raises UnicodeEncodeError

This is the same error handler that Python's os.fsdecode() uses on POSIX systems, and it is well suited to filesystem data whose encoding may be unknown or mixed.
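
The round-trip behaviour can be checked with a short standalone sketch. The helper names `roundtrip` and `strict_encode_fails` are hypothetical and exist only for this illustration; the codec calls are standard library.

```python
def roundtrip(raw: bytes):
    """Decode bytes with surrogateescape, then encode them back."""
    text = raw.decode("utf-8", errors="surrogateescape")
    restored = text.encode("utf-8", errors="surrogateescape")
    return text, restored


def strict_encode_fails(text: str) -> bool:
    """Return True if a strict UTF-8 encode rejects the string."""
    try:
        text.encode("utf-8")
        return False
    except UnicodeEncodeError:
        return True


if __name__ == "__main__":
    raw = b"refs/tags/caf\xe9"           # Latin-1 bytes, invalid as UTF-8
    text, restored = roundtrip(raw)
    print(ascii(text))                   # 'refs/tags/caf\udce9'
    print(restored == raw)               # True
    print(strict_encode_fails(text))     # True

    valid = "refs/tags/caf\u00e9".encode("utf-8")
    print(roundtrip(valid)[0] == "refs/tags/caf\u00e9")  # True: valid UTF-8 is unchanged
```

One consequence worth noting: a ref name containing surrogates cannot be printed or written through a strict encoder, so code that outputs such names may need errors='surrogateescape' or errors='backslashreplace' on the output side as well.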