GitPython
Fix UnicodeDecodeError when reading packed-refs with non-UTF8 characters
Summary
Fixes #2064
The packed-refs file can contain ref names that are not valid UTF-8 (e.g., Latin-1 encoded tag names created by older Git versions or systems with different locale settings). Previously, GitPython would fail with UnicodeDecodeError when reading such files.
Reproduction
As described in #2064:
git clone https://github.com/ACRA/acra
cd acra
python -c 'import git; print(git.Repo(".").tags)'
Before fix:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 6216: invalid continuation byte
After fix: Successfully reads all 101 tags.
Changes
- Add errors='surrogateescape' to the open() call in _iter_packed_refs()
  - This allows reading files with arbitrary byte sequences while preserving valid UTF-8 as text
- Add test that verifies non-UTF8 packed-refs can be read successfully
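To illustrate the change, here is a minimal sketch of reading a packed-refs style file with errors='surrogateescape'. It is not GitPython's actual _iter_packed_refs() implementation: the helper name read_packed_refs, the sample file contents, and the simplified line parsing are assumptions made for illustration only.

```python
import os
import tempfile


def read_packed_refs(path):
    """Yield (sha, ref_name) pairs from a packed-refs style file.

    Hypothetical helper for illustration; not part of GitPython.
    """
    # errors="surrogateescape" maps each undecodable byte to a lone surrogate
    # instead of raising UnicodeDecodeError.
    with open(path, "rt", encoding="UTF-8", errors="surrogateescape") as fp:
        for line in fp:
            line = line.strip()
            # Skip blank lines, header comments, and peeled-tag lines ("^<sha>").
            if not line or line.startswith("#") or line.startswith("^"):
                continue
            sha, name = line.split(" ", 1)
            yield sha, name


# Made-up sample: the second tag name contains the Latin-1 byte 0xE9,
# which is not valid UTF-8 in this position.
sample = (
    b"# pack-refs with: peeled fully-peeled sorted \n"
    + b"a" * 40 + b" refs/tags/v1.0\n"
    + b"b" * 40 + b" refs/tags/caf\xe9\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "packed-refs")
    with open(path, "wb") as fp:
        fp.write(sample)
    refs = list(read_packed_refs(path))

for sha, name in refs:
    # repr() keeps the escaped surrogate visible and avoids encoding errors on print.
    print(sha[:7], repr(name))
```

With strict decoding, the same open() call would raise UnicodeDecodeError on the second tag line; with surrogateescape, both refs are returned and the original bytes of the name remain recoverable.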
Technical Details
The surrogateescape error handler is Python's standard approach for handling potentially non-UTF8 data in filesystem operations. It:
- Passes through valid UTF-8 unchanged
- Converts invalid byte sequences to Unicode surrogate characters (\uDC80-\uDCFF)
- Preserves the original bytes in a reversible way (can be re-encoded back to original bytes)
This is the same error handler that Python's os.fsdecode() uses on POSIX systems, and it is well suited to filesystem operations where the encoding may be unknown or mixed.
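The round-trip behaviour described above can be checked with a short standalone snippet. The byte string below is a made-up ref name used only for illustration; it is not taken from the ACRA repository.

```python
# 0xE9 is Latin-1 "é"; followed by "-" it is not a valid UTF-8 sequence.
raw = b"refs/tags/caf\xe9-1.0"

# Strict decoding fails on the invalid byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps the invalid byte 0xE9 to the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# Valid UTF-8 is decoded as usual, with no surrogates involved.
valid = "café".encode("utf-8").decode("utf-8", errors="surrogateescape")

print(strict_ok, ascii(text), round_tripped == raw, valid == "café")
```

Note that a string containing lone surrogates cannot be encoded with the default strict handler, so any code that writes such ref names back out needs to pass errors='surrogateescape' when encoding as well.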