Tag strings without a known encoding as ASCII-8BIT.

Open arthurschreiber opened this issue 9 years ago • 7 comments

Almost all of the strings used inside a git repository are encoding-unaware.

This includes things like refnames, commit metadata, path names, and a lot more. The exception is commit messages, which can optionally be tagged with an encoding through a header in the commit metadata. We should only tag strings with an encoding if we know the exact encoding, and otherwise tag them as ASCII-8BIT (binary).
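A minimal sketch of that rule in plain Ruby, for illustration only. The `tag_bytes` helper and the sample byte strings are hypothetical and not part of Rugged; the sketch assumes the encoding name, when present, comes from the commit's encoding header.

```ruby
# Hypothetical helper (not a Rugged API): tag raw bytes with an encoding
# only when one is known, and fall back to ASCII-8BIT (binary) otherwise.
def tag_bytes(bytes, encoding_name = nil)
  encoding =
    begin
      encoding_name ? Encoding.find(encoding_name) : Encoding::ASCII_8BIT
    rescue ArgumentError
      Encoding::ASCII_8BIT # unknown encoding name: treat the bytes as binary
    end
  bytes.dup.force_encoding(encoding)
end

# A refname carries no encoding information, so it stays binary.
refname = tag_bytes("refs/heads/caf\xC3\xA9".b)

# A commit message whose (assumed) encoding header says ISO-8859-1.
message = tag_bytes("Fix caf\xE9 menu".b, "ISO-8859-1")

refname.encoding                 # Encoding::ASCII_8BIT
message.encoding                 # Encoding::ISO_8859_1
message.encode(Encoding::UTF_8)  # transcoding is safe because the source encoding is known
```

Note that `force_encoding` only changes the tag on the string; the underlying bytes are never modified.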

arthurschreiber avatar Jul 08 '16 12:07 arthurschreiber

US-ASCII 8-BIT

Just to be clear, US-ASCII and ASCII-8BIT are totally different. US-ASCII is a known encoding, whereas ASCII-8BIT means "we don't know what the encoding of this string is". All strings should have an encoding; ASCII-8BIT just means that we don't know what the encoding is (or that it's truly binary data like an image or something).
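The difference is easy to see in plain Ruby. This is just an illustrative sketch with a made-up byte string, not code from Rugged:

```ruby
# The same raw bytes, tagged three different ways.
bytes = "caf\xC3\xA9".b # String#b returns a copy tagged ASCII-8BIT

as_binary   = bytes.dup
as_us_ascii = bytes.dup.force_encoding(Encoding::US_ASCII)
as_utf8     = bytes.dup.force_encoding(Encoding::UTF_8)

# ASCII-8BIT and BINARY are two names for the same encoding object.
same_encoding = Encoding::ASCII_8BIT.equal?(Encoding::BINARY)

binary_valid   = as_binary.valid_encoding?   # any byte sequence is valid binary
us_ascii_valid = as_us_ascii.valid_encoding? # bytes above 0x7F are not valid US-ASCII
utf8_valid     = as_utf8.valid_encoding?     # these bytes happen to be well-formed UTF-8
```

Tagging the bytes as US-ASCII makes a claim that turns out to be false for this string, while tagging them as ASCII-8BIT makes no claim at all.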

tenderlove avatar Jul 08 '16 16:07 tenderlove

I was thinking about this last night. I thought it might be nice if we could configure a repository with encodings. Something like this:

repo = Rugged::Repository.new(ARGV[0], path_encoding: ::Encoding::UTF_8)

That way, if you're dealing with a repository where you know the encoding of the paths, you can just configure the repo with the encoding to use. We could provide options for the bits of data where we can't know the encoding, like paths and tags (those are the only two I can think of; blobs should always be binary).

If we add this configuration value, then we can default it to UTF-8 in order to maintain backwards compatibility.
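For illustration, here is roughly what a caller-side version of this idea could look like without any change to Rugged. The `decode_path` helper and its `path_encoding:` keyword are hypothetical, named after the proposed option; the sketch assumes paths arrive as binary strings and keeps them binary when the bytes do not fit the configured encoding.

```ruby
# Hypothetical helper (not a Rugged API): reinterpret a binary path using a
# caller-supplied encoding, leaving it binary if the bytes are not valid in
# that encoding.
def decode_path(raw_path, path_encoding: Encoding::UTF_8)
  candidate = raw_path.dup.force_encoding(path_encoding)
  return candidate if candidate.valid_encoding?

  raw_path.dup.force_encoding(Encoding::ASCII_8BIT)
end

utf8_path   = decode_path("docs/caf\xC3\xA9.md".b)  # well-formed UTF-8
broken_path = decode_path("docs/caf\xE9.md".b)      # not UTF-8, stays binary
latin1_path = decode_path("docs/caf\xE9.md".b, path_encoding: Encoding::ISO_8859_1)

utf8_path.encoding    # Encoding::UTF_8
broken_path.encoding  # Encoding::ASCII_8BIT
latin1_path.encoding  # Encoding::ISO_8859_1
```

The fallback matters because a single configured encoding cannot be guaranteed to match every path that was ever committed.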

tenderlove avatar Jul 08 '16 16:07 tenderlove

Just to be clear, US-ASCII and ASCII-8BIT are totally different.

Whoops, you're right, I mixed that up! 😄

arthurschreiber avatar Jul 11 '16 08:07 arthurschreiber

I thought it might be nice if we could configure a repository with encodings.

I'm not sure this is a good solution. If you have different people who work on a repo, they might have set different encodings on their machines. I don't think that's too uncommon, especially for people working on Windows. 😞
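A small sketch of the problem, using a made-up file name and assuming one contributor's machine writes paths as UTF-8 and another's as ISO-8859-1:

```ruby
name = "caf\u00E9.txt"

# The bytes each (assumed) machine would store for the same visible name.
utf8_bytes   = name.encode(Encoding::UTF_8).b
latin1_bytes = name.encode(Encoding::ISO_8859_1).b

same_bytes = (utf8_bytes == latin1_bytes) # false: the stored paths differ

# Reading the Latin-1 bytes under a repo-wide UTF-8 setting gives an
# invalid string rather than the original name.
misread       = latin1_bytes.dup.force_encoding(Encoding::UTF_8)
misread_valid = misread.valid_encoding?   # false
```

A single repository-level setting would be correct for only one of these two paths.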

arthurschreiber avatar Jul 11 '16 08:07 arthurschreiber

If you have different people who work on a repo, they might have set different encodings on their machines. I don't think that's too uncommon, especially for people working on Windows. 😞

On Windows, at least, NTFS stores filenames as UTF-16 and there is, AFAIK, no way to use any other encoding. This is fortunate, since the Windows Git implementations do an insane amount of UTF-16 <-> UTF-8 conversion to be able to talk to the Windows APIs, which all speak UTF-16 (or UCS-2 in some cases. Yay!)

Similarly, HFS+ will use a canonically decomposed UTF-16.

I'm just providing data here; I still think that returning these as ASCII-8BIT probably makes a lot of sense.
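To make the filesystem point concrete, here is a plain Ruby sketch (not Rugged or libgit2 code) showing that the same visible name has different bytes depending on normalization form and on UTF-8 versus UTF-16. The file name is made up.

```ruby
nfc = "caf\u00E9"                  # precomposed "é" (U+00E9)
nfd = nfc.unicode_normalize(:nfd)  # decomposed: "e" followed by U+0301

equal_as_strings = (nfc == nfd)    # false: Ruby compares bytes, not glyphs
nfc_size = nfc.bytesize            # 5 bytes in UTF-8
nfd_size = nfd.bytesize            # 6 bytes in UTF-8

# Renormalizing recovers the precomposed form.
round_trip = (nfd.unicode_normalize(:nfc) == nfc)

# The UTF-16 form that a Windows API would expect, and the conversion back.
utf16      = nfc.encode(Encoding::UTF_16LE)
utf16_size = utf16.bytesize        # 8 bytes: four 16-bit code units
back       = utf16.encode(Encoding::UTF_8)
```

None of these conversions is visible from the raw bytes alone, which is consistent with returning them tagged as binary.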

ethomson avatar Jul 11 '16 13:07 ethomson

@tenderlove Is this going to break anything if we push this into production as-is?

arthurschreiber avatar Aug 18 '16 11:08 arthurschreiber

I think it will work, but we should test. I think we're retagging them as binary in all places, so this should be OK.

tenderlove avatar Aug 18 '16 13:08 tenderlove