PowerShellForGitHub icon indicating copy to clipboard operation
PowerShellForGitHub copied to clipboard

PowerShellForGitHub.log File Character Encoding Differs Between PowerShell 5 and 7

Open X-Guardian opened this issue 5 years ago • 10 comments

Issue Details

The PowerShellForGitHub.log is written with different character encoding depending upon whether PowerShell 5 or PowerShell 7 are used. PowerShell 5 encoding is UTF-16 LE and PowerShell 7 is UTF-8. This causes an issue if the PowerShellForGithub module is used on both PowerShell 5 and PowerShell 7 on the same computer, and corrupted entries in the log will occur.

Steps to reproduce the issue

$PowerShellForGitHubLogPath = "$([Environment]::GetFolderPath('MyDocuments'))\PowerShellForGitHub.log" 
Remove-Item -Path $PowerShellForGitHubLogPath
pwsh.exe -Command { Get-GitHubRepository -OwnerName 'Octocat' | Out-Null }
powershell.exe -Command { Get-GitHubRepository -OwnerName 'Octocat' | Out-Null }
Get-Content -Path $PowerShellForGitHubLogPath 

Review

Verbose logs showing the problem

2020-06-14 12:18:45 : Simon : VERBOSE : [0.14.0] Executing: Get-GitHubRepository -OwnerName "Octocat"
2020-06-14 12:18:45 : Simon : VERBOSE : Restoring Access Token from file.  This value can be cleared in the future by calling Clear-GitHubAuthentication.
2020-06-14 12:18:45 : Simon : VERBOSE : Getting repos for Octocat
2020-06-14 12:18:45 : Simon : VERBOSE : Accessing [Get] https://api.github.com/users/Octocat/repos? [Timeout = 0)]
2020-06-14 12:18:46 : Simon : VERBOSE : Telemetry has been disabled via configuration. Skipping reporting event [Get-GitHubRepository].
2 0 2 0 - 0 6 - 1 4   1 2 : 1 8 : 4 8   :   S i m o n   :   V E R B O S E   :   [ 0 . 1 4 . 0 ]   E x e c u t i n g :   G e t - G i t H u b R e p o s i t o r y   - O w n e r N a m e   " O c t o c a t "

 2 0 2 0 - 0 6 - 1 4   1 2 : 1 8 : 4 8   :   S i m o n   :   V E R B O S E   :   R e s t o r i n g   A c c e s s   T o k e n   f r o m   f i l e .     T h i s   v a l u e   c a n   b e   c l e a r e d   i n   t h e   f u t u r e   b y   c a l l i n g   C l e a r - G i t H u b A u 
t h e n t i c a t i o n .

 2 0 2 0 - 0 6 - 1 4   1 2 : 1 8 : 4 8   :   S i m o n   :   V E R B O S E   :   G e t t i n g   r e p o s   f o r   O c t o c a t

 2 0 2 0 - 0 6 - 1 4   1 2 : 1 8 : 4 8   :   S i m o n   :   V E R B O S E   :   A c c e s s i n g   [ G e t ]   h t t p s : / / a p i . g i t h u b . c o m / u s e r s / O c t o c a t / r e p o s ?   [ T i m e o u t   =   0 ) ]

 2 0 2 0 - 0 6 - 1 4   1 2 : 1 8 : 4 9   :   S i m o n   :   V E R B O S E   :   T e l e m e t r y   h a s   b e e n   d i s a b l e d   v i a   c o n f i g u r a t i o n .   S k i p p i n g   r e p o r t i n g   e v e n t   [ G e t - G i t H u b R e p o s i t o r y ] .

Suggested solution to the issue

Add -Encoding ascii to the Out-File cmdlet call in the Write-Log function of the Helpers.ps1 module. This will force the log output to be UTF-8 without BOM on all PowerShell versions.

Note, on PowerShell 5 and below, any current log will have been written using UTF-16 LE, so this log would need to be dealt with when making this change (backed up, converted, deleted ?)

Requested Assignment

  • I'm just reporting this problem, but don't want to fix it.

Operating System

OsName               : Microsoft Windows 10 Pro
OsOperatingSystemSKU : 48
OsArchitecture       : 64-bit
WindowsVersion       : 1903
WindowsBuildLabEx    : 18362.1.amd64fre.19h1_release.190318-1202
OsLanguage           : en-GB
OsMuiLanguages       : {en-GB, en-US}

PowerShell Version

PowerShell 5
Name                           Value
----                           -----
PSVersion                      5.1.18362.752
PSEdition                      Desktop
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
BuildVersion                   10.0.18362.752
CLRVersion                     4.0.30319.42000
WSManStackVersion              3.0
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
PowerShell 7
Name                           Value
----                           -----
PSVersion                      7.0.1
PSEdition                      Core
GitCommitId                    7.0.1
OS                             Microsoft Windows 10.0.18362
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

Module Version

Running: 0.14.0 Installed: 0.14.0

X-Guardian avatar Jun 14 '20 11:06 X-Guardian

Thanks for catching this!

I think I had noticed it in passing and was wondering what was going on here, but not enough to look into it any further.

Given that Unicode characters can be used with the GitHub API, it makes more sense to use utf8 vs ansi, but otherwise I agree that it was an oversight that an explicit encoding format wasn't specified in that function.

HowardWolosky avatar Jun 15 '20 02:06 HowardWolosky

Is this up for grabs?

giuseppecampanelli avatar Jun 16 '20 03:06 giuseppecampanelli

Is this up for grabs?

Go for it. To be clear, have it use utf8.

HowardWolosky avatar Jun 16 '20 04:06 HowardWolosky

Actually, play with it a bit. I'm still seeing oddities using utf8 in PS5. You might want to try unicode instead. Do some testing with the different encoding formats across both PS5 and PS7 to make sure that logging from commandline usage as well as from Pester all output the same without any unexpected characters.

HowardWolosky avatar Jun 16 '20 05:06 HowardWolosky

Thanks and will do!

giuseppecampanelli avatar Jun 16 '20 10:06 giuseppecampanelli

@HowardWolosky I've tested both Unicode and UTF8 encodings and both seem to solve the issue. The issue was caused due to the fact that Out-File in PS5 has a default encoding of Unicode and Out-File in PS7 has a default encoding of UTF8NoBOM so they were mismatched.

I couldn't notice any issues with either, what were the oddities you noticed with UTF8? I tested outputs from both executing the command directly and from Pester.

giuseppecampanelli avatar Jun 18 '20 01:06 giuseppecampanelli

It looks like the issue I saw was related to what @X-Guardian mentioned in his initial post:

Note, on PowerShell 5 and below, any current log will have been written using UTF-16 LE, so this log would need to be dealt with when making this change (backed up, converted, deleted ?)

When I left my existing file as-is bit changed it to append using UTF-8, this is what I would find at the end of the file when new content got added after the change:

badEncoding

If I deleted the file entirely, the problem went away when it got re-created.

I think what we need to do:

Add a new method to Helpers.ps1. Something like Repair-LogFile.

  • It will make use of the idea from here to read the first 3 bytes of the existing log file (if it exists). If the bytes aren't @(239, 187, 191) (e.g. EF. BB. BF) then it's the wrong encoding.
  • In that scenario, we should move/backup the file and write a warning to the user to let them know that it happened.
$logPath = Get-Item -Path (Get-GitHubConfiguration -Name LogPath)
$backupFileName = $logPath.BaseName + ".backup" + $logPath.Extension
$backupPath = (Join-Path -Path $logPath.Directory -ChildPath $backupFileName)
Move-Item -Path $logPath -backupPath
Write-Log -Message "Log file encoding format has changed.  Old file has been moved to [$backupPath]." -Level Warning

Then I think a new script file needs to be added to the module (something like OnModuleLoad.ps1) and then have that added to ScriptsToProcess in the manifest. Have OnModuleLoad.ps1 call Repaid-LogFile. Then, the log check and migration will happen immediately upon being loaded.

Make sense?

HowardWolosky avatar Jun 21 '20 02:06 HowardWolosky

How about just converting the file, it is then transparent to the user? Do the same byte check then read the file and write back out with the required encoding if required.

I wouldn't use -Encoding utf8 as that add a BOM to the file. -Encoding ascii writes UTF8 without a BOM.

X-Guardian avatar Jun 21 '20 10:06 X-Guardian

@X-Guardian wouldn't ASCII be an issue since the GitHub API can use Unicode characters?

giuseppecampanelli avatar Jun 21 '20 20:06 giuseppecampanelli

How about just converting the file, it is then transparent to the user? Do the same byte check then read the file and write back out with the required encoding if required.

Generally speaking, I like that idea. I think a potential problem with it (albeit a one-time only problem) is if the file is substantially large (because it's never been pruned), it could take a while to read in and read out, which will silently block the user's console window with no clear explanation.

According to this article, when writing the file back out, we should use StreamWriter to achieve the fastest performance (tip of the hat to StackOverflow for that link.

Similarly, it appears that when trying to read the whole file in, you should probably use [System.IO.File]::ReadLines().

I wouldn't use -Encoding utf8 as that add a BOM to the file. -Encoding ascii writes UTF8 without a BOM.

I've just finished reading through lots of opinions on the BOM vs no BOM argument. I see the merit of avoiding the BOM, but I see no way of doing so while sticking strictly with PowerShell given the encoding options that v4/5 have (as opposed to 7 which does have a no BOM option).

Luckily, it looks like our preferred way of writing, based on that provided article, does provide a utf8 option that excludes the BOM....

$path = "c:\users\howard\desktop\test.txt"
$stream = [System.IO.StreamWriter]::new($path, [System.Text.Encoding]::UTF8)
$stream.Write('')
$stream.Close()

# This results in a 0-byte file which is also confirmed by:
Get-Content -Encoding Byte -ReadCount 3 -Path $path
# Returns nothing

# If we try again with a single character:
$stream = [System.IO.StreamWriter]::new($path, [System.Text.Encoding]::UTF8)
$stream.Write('a')
$stream.Close()
Get-Content -Encoding Byte -ReadCount 3 -Path $path

# That returns back 97, which is the letter 'a': https://www.codetable.net/decimal/97

Given this, I think the path forward is to have the Repair-LogFile method as I previously described. It should detect the first two bytes of the file. If they are @(255, 254), that means that it's the UTF-16 LE encoding from PS4/5. That's when you would read in the file with [System.IO.File]::ReadLines() and then write the whole content out again with StreamWriter.

Then you would also update Write-Log to use StreamWriter.WriteLine() to append the log file entry moving forward, as opposed to Out-File -Append. That way, we can ensure uniform encoding across all PowerShell versions moving forward.

@X-Guardian -- are you seeing issues with this plan?

HowardWolosky avatar Jun 21 '20 23:06 HowardWolosky