Encoding issues (known)
There are known problems with encoding: at the moment, files must not be saved as UTF-8 for things to work correctly. In my case I am parsing Lua binary (.lub) files that I did not create, so I cannot simply re-save them in another encoding. The problem is not limited to UTF-8; it affects any encoding other than the default one, such as EUC-KR (cp949), Shift-JIS (cp932), or Chinese (cp936 and cp950).
The cause is the ToString() method of CharPtr.
public override string ToString()
{
    string result = "";
    for (int i = index; (i < chars.Length) && (chars[i] != '\0'); i++)
        result += chars[i];
    return result;
}
The line that appends each character to the result string breaks the encoding because, as far as I can tell, the chars end up limited to the characters of the user's default Windows encoding. In my case this replaced a lot of characters with ?. I cast the chars to byte instead, which keeps all characters.
// requires: using System.Text;
// find the end of the string (first '\0' or end of the array)
int i;
for (i = index; (i < chars.Length) && (chars[i] != '\0'); i++)
{ }
// copy the data from the char array to the byte array;
// the result starts at 0 even when index > 0
byte[] result = new byte[i - index];
for (int x = index; x < i; x++)
{
    result[x - index] = (byte)chars[x];
}
// return the decoded string
return Encoding.GetEncoding(1252).GetString(result);
This is not the best code for the job, and I am not sure whether GetEncoding(1252) is the correct choice here. It returns the original bytes to the caller as a 1252 string, which the caller can then convert to a specific encoding:
»ç°ú => 사과
It seems the default encoding is 1251, which does not contain 0x82-0x9F and 0xAD (at least some of them are missing), so those bytes get replaced with ? (0x3F). Converting the 1252 string to UTF-8 then works as well.
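The caller-side conversion described above could look roughly like the sketch below. It is only an illustration: `Redecode` and `EncodingRoundTrip` are hypothetical names, not part of the library, and it assumes a modern .NET runtime where legacy code pages have to be registered first (on .NET Framework the registration line can be dropped).

```csharp
using System;
using System.Text;

static class EncodingRoundTrip
{
    // Hypothetical helper (not part of the library): takes a string that was
    // decoded as Windows-1252 and re-interprets its bytes in another code page.
    public static string Redecode(string raw, int codePage)
    {
        // Recover the original bytes from the 1252-decoded string.
        byte[] bytes = Encoding.GetEncoding(1252).GetBytes(raw);
        // Decode those bytes with the encoding the file was really written in.
        return Encoding.GetEncoding(codePage).GetString(bytes);
    }

    static void Main()
    {
        // Assumption: modern .NET, where code pages such as 1252 and 949
        // are only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // "\u00BB\u00E7\u00B0\u00FA" is the string »ç°ú from the example above.
        string korean = Redecode("\u00BB\u00E7\u00B0\u00FA", 949);

        // Compare against 사과 (U+C0AC U+ACFC), the result given above.
        Console.WriteLine(korean == "\uC0AC\uACFC");

        // From here the text is an ordinary .NET string and can be
        // written out as UTF-8 if needed.
        byte[] utf8 = Encoding.UTF8.GetBytes(korean);
        Console.WriteLine(utf8.Length);
    }
}
```

If GetEncoding(1252) turns out to lose some bytes, ISO-8859-1 (code page 28591) may be worth trying instead, since it maps every byte to the code point with the same value. The same encoding would then have to be used on both sides of the round trip.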