
String::String(const char*) is expecting latin1 instead of UTF-8

Open saki7 opened this issue 3 years ago • 7 comments

Since #695, String::String(const char*) is now internally calling string_new_with_latin1_chars instead of string_new_with_utf8_chars.

diff: https://github.com/godotengine/godot-cpp/commit/bf8fc4c53d07cf8ac4d8343364899c197076fbc2#diff-20a0dfd8072691f654c12adb6aecabee9d3247fb99488b37539dab1d967c97b8L77-L79

Because of this change, we can't pass non-latin1 characters for String-like parameters anymore. For example, node->get_node("Path/To/NonLatin1/名前") fails.

Technically speaking, char* is encoded in a native encoding, which is not guaranteed to be either UTF-8 or Latin-1 (by definition). Thus, we should not choose between Latin-1 and UTF-8 in the first place; we should add a new function string_new_with_native_chars and call it whenever char* is used as an input. But that's a very long story. For the sake of simplicity, we can just expect UTF-8 for char*, because it is the default for many environments today (note that MSVC can be configured to use UTF-8 via the /utf-8 compiler flag).

No offense, but even if the change of the default encoding was intended, it should not have been done in #695, because it is a breaking change. I consider this a regression.

saki7 avatar Jul 11 '22 18:07 saki7

Since https://github.com/godotengine/godot-cpp/pull/695, String::String(const char*) is now internally calling string_new_with_latin1_chars instead of string_new_with_utf8_chars.

This is intentional, it's the same behavior as the Godot String is using in the engine code. Since the same code can be used as both a built-in module and extension, the behavior should be identical.

a breaking change. I consider this a regression.

GDExtension API is not finalized and pretty much every change is breaking.

bruvzg avatar Jul 11 '22 18:07 bruvzg

@bruvzg :

This is intentional, it's the same behavior as the Godot String is using in the engine code. Since the same code can be used as both a built-in module and extension, the behavior should be identical.

Then the engine core is wrong. As I described above, char* is not guaranteed to be Latin-1 by the C++ language spec. In practice, a function which receives char* should successfully parse characters in an arbitrary native encoding, because that is what char* is designed for.

Even granting that we live with the inappropriate defaults, the only admissible choice is to assume char* is UTF-8, not Latin-1. That would be acceptable because many programmers today are familiar with the "assume every string is UTF-8 by default" mindset.

Regarding my use case, the simplest way to use UTF-8 characters is now node->get_node(String::utf8("Path/To/NonLatin1/名前")). I really don't appreciate that form for several reasons:

  1. char* is not Latin1, so the encoding errors regarding Latin1 are nonsense in the first place
  2. It is very annoying to write String::utf8(...) everywhere
  3. Non-English programmers tend to mix Latin1 and non-Latin1 characters, so the current design choice will lead to many unexpected mistakes. Imagine a piece of code that was originally node->get_node("Path/To/Name"): when you change it to node->get_node("Path/To/名前"), it fails, and it is very hard to arrive at the solution (String::utf8(...)) from that error.

saki7 avatar Jul 11 '22 19:07 saki7

I agree that using UTF-8 is better, but this change should be done in both engine and gdextension at the same time (which will involve a lot of changes to the engine).

In practice, a function which receives char* should successfully parse an arbitrary native encoding characters because that is what char* is designed for.

Reliably detecting the native char * encoding is impossible in most cases, so it definitely can't be expected.

char* is not Latin1, so the encoding errors regarding Latin1 is nonsense in the first place

This particular code was originally designed to only handle ASCII, without any checks; it's called "Latin-1" for the sole reason that Latin-1 characters are identical to the first 256 code points of Unicode.

Non-English programmers tend to mix Latin1 and non-Latin1 characters so the current design choice would lead to many unexpected mistakes.

This is a valid concern; this kind of mistake has happened in the engine many times.

bruvzg avatar Jul 11 '22 19:07 bruvzg

Should we ping other committers? Since I'm not familiar with the organization I'm not sure how we could reach a consensus.

saki7 avatar Jul 11 '22 19:07 saki7

Should we ping other committers? Since I'm not familiar with the organization I'm not sure how we could reach a consensus.

I do not think it's something big enough to require a formal proposal (I'm pretty sure it was proposed multiple times before, and no one opposed it, but no one bothered to implement it either).

Anyway, I have already added it to my TODO list.

bruvzg avatar Jul 11 '22 19:07 bruvzg

In practice, a function which receives char* should successfully parse an arbitrary native encoding characters because that is what char* is designed for.

Reliably detecting native char * encoding is impossible in most cases, so it definitely can't be expected.

Yes, extending godot core's String::copy_from(const char *p_cstr) this way would require a lot of rework. It's not impossible, though; there are portable ways to convert between char* and other variants, for example std::mbrtoc8 (C++20) and ICU's UnicodeString::UnicodeString(const char*).

Ideally we should change the implementation to use the portable converters for char* but personally I'll be satisfied with UTF-8 being the default.

saki7 avatar Jul 11 '22 20:07 saki7

It's not impossible though; there exist some portable ways to convert between char* and other variants, for example std::mbrtoc8 (C++20) and ICU's UnicodeString::UnicodeString(const char*).

Functions like this rely on the current system locale, but a char * is not necessarily in the current locale's encoding (Godot uses MultiByteToWideChar to parse command-line output from started processes on Windows, since the current locale is a good assumption for that case). Personally, I do not see any reason to implement support for non-Unicode encodings.

bruvzg avatar Jul 11 '22 20:07 bruvzg