UTF-8 File Name test failure on Windows
One test that I added (ReadUmlautFileName) failed on both Linux & Windows. Both fail with:
/home/runner/work/libE57Format/libE57Format/test/src/test_SimpleReader.cpp:60: Failure Failed errno=2 error='No such file or directory' fileName=/home/runner/work/libE57Format/libE57Format/libE57Format-test-data/self/test filename äöü.e57 flags=0 mode=0
The test simply tries to read a file with a name containing umlauts - test filename äöü.e57.
I tried doing some CI-debugging, but I could not find the source of the problem.
Notes:
- the file name is encoded in the test source code as
test filename \x61\xcc\x88\x6f\xcc\x88\x75\xcc\x88.e57to avoid potential problems with editors - I encoded it using a utf8 encoder online tool
- I used the same technique for the Chinese filename (
ReadChineseFileName) and that one works on all platforms - I verified that the file is actually there and that the name looks correct
I have not done any work with UTF-8 filenames on Linux or Windows, so I'm not sure where to look for a solution. Hopefully it's something obvious I'm overlooking.
To try it out, uncomment the ReadUmlautFileName test in test/src/test_SimpleReader.cpp.
I took another stab at it (see build). Fixes the Linux test, but now breaks the Windows one when reading the Chinese file name 🤦
All my reading about this points at the solution that we have in place for Windows to convert UTF-8 to UTF-16, but that doesn't seem to work (or the UTF-8 encoding we're passing using a u8 string literal is not what it's expecting?). (Also see #26.)
If any Windows people have any insights it would be appreciated.
I think the problem is probably handling of the string literal by MSVC. Are you setting /utf-8 when building the test? That should fix handling of string literals by MSVC. The file itself should be UTF-8 as well, but I believe that is already the case. Using u8"" instead of "" should then be optional.
https://learn.microsoft.com/en-us/cpp/build/reference/utf-8-set-source-and-executable-character-sets-to-utf-8?view=msvc-170
I have checked the code and the actual handling of path name should be correct as far as I can see in both win and posix paths.
Are you setting /utf-8 when building the test?
Nope - I'm letting the CI do whatever the CI does with Windows 😆
Seems crazy that it requires a command-line option, but 🤷 This is good though. Thanks for the info!
Do you have a recommendation on how to best treat UTF8 characters in code files? Include it directly (like my second attempt - my preference from a readability standpoint) or \x (or \u ?) encode it as I do in the current tests.
I think it is better to use them directly. One just need to make sure that:
- Files are UTF-8 encoded
-
/utf-8is set for MSVC
Then it works across macOS, Linux and Windows and MSVC, clang, gcc.
u8"" vs "" does not matter until c++20, with c++20 u8"" will start being char8_t[] and this will break the test. char8_t has been matter of lot of discussion. char8_t in C++20 has a lot of surprises e.g. std::cout << u8"hello" etc. My take on this is to ignore char8_t and adopt char* == UTF-8 string policy, like this library already does. I found this setup to work the best, especially as one can now configure windows to follow this policy as well. Linux and macOS has been doing this for years already.
This sounds reasonable.
I will revive these tests to get that working.
Thanks Jiri!
...and this is the magic to make it work:
target_compile_options( testE57
PUBLIC
$<$<C_COMPILER_ID:MSVC>:/utf-8>
$<$<CXX_COMPILER_ID:MSVC>:/utf-8>
)
Excellent!