libE57Format UTF-8 File Name test failure on Windows

One test that I added (ReadUmlautFileName) failed on both Linux & Windows. Both fail with:

/home/runner/work/libE57Format/libE57Format/test/src/test_SimpleReader.cpp:60: Failure Failed errno=2 error='No such file or directory' fileName=/home/runner/work/libE57Format/libE57Format/libE57Format-test-data/self/test filename äöü.e57 flags=0 mode=0

The test simply tries to read a file with a name containing umlauts - test filename äöü.e57.

I tried doing some CI-debugging, but I could not find the source of the problem.

Notes:

the file name is encoded in the test source code as test filename \x61\xcc\x88\x6f\xcc\x88\x75\xcc\x88.e57 to avoid potential problems with editors
I encoded it using a utf8 encoder online tool
I used the same technique for the Chinese filename (ReadChineseFileName) and that one works on all platforms
I verified that the file is actually there and that the name looks correct

I have not done any work with UTF-8 filenames on Linux or Windows, so I'm not sure where to look for a solution. Hopefully it's something obvious I'm overlooking.

To try it out, uncomment the ReadUmlautFileName test in test/src/test_SimpleReader.cpp.

Oct 14 '22 15:10 asmaloney

I took another stab at it (see build). Fixes the Linux test, but now breaks the Windows one when reading the Chinese file name 🤦

All my reading about this points at the solution that we have in place for Windows to convert UTF-8 to UTF-16, but that doesn't seem to work (or the UTF-8 encoding we're passing using a u8 string literal is not what it's expecting?). (Also see #26.)

If any Windows people have any insights it would be appreciated.

Oct 19 '22 22:10 asmaloney

I think the problem is probably handling of the string literal by MSVC. Are you setting /utf-8 when building the test? That should fix handling of string literals by MSVC. The file itself should be UTF-8 as well, but I believe that is already the case. Using u8"" instead of "" should then be optional.

https://learn.microsoft.com/en-us/cpp/build/reference/utf-8-set-source-and-executable-character-sets-to-utf-8?view=msvc-170

Oct 31 '22 16:10 ptc-jhoerner

I have checked the code and the actual handling of path name should be correct as far as I can see in both win and posix paths.

Oct 31 '22 16:10 ptc-jhoerner

Are you setting /utf-8 when building the test?

Nope - I'm letting the CI do whatever the CI does with Windows 😆

Seems crazy that it requires a command-line option, but 🤷 This is good though. Thanks for the info!

Do you have a recommendation on how to best treat UTF8 characters in code files? Include it directly (like my second attempt - my preference from a readability standpoint) or \x (or \u ?) encode it as I do in the current tests.

Oct 31 '22 16:10 asmaloney

I think it is better to use them directly. One just need to make sure that:

Files are UTF-8 encoded
/utf-8 is set for MSVC

Then it works across macOS, Linux and Windows and MSVC, clang, gcc.

u8"" vs "" does not matter until c++20, with c++20 u8"" will start being char8_t[] and this will break the test. char8_t has been matter of lot of discussion. char8_t in C++20 has a lot of surprises e.g. std::cout << u8"hello" etc. My take on this is to ignore char8_t and adopt char* == UTF-8 string policy, like this library already does. I found this setup to work the best, especially as one can now configure windows to follow this policy as well. Linux and macOS has been doing this for years already.

Oct 31 '22 17:10 ptc-jhoerner

This sounds reasonable.

I will revive these tests to get that working.

Thanks Jiri!

Oct 31 '22 17:10 asmaloney

...and this is the magic to make it work:


target_compile_options( testE57
	PUBLIC
		$<$<C_COMPILER_ID:MSVC>:/utf-8>
		$<$<CXX_COMPILER_ID:MSVC>:/utf-8>
)

Excellent!

Oct 31 '22 19:10 asmaloney