<locale>: Empty locale name and UTF-8 issues
Describe the bug
This was reported at Developer Community, where several issues were mentioned.
- Issue number 1. According to the reporter of these issues, the locale name shouldn't be empty. I don't know what's correct, but other compilers return "C".
- Issue number 2. Microsoft's implementation throws "bad locale name" for valid locales that end with ".utf8" or "UTF-8". According to the original reporter, issue number 2 was a regression in your libraries and worked in version 15.8.0.
- Issue number 3. codecvt_byname::in is reported to return an error, so the Unicode conversion fails. I couldn't reproduce this, but the result from FromNarrowString doesn't compare equal to wideStr if you initialise deletable_facet with "en_US.UTF-8", so the test still fails.
Command-line test case
C:\Temp>type repro.cpp
#include <iostream>
#include <locale>
#include <type_traits>
#include <exception>
#include <stdexcept> // std::runtime_error
#include <string>    // std::string, std::wstring
using namespace std;
namespace
{
    // codecvt_byname has a protected destructor; this wrapper makes it destructible.
    struct deletable_facet : public codecvt_byname<wchar_t, char, mbstate_t>
    {
        deletable_facet(const std::string& name) : codecvt_byname<wchar_t, char, mbstate_t>(name.c_str()) { }
        ~deletable_facet() = default;
    };
}
wstring FromNarrowString(const char* from, const char* to, const locale& l)
{
    const deletable_facet cvt{ l.name() };
    mbstate_t mbstate{};
    const size_t externalSize = to - from;
    wstring resultWStr(externalSize, L'\0');
    const char* from_next;
    wchar_t* to_next;
    // Issue number 3, cvt.in returns an error.
    codecvt_base::result result = cvt.in(mbstate, from, to, from_next, &resultWStr[0], &resultWStr[resultWStr.size()], to_next);
    if (result != codecvt_base::ok)
    {
        throw std::runtime_error("Error converting locale multibyte string to UNICODE");
    }
    resultWStr.resize(to_next - &resultWStr[0]);
    return resultWStr;
}
int main()
{
    // Issue number 1. The locale name should not be empty according to the reporter. Other compilers return "C".
    if (std::locale("").name().empty())
    {
        std::cout << "locale name should not be empty\n";
        return -1;
    }
    // Issue number 2. Microsoft's STL throws "bad locale name" for valid locales that end with ".utf8" or "UTF-8".
    for (const char* localName : { "en_US.utf8", "en_US.UTF-8" })
    {
        try
        {
            const string localMBString = "\x7a\xc3\x9f\xe6\xb0\xb4\xf0\x9d\x84\x8b";
            const wstring wideStr = L"zß水𝄋";
            locale l{ localName };
            if (FromNarrowString(localMBString.c_str(), localMBString.c_str() + localMBString.length(), l) != wideStr)
                return -1;
        }
        catch (const std::exception& ex)
        {
            std::cout << ex.what() << std::endl;
            return -1;
        }
    }
    std::cout << "success\n";
    return 0;
}
C:\temp>cl /EHsc /W4 /WX .\repro.cpp
Microsoft (R) C/C++ Optimizing Compiler Version 19.27.29009.1 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
repro.cpp
Microsoft (R) Incremental Linker Version 14.27.29009.1
Copyright (C) Microsoft Corporation. All rights reserved.
/out:repro.exe
repro.obj
C:\temp>repro
locale name should not be empty
Expected behavior
It should print success. That's what GCC does.
STL version
Microsoft Visual Studio Community 2019 Preview Version 16.7.0 Preview 3.1
Additional context
Also tracked by DevCom-330322 and VSO-679264 / AB#679264.
Issue number 2. Microsoft's implementation throws "bad locale name" for valid locales that end with ".utf8" or "UTF-8". According to the original reporter, issue number 2 was a regression in your libraries and worked in version 15.8.0.
I believe this never worked; we just got better about reporting the error. I know there were cases before where we didn't correctly report the error from the underlying call to setlocale, so you didn't get UTF-8; you got whatever the default encoding was. If I understand correctly, our CRT only supports UTF-8 behind compatibility flags / manifests, but I don't know the exact details.
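To see whether the underlying runtime accepts a given name at all, one can check the return value of setlocale directly, since a null return is how the C runtime signals rejection. The following is only a minimal sketch: crt_accepts_locale is a hypothetical helper name, not something in the STL or CRT, and which names are accepted depends entirely on the platform and runtime version.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper (not part of any library): returns true if the C
// runtime accepts `name` for LC_ALL. The previous locale is restored
// before returning.
bool crt_accepts_locale(const char* name)
{
    // Passing nullptr queries the current locale without changing it.
    const char* current = std::setlocale(LC_ALL, nullptr);
    const std::string saved = current ? current : "C";

    // A null return means the runtime rejected the requested name.
    const bool accepted = std::setlocale(LC_ALL, name) != nullptr;

    // Put the global C locale back the way we found it.
    std::setlocale(LC_ALL, saved.c_str());
    return accepted;
}
```

Calling crt_accepts_locale("en_US.UTF-8") before constructing std::locale with the same name gives a hint about whether a "bad locale name" exception originates from the runtime rejecting the name.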
Issue number 3. codecvt_byname::in is reported to return an error, so the Unicode conversion fails.
I don't believe we've ever had support for UTF-8 <-> UTF-16 in codecvt_byname. UTF-8 <-> UCS-2 might work. (Note that a correct UTF-8 <-> UTF-16 facet cannot be used with fstream, because basic_filebuf expects a facet that converts one internal character at a time, and surrogate pairs break that.)
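For anyone who needs the conversion from the repro regardless of locale support, a locale-independent decoder is one possible workaround. This is a rough sketch under stated assumptions, not what the STL or GCC does internally: utf8_to_utf16 is a hypothetical name, it returns std::u16string rather than std::wstring, and validation is deliberately minimal (overlong encodings are not rejected).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-16 without consulting any locale.
// Code points above U+FFFF are emitted as surrogate pairs.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (in.size() - i - 1 < extra)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid Unicode code point");

        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Split into a high and a low surrogate.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With the byte string from the repro, the last character (U+1D10B) decodes to two UTF-16 code units, which is exactly the case a one-character-at-a-time facet cannot express.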
It should print success. That's what GCC does.
GCC's standard library has first-class UTF-8 support; ours does not.
other compilers return "C".
They don't; at least Clang with libc++ also returns an empty name:
https://gcc.godbolt.org/z/hnTrbc
But I agree that we should not return an empty locale name.
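A quick way to compare implementations is to query the user-preferred locale's name next to the classic locale's name. This is just a sketch: both helper names are hypothetical, and what user_locale_name returns depends on the implementation and the environment (on some platforms std::locale("") can throw if the environment names an unknown locale, so that case is caught here).

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: name of the user-preferred locale, or a marker
// string if the implementation cannot construct it.
std::string user_locale_name()
{
    try
    {
        return std::locale("").name();
    }
    catch (const std::runtime_error&)
    {
        return "<unavailable>";
    }
}

// Hypothetical helper: name of the classic "C" locale.
std::string classic_locale_name()
{
    return std::locale::classic().name();
}
```

The issue under discussion is the case where user_locale_name() comes back as an empty string.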
We weren't able to merge @Agrael1's PR (see @barcharcraz's review there), but one change may be worth looking into. The following lines might be possible to remove: https://github.com/microsoft/STL/blob/ea092540b4a3347947863a447d820a9905a68c5b/stl/inc/xlocale#L296-L297 However, further investigation is required to fully understand (1) what these lines were attempting to do, and (2) what the behavioral impact of removing them would be. We believe this code has been functionally unchanged for ~20 years, and we don't understand all of its interactions.
Are there any plans to fix the empty locale name issue?