
<locale>: Empty locale name and UTF-8 issues

Open Amaroker opened this issue 5 years ago • 4 comments

Describe the bug

This was reported at Developer Community, where several issues were mentioned.

  • Issue number 1. According to the reporter of these issues, the locale name shouldn't be empty. I don't know what's correct, but other compilers return "C".
  • Issue number 2. Microsoft's implementation throws "bad locale name" for valid locales that end with ".utf8" or "UTF-8". According to the original reporter, issue number 2 was a regression in your libraries and worked in version 15.8.0.
  • Issue number 3. codecvt_byname::in is reported to return an error, so the Unicode conversion fails. I couldn't reproduce this, but the result from FromNarrowString doesn't compare equal to wideStr if you initialise deletable_facet with "en_US.UTF-8", so the test still fails.
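
To see the first two issues independently of each other, a small probe along these lines may help. This is only a sketch: `LocaleProbe` and `probe_locale` are hypothetical names that are not part of the original report, and it relies only on `std::locale(const char*)` throwing `std::runtime_error` for names the implementation rejects.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical result type: whether the locale could be constructed,
// and the name the implementation reports for it.
struct LocaleProbe
{
    bool constructed;
    std::string name;
};

// Hypothetical helper: tries to construct a locale from `requested`
// and records the reported name instead of letting the exception escape.
LocaleProbe probe_locale(const char* requested)
{
    try
    {
        const std::locale loc(requested);
        return { true, loc.name() };
    }
    catch (const std::runtime_error&)
    {
        // For example "bad locale name" on implementations that reject it.
        return { false, std::string() };
    }
}
```

Calling `probe_locale("")`, `probe_locale("en_US.utf8")` and `probe_locale("en_US.UTF-8")` and printing the results shows whether a given implementation reports an empty name or rejects the UTF-8 names. The outcome depends on the platform and on the locales installed there.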

Command-line test case

C:\Temp>type repro.cpp
#include <cwchar>    // mbstate_t
#include <exception>
#include <iostream>
#include <locale>
#include <stdexcept> // std::runtime_error
#include <string>

using namespace std;
namespace
{
    struct deletable_facet : public codecvt_byname<wchar_t, char, mbstate_t>
    {
        deletable_facet(const std::string& name) : codecvt_byname<wchar_t, char, mbstate_t>(name.c_str()) { }
        ~deletable_facet() = default;
    };
}

wstring FromNarrowString(const char* from, const char* to, const locale& l)
{
    const deletable_facet cvt{ l.name() };
    mbstate_t mbstate{};
    const size_t externalSize = to - from;
    wstring resultWStr(externalSize, '\0');
    const char* from_next; wchar_t* to_next;

    // Issue number 3, cvt.in returns an error.
    codecvt_base::result result = cvt.in(mbstate, from, to, from_next, &resultWStr[0], &resultWStr[resultWStr.size()], to_next);
    if (result != codecvt_base::ok)
    {
        throw std::runtime_error("Error converting locale multibyte string to UNICODE");
    }
    resultWStr.resize(to_next - &resultWStr[0]);
    return resultWStr;
}

int main()
{
    // Issue number 1. The locale name should not be empty according to the reporter. Other compilers return "C".
    if (std::locale("").name().empty())
    {
        std::cout << "locale name should not be empty\n";
        return -1;
    }

    // Issue number 2. Microsoft's STL throws "bad locale name" for valid locales that end with ".utf8" or "UTF-8".
    for (const char* localName : { "en_US.utf8", "en_US.UTF-8" })
    {
        try
        {
            const string localMBString = "\x7a\xc3\x9f\xe6\xb0\xb4\xf0\x9d\x84\x8b";
            const wstring wideStr = L"zß水𝄋";

            locale l{ localName };
            if (FromNarrowString(localMBString.c_str(), localMBString.c_str() + localMBString.length(), l) != wideStr)
                return -1;
        }
        catch (const std::exception& ex)
        {
            std::cout << ex.what() << std::endl;
            return -1;
        }
    }

    std::cout << "success\n";
    return 0;
}

C:\temp>cl /EHsc /W4 /WX .\repro.cpp
Microsoft (R) C/C++ Optimizing Compiler Version 19.27.29009.1 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

repro.cpp
Microsoft (R) Incremental Linker Version 14.27.29009.1
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:repro.exe
repro.obj

C:\temp>repro
locale name should not be empty

Expected behavior

It should print success. That's what GCC does.

STL version

Microsoft Visual Studio Community 2019 Preview Version 16.7.0 Preview 3.1

Additional context

Also tracked by DevCom-330322 and VSO-679264 / AB#679264.

Amaroker, Jul 11 '20 13:07

Issue number 2. Microsoft's implementation throws "bad locale name" for valid locales that end with ".utf8" or "UTF-8". According to the original reporter, issue number 2 was a regression in your libraries and worked in version 15.8.0.

I believe this never worked; we just got better about reporting the error. I know there were cases before where we didn't correctly report the error from the underlying call to setlocale, so you didn't get UTF-8, you got whatever the default encoding was. If I understand correctly, our CRT only supports UTF-8 behind compatibility flags / manifests, but I don't know the exact details.
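
As a rough way to check what the CRT itself accepts, the sketch below calls `std::setlocale` directly and restores the previous locale afterwards. `crt_accepts_locale` is a hypothetical helper name, and whether the STL's locale constructor ends up making an equivalent call is an assumption based on the description above.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: asks the C runtime whether it accepts a locale name,
// then restores whatever global C locale was active before the call.
bool crt_accepts_locale(const char* name)
{
    // Querying with a null pointer returns the current locale string.
    const char* current = std::setlocale(LC_ALL, nullptr);
    const std::string saved = current ? current : "C";

    // setlocale returns a null pointer if the request cannot be honored.
    const bool accepted = std::setlocale(LC_ALL, name) != nullptr;

    std::setlocale(LC_ALL, saved.c_str());
    return accepted;
}
```

If `crt_accepts_locale("en_US.UTF-8")` returns false on a given system, that would suggest the rejection comes from the CRT rather than from `<locale>` itself.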

Issue number 3. codecvt_byname::in is reported to return an error and so the unicode conversion fails.

I don't believe we've ever had support for UTF-8 <-> UTF-16 in codecvt_byname. UTF-8 <-> UCS-2 might work. (Note that a correct UTF-8 <-> UTF-16 facet cannot be used with fstream.)
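
The UTF-16 versus UCS-2 distinction matters for the repro's input because its last character, U+1D10B, lies outside the Basic Multilingual Plane and so needs a surrogate pair when `wchar_t` is 16 bits wide, as it is on Windows. The sketch below uses hypothetical helpers (`decode_utf8`, `utf16_units`) and assumes well-formed input, with no validation:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes well-formed UTF-8 into code points.
// Sketch only: malformed or overlong sequences are not detected.
std::vector<char32_t> decode_utf8(const std::string& bytes)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < bytes.size();)
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 1;
        char32_t cp = lead;
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }

        // Fold in the payload bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < bytes.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);

        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical helper: number of UTF-16 code units needed for the code points.
// Anything above U+FFFF takes a surrogate pair, which UCS-2 cannot express.
std::size_t utf16_units(const std::vector<char32_t>& code_points)
{
    std::size_t units = 0;
    for (const char32_t cp : code_points)
        units += (cp > 0xFFFF) ? 2 : 1;
    return units;
}
```

For the ten bytes used in the test case this gives four code points but five UTF-16 code units, which a UCS-2-only conversion cannot produce.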

It should print success. That's what GCC does.

GCC's standard library has UTF-8 support as a first class thing, ours does not.

BillyONeal, Sep 28 '20 18:09

other compilers return "C".

They don't; at least Clang with libc++ also returns an empty name:

https://gcc.godbolt.org/z/hnTrbc

But I agree that we should not return an empty locale name.

fsb4000, Oct 01 '20 08:10

We weren't able to merge @Agrael1's PR (see @barcharcraz's review there), but one change may be worth looking into. It might be possible to remove the following lines:

https://github.com/microsoft/STL/blob/ea092540b4a3347947863a447d820a9905a68c5b/stl/inc/xlocale#L296-L297

However, further investigation is required to fully understand (1) what these lines were attempting to do, and (2) what the behavioral impact of removing them would be. We believe this code has been functionally unchanged for ~20 years, and we don't understand all of its interactions.

StephanTLavavej, Nov 09 '22 22:11

Are there any plans to fix the empty locale name issue?

ChrisHal, Apr 30 '25 08:04