utfcpp: UTF-16 string code point iteration and appending

Two important missing features are the ability to iterate directly over the code points of a UTF-16 string and the ability to append code points to an existing UTF-16 encoded string. For iterating code points, one implementation can be found in the ICU documentation [1][2].

[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utf16_8h.html#a844bb48486904fdca40c8b883e9c80ee
[2] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utf16_8h.html#ae98a64ae0f42bc6ad4179293c3638be4

Jun 15 '22 06:06 ceztko

To append code points to existing UTF-16 strings, I am currently testing the following functions:

    template <typename word_iterator>
    word_iterator append16(uint32_t cp, word_iterator result)
    {
        if (!utf8::internal::is_code_point_valid(cp))
            throw invalid_code_point(cp);

        if (cp < 0x10000u) {                    // one word
            *(result++) = static_cast<uint16_t>(cp);
        }
        else {                                  // two words
            uint32_t cp_1 = cp - 0x10000u;
            *(result++) = static_cast<uint16_t>(cp_1 / 0x400u + 0xd800u);
            *(result++) = static_cast<uint16_t>(cp_1 % 0x400u + 0xdc00u);
        }

        return result;
    }

    namespace unchecked
    {
        template <typename word_iterator>
        word_iterator append16(uint32_t cp, word_iterator result)
        {
            if (cp < 0x10000u) {                    // one word
                *(result++) = static_cast<uint16_t>(cp);
            }
            else {                                  // two words
                uint32_t cp_1 = cp - 0x10000u;
                *(result++) = static_cast<uint16_t>(cp_1 / 0x400u + 0xd800u);
                *(result++) = static_cast<uint16_t>(cp_1 % 0x400u + 0xdc00u);
            }

            return result;
        }
    }

    inline void append(char32_t cp, std::u16string& s)
    {
        append16(uint32_t(cp), std::back_inserter(s));
    }
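
To make the two-word branch easier to check, here is a small sketch of the same surrogate arithmetic as `constexpr` helpers. The names `lead_surrogate` and `trail_surrogate` are hypothetical and not part of utfcpp; they only repeat the `/ 0x400u` and `% 0x400u` steps from the code above so the results can be verified at compile time.

```cpp
#include <cstdint>

// Hypothetical helpers (not utfcpp API): same arithmetic as the
// "two words" branch above, valid only for cp >= 0x10000.
constexpr std::uint16_t lead_surrogate(std::uint32_t cp)
{
    return static_cast<std::uint16_t>((cp - 0x10000u) / 0x400u + 0xd800u);
}

constexpr std::uint16_t trail_surrogate(std::uint32_t cp)
{
    return static_cast<std::uint16_t>((cp - 0x10000u) % 0x400u + 0xdc00u);
}

// U+1F600: cp - 0x10000 = 0xF600; 0xF600 / 0x400 = 0x3D; 0xF600 % 0x400 = 0x200
static_assert(lead_surrogate(0x1F600u) == 0xD83Du, "lead word of U+1F600");
static_assert(trail_surrogate(0x1F600u) == 0xDE00u, "trail word of U+1F600");

// U+10FFFF, the largest code point, maps to the last surrogate pair
static_assert(lead_surrogate(0x10FFFFu) == 0xDBFFu, "lead word of U+10FFFF");
static_assert(trail_surrogate(0x10FFFFu) == 0xDFFFu, "trail word of U+10FFFF");
```

Dividing by 0x400 is the same as shifting right by 10 bits, so each surrogate carries 10 of the 20 payload bits.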

Jun 15 '22 08:06 ceztko

Planned for release 4.0. Thanks for the proposal.

Dec 29 '22 00:12 nemtrif

Thank you. In the meantime I noticed that it is quite easy to iterate over the code points of UTF-16 strings using the existing facilities; I did so in podofo. Could you also add a validate_next-like function that reads from UTF-16 content?
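
For illustration, a rough sketch of what such a function could look like. Everything here is an assumption made for the example: the name `validate_next16`, the `utf16_error` enum and its values are hypothetical and are not taken from utfcpp, and the function assumes a forward iterator so that `it` can be left untouched on failure.

```cpp
#include <cstdint>
#include <string>

// Hypothetical error codes for this sketch; utfcpp's own enum may differ.
enum class utf16_error { ok, not_enough_room, invalid_lead, incomplete_sequence };

// Hypothetical helper: reads one code point from [it, end), stores it in
// `code_point`, and advances `it` only when decoding succeeds.
template <typename word_iterator>
utf16_error validate_next16(word_iterator& it, word_iterator end, std::uint32_t& code_point)
{
    if (it == end)
        return utf16_error::not_enough_room;

    word_iterator cur = it;
    const std::uint32_t first = static_cast<std::uint16_t>(*cur++);
    if (first < 0xd800u || first > 0xdfffu) {       // one word
        code_point = first;
    }
    else {                                          // two words expected
        if (first > 0xdbffu)                        // trail surrogate came first
            return utf16_error::invalid_lead;
        if (cur == end)                             // lead surrogate at end of input
            return utf16_error::not_enough_room;
        const std::uint32_t second = static_cast<std::uint16_t>(*cur++);
        if (second < 0xdc00u || second > 0xdfffu)   // lead not followed by a trail
            return utf16_error::incomplete_sequence;
        code_point = 0x10000u + ((first - 0xd800u) << 10) + (second - 0xdc00u);
    }

    it = cur;
    return utf16_error::ok;
}
```

Calling it in a loop until `it == end`, and stopping at the first result other than `utf16_error::ok`, gives code point iteration over a `std::u16string` without throwing.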

Feb 13 '23 22:02 ceztko

Fixed in release 4.0.0

Oct 22 '23 20:10 nemtrif