
Utf16 strings codepoint iteration and appending

Open ceztko opened this issue 3 years ago • 1 comments

Two important features are missing: the ability to iterate directly over the code points of a UTF-16 string, and the ability to append code points to an existing UTF-16 encoded string. For iterating over code points, one implementation can be found in the ICU documentation [1][2].

[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utf16_8h.html#a844bb48486904fdca40c8b883e9c80ee

[2] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utf16_8h.html#ae98a64ae0f42bc6ad4179293c3638be4
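To make the request concrete, here is a minimal sketch of code point iteration over a UTF-16 string. The names `next16` and `code_points` are hypothetical and not part of utfcpp. The handling of unpaired surrogates (returned unchanged) is an assumption modelled on the lenient behaviour of ICU's `U16_NEXT`, not a statement of how utfcpp should behave.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of utfcpp): reads one code point starting at
// s[i] and advances i past it. Precondition: i < s.size().
// An unpaired surrogate is returned as-is rather than reported as an error.
inline char32_t next16(const std::u16string& s, std::size_t& i)
{
    char32_t cp = s[i++];
    if (cp >= 0xd800u && cp <= 0xdbffu && i < s.size()) {   // lead surrogate
        const char32_t trail = s[i];
        if (trail >= 0xdc00u && trail <= 0xdfffu) {         // followed by a trail surrogate
            ++i;
            cp = 0x10000u + ((cp - 0xd800u) << 10) + (trail - 0xdc00u);
        }
    }
    return cp;
}

// Hypothetical convenience wrapper: collects every code point of a UTF-16 string.
inline std::vector<char32_t> code_points(const std::u16string& s)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();)
        out.push_back(next16(s, i));
    return out;
}
```

A checked variant would instead report unpaired surrogates as errors; which behaviour is preferable depends on whether the input is trusted.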

ceztko avatar Jun 15 '22 06:06 ceztko

To append code points to existing UTF-16 strings, I am currently testing the following functions:

    template <typename word_iterator>
    word_iterator append16(uint32_t cp, word_iterator result)
    {
        if (!utf8::internal::is_code_point_valid(cp))
            throw invalid_code_point(cp);

        if (cp < 0x10000u) {                    // one word
            *(result++) = static_cast<uint16_t>(cp);
        }
        else {                                  // two words
            uint32_t cp_1 = cp - 0x10000u;
            *(result++) = static_cast<uint16_t>(cp_1 / 0x400u + 0xd800u);
            *(result++) = static_cast<uint16_t>(cp_1 % 0x400u + 0xdc00u);
        }

        return result;
    }

    namespace unchecked
    {
        template <typename word_iterator>
        word_iterator append16(uint32_t cp, word_iterator result)
        {
            if (cp < 0x10000u) {                    // one word
                *(result++) = static_cast<uint16_t>(cp);
            }
            else {                                  // two words
                uint32_t cp_1 = cp - 0x10000u;
                *(result++) = static_cast<uint16_t>(cp_1 / 0x400u + 0xd800u);
                *(result++) = static_cast<uint16_t>(cp_1 % 0x400u + 0xdc00u);
            }

            return result;
        }
    }

    inline void append(char32_t cp, std::u16string& s)
    {
        append16(uint32_t(cp), std::back_inserter(s));
    }
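As a worked example of the two-word branch above: for U+1F600, subtracting 0x10000 leaves 0xF600; dividing by 0x400 gives 0x3D and the remainder is 0x200, so the encoded words are 0xD83D and 0xDE00. The standalone sketch below restates that arithmetic so it can be checked in isolation. The name `encode16` is hypothetical, and unlike the checked `append16` it performs no validation.

```cpp
#include <cstdint>
#include <string>

// Hypothetical standalone restatement of the arithmetic in append16.
// No validation: the caller is assumed to pass a valid Unicode scalar value.
inline std::u16string encode16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000u) {                                    // one word
        out.push_back(static_cast<char16_t>(cp));
    }
    else {                                                  // two words
        const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000u;  // 20-bit value
        out.push_back(static_cast<char16_t>(v / 0x400u + 0xd800u));         // high 10 bits
        out.push_back(static_cast<char16_t>(v % 0x400u + 0xdc00u));         // low 10 bits
    }
    return out;
}
```

Dividing and taking the remainder by 0x400 is equivalent to splitting the 20-bit value into its high and low 10 bits with a shift and a mask.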

ceztko avatar Jun 15 '22 08:06 ceztko

Planned for release 4.0. Thanks for the proposal.

nemtrif avatar Dec 29 '22 00:12 nemtrif

Thank you. In the meantime, I noticed that it is quite easy to iterate over the code points of UTF-16 strings using the existing facilities; I did so in podofo. Could you also add a validate_next-like function that reads from UTF-16 content?
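For illustration, a checked reader of the kind requested above might look roughly like the following sketch. The names `utf16_status` and `validate_next16` and the set of status values are assumptions made for this example, not utfcpp's actual API.

```cpp
#include <cstdint>

// Hypothetical status codes, chosen for this sketch only.
enum class utf16_status { ok, not_enough_room, invalid_lead, incomplete_sequence };

// Hypothetical checked reader: on success stores the decoded code point and
// advances `it` past it; on failure leaves `it` unchanged.
template <typename word_iterator>
utf16_status validate_next16(word_iterator& it, word_iterator end, std::uint32_t& code_point)
{
    if (it == end)
        return utf16_status::not_enough_room;

    const std::uint32_t first = static_cast<std::uint16_t>(*it);
    if (first >= 0xdc00u && first <= 0xdfffu)       // trail surrogate without a lead
        return utf16_status::invalid_lead;

    if (first < 0xd800u || first > 0xdbffu) {       // not a surrogate: one word
        code_point = first;
        ++it;
        return utf16_status::ok;
    }

    word_iterator next = it;                        // lead surrogate: a trail must follow
    ++next;
    if (next == end)
        return utf16_status::not_enough_room;

    const std::uint32_t second = static_cast<std::uint16_t>(*next);
    if (second < 0xdc00u || second > 0xdfffu)
        return utf16_status::incomplete_sequence;

    code_point = 0x10000u + ((first - 0xd800u) << 10) + (second - 0xdc00u);
    it = ++next;
    return utf16_status::ok;
}
```

Leaving the iterator untouched on failure lets the caller decide whether to skip the offending word, substitute a replacement character, or stop.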

ceztko avatar Feb 13 '23 22:02 ceztko

Fixed in release 4.0.0

nemtrif avatar Oct 22 '23 20:10 nemtrif