UTF-16 strings: code point iteration and appending
Two important features are missing: iterating directly over the code points of a UTF-16 string, and appending code points to an existing UTF-16 encoded string. For iterating code points, one implementation can be found in the ICU documentation [1][2].
[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utf16_8h.html#a844bb48486904fdca40c8b883e9c80ee
[2] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utf16_8h.html#ae98a64ae0f42bc6ad4179293c3638be4
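To make the iteration request concrete, here is a minimal sketch of reading code points from a UTF-16 sequence one at a time. The names `next16_sketch` and `code_points_sketch` are hypothetical and belong to neither ICU nor this library. The sketch assumes mostly well-formed input and simply passes an unpaired surrogate through unchanged rather than reporting an error.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not a library API): reads one code point starting at
// `it` and advances `it` past the code units that were consumed.
template <typename u16_iterator>
std::uint32_t next16_sketch(u16_iterator& it, u16_iterator end)
{
    assert(it != end); // the caller must not read past the end
    std::uint32_t unit = static_cast<std::uint16_t>(*it++);
    // A lead surrogate followed by a trail surrogate forms one code point.
    if (unit >= 0xd800u && unit <= 0xdbffu && it != end) {
        std::uint32_t trail = static_cast<std::uint16_t>(*it);
        if (trail >= 0xdc00u && trail <= 0xdfffu) {
            ++it;
            return ((unit - 0xd800u) << 10) + (trail - 0xdc00u) + 0x10000u;
        }
    }
    return unit; // BMP code point, or an unpaired surrogate passed through
}

// Collects every code point of a UTF-16 string.
inline std::vector<std::uint32_t> code_points_sketch(const std::u16string& s)
{
    std::vector<std::uint32_t> out;
    for (auto it = s.begin(); it != s.end();)
        out.push_back(next16_sketch(it, s.end()));
    return out;
}
```

Taking the iterator by reference lets the caller drive the loop without knowing how many code units each code point occupies.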
For appending code points to existing UTF-16 strings, I am currently testing the following functions:
// Checked version: throws if cp is not a valid code point.
template <typename word_iterator>
word_iterator append16(uint32_t cp, word_iterator result)
{
    if (!utf8::internal::is_code_point_valid(cp))
        throw invalid_code_point(cp);
    if (cp < 0x10000u) { // one word
        *(result++) = static_cast<uint16_t>(cp);
    }
    else { // two words (surrogate pair)
        uint32_t cp_1 = cp - 0x10000u;
        *(result++) = static_cast<uint16_t>(cp_1 / 0x400u + 0xd800u); // lead surrogate
        *(result++) = static_cast<uint16_t>(cp_1 % 0x400u + 0xdc00u); // trail surrogate
    }
    return result;
}

namespace unchecked
{
    // Unchecked version: the caller guarantees that cp is valid.
    template <typename word_iterator>
    word_iterator append16(uint32_t cp, word_iterator result)
    {
        if (cp < 0x10000u) { // one word
            *(result++) = static_cast<uint16_t>(cp);
        }
        else { // two words (surrogate pair)
            uint32_t cp_1 = cp - 0x10000u;
            *(result++) = static_cast<uint16_t>(cp_1 / 0x400u + 0xd800u); // lead surrogate
            *(result++) = static_cast<uint16_t>(cp_1 % 0x400u + 0xdc00u); // trail surrogate
        }
        return result;
    }
}

// Convenience overload: appends cp to the end of s using the checked version.
inline void append(char32_t cp, std::u16string& s)
{
    append16(uint32_t(cp), std::back_inserter(s));
}
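The surrogate-pair arithmetic in the proposed functions can be checked by hand on a small input. The following stand-alone sketch repeats the unchecked encoding step under the hypothetical name `append16_sketch`, so that it compiles without the library, and wraps it in an illustrative `to_utf16_sketch` helper. Neither name is part of the proposal.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>

// Stand-alone copy of the unchecked encoding step, for illustration only.
template <typename word_iterator>
word_iterator append16_sketch(std::uint32_t cp, word_iterator result)
{
    assert(cp <= 0x10ffffu); // assumption: the caller passes a valid code point
    if (cp < 0x10000u) { // one word
        *(result++) = static_cast<std::uint16_t>(cp);
    }
    else { // two words (surrogate pair)
        std::uint32_t cp_1 = cp - 0x10000u;
        *(result++) = static_cast<std::uint16_t>(cp_1 / 0x400u + 0xd800u);
        *(result++) = static_cast<std::uint16_t>(cp_1 % 0x400u + 0xdc00u);
    }
    return result;
}

// Hypothetical helper: encodes a whole UTF-32 string as UTF-16.
inline std::u16string to_utf16_sketch(const std::u32string& in)
{
    std::u16string out;
    for (char32_t cp : in)
        append16_sketch(static_cast<std::uint32_t>(cp), std::back_inserter(out));
    return out;
}
```

For U+1F600, `cp_1` is 0xF600, so the lead surrogate is 0xD800 + 0x3D = 0xD83D and the trail surrogate is 0xDC00 + 0x200 = 0xDE00.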
Planned for release 4.0. Thanks for the proposal.
Thank you. In the meantime, I noticed that it is quite easy to iterate over the code points of UTF-16 strings using the existing facilities. I did so in PoDoFo. Could you also add a validate_next-like function that reads from UTF-16 content?
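As a rough illustration of what such a function might look like, here is a sketch of a validating reader for UTF-16. The names `validate_next16_sketch`, `utf16_error_sketch`, and `is_valid_utf16_sketch` are hypothetical, as is the choice of error codes. This is not the library's API, only one possible shape for the request.

```cpp
#include <cstdint>
#include <string>

// Hypothetical error codes, used by this sketch only.
enum class utf16_error_sketch { ok, not_enough_room, unexpected_trail, unpaired_lead };

// Hypothetical validate_next-style helper for UTF-16. On success it stores the
// decoded code point in `cp` and advances `it`; on failure `it` is unchanged.
template <typename u16_iterator>
utf16_error_sketch validate_next16_sketch(u16_iterator& it, u16_iterator end,
                                          std::uint32_t& cp)
{
    if (it == end)
        return utf16_error_sketch::not_enough_room;
    u16_iterator cur = it;
    std::uint32_t unit = static_cast<std::uint16_t>(*cur++);
    if (unit >= 0xdc00u && unit <= 0xdfffu)
        return utf16_error_sketch::unexpected_trail; // trail without a lead
    if (unit >= 0xd800u && unit <= 0xdbffu) {
        if (cur == end)
            return utf16_error_sketch::not_enough_room; // input ends mid-pair
        std::uint32_t trail = static_cast<std::uint16_t>(*cur++);
        if (trail < 0xdc00u || trail > 0xdfffu)
            return utf16_error_sketch::unpaired_lead; // lead without a trail
        unit = ((unit - 0xd800u) << 10) + (trail - 0xdc00u) + 0x10000u;
    }
    cp = unit;
    it = cur;
    return utf16_error_sketch::ok;
}

// True when every code unit of `s` belongs to a well-formed sequence.
inline bool is_valid_utf16_sketch(const std::u16string& s)
{
    std::uint32_t cp = 0;
    for (auto it = s.begin(); it != s.end();)
        if (validate_next16_sketch(it, s.end(), cp) != utf16_error_sketch::ok)
            return false;
    return true;
}
```

Leaving the iterator untouched on failure lets the caller decide whether to stop, skip a code unit, or substitute a replacement character.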
Fixed in release 4.0.0