boa Full UTF-16 support

Something I noticed when working on the string iterator (#704) is that strings are currently stored as ordinary Rust strings, which use UTF-8. However, the JavaScript standard specifies that strings should use UTF-16. There are a few places where this difference is noticeable. For example, the string 🙂 should have length 2 as it is made up of two code units, but currently shows as having length 1. Similarly, "🙂".charAt(0) should return some representation of the first code unit, which has a value of \ud83d and cannot be represented as a normal Rust string.
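To make the difference concrete, here is a minimal sketch in plain Rust using only the standard library. The helper names `js_length` and `code_unit_at` are made up for illustration and are not part of Boa.

```rust
/// Length as JavaScript defines it: the number of UTF-16 code units.
/// Hypothetical helper, not Boa's API.
fn js_length(s: &str) -> usize {
    s.encode_utf16().count()
}

/// The code unit at `index`, which is roughly what `charAt` has to work with.
/// It returns a raw `u16` because a lone surrogate is not a valid `char`.
fn code_unit_at(s: &str, index: usize) -> Option<u16> {
    s.encode_utf16().nth(index)
}
```

With these, `js_length("🙂")` is 2 while `"🙂".chars().count()` is 1, and `code_unit_at("🙂", 0)` is `Some(0xD83D)`. Trying to turn that single unit back into a Rust string with `String::from_utf16(&[0xD83D])` returns an error, which is the representation problem described above.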

There are a few ways this could be implemented:

- Continue storing strings as Rust strings, and use str::encode_utf16() where UTF-16 semantics are needed. This has the downside that certain operations, such as getting the length, now have to iterate through the whole string. It also complicates storing individual code units, such as the result of "🙂".charAt(0).
- Use the widestring crate or similar. Strings would then be stored as U16String, and only converted to Rust strings for display.
- Just store arrays of u16s. This probably wouldn't be a good idea, as we'd likely end up reimplementing a lot of the functionality of the widestring crate.
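As a rough illustration of what the UTF-16 backed options involve, here is a sketch of a bare `Vec<u16>` wrapper. The `Utf16String` type and its methods are hypothetical. This is neither Boa's string type nor the widestring API, just the smallest thing that shows the trade-off.

```rust
/// Hypothetical UTF-16 backed string (an assumption for illustration only).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Utf16String {
    units: Vec<u16>,
}

impl From<&str> for Utf16String {
    /// Transcode once, when the string is created.
    fn from(s: &str) -> Self {
        Self { units: s.encode_utf16().collect() }
    }
}

impl Utf16String {
    /// O(1): the length is just the number of stored code units.
    fn len(&self) -> usize {
        self.units.len()
    }

    /// A single code unit, which may be a lone surrogate.
    fn code_unit_at(&self, index: usize) -> Option<u16> {
        self.units.get(index).copied()
    }
}

impl std::fmt::Display for Utf16String {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        // Convert to a Rust string only for display.
        // Unpaired surrogates are replaced with U+FFFD here.
        f.write_str(&String::from_utf16_lossy(&self.units))
    }
}
```

Length and indexing become cheap, but everything beyond that (searching, slicing, case conversion, and so on) would have to be written by hand, which is the reimplementation concern raised for the plain-array option.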

Sep 29 '20 18:09 joshwd36

I had a bit of a go at this and it's going to be quite challenging, especially with performance. There are a number of things that are only implemented for str or String, such as parsing, regex, and JSON, all of which would need costly conversions between UTF-8 and UTF-16. This would suggest it might be best to just use Rust strings, but then we have the issue of storing invalid strings, which JavaScript supports.
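The "invalid strings" problem can be shown with the standard library alone. This is only a sketch, and the function names are made up for the example.

```rust
/// Strict conversion: fails if the code units contain an unpaired surrogate.
fn to_rust_string(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Lossy conversion: always succeeds, but replaces each unpaired
/// surrogate with U+FFFD, so the original value cannot be recovered.
fn to_rust_string_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

A JavaScript string consisting of the single code unit 0xD83D is legal, but `to_rust_string(&[0xD83D])` returns `None`, and the lossy version silently changes the contents. Neither is acceptable as the engine's storage format if lone surrogates have to round-trip.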

Jan 18 '21 23:01 joshwd36

We might need to research how other engines deal with this issue @joshwd36, thanks for looking into it. What was the outcome with the widestring crate?

you may find https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/ interesting

Jan 19 '21 00:01 jasonwilliams

What was the outcome with the widestring crate?

I used a local fork of it that I'd modified to give greater parity with String to do the investigation, replacing the inner value of RcString with it, and had a look at what errors there were. The main issue is that there is no way to have an as_str() method, so every time it uses a function that takes &str as an argument it forces an allocation to convert to a String. The only way I can see of mitigating that is using/rewriting libraries to use UTF-16 strings, which probably isn't feasible.
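A small sketch of why the missing as_str() hurts. Here `count_digits` is a made-up stand-in for any library function that only accepts `&str`, not a real API.

```rust
/// Stand-in for any library function that only accepts `&str`.
fn count_digits(s: &str) -> usize {
    s.chars().filter(|c| c.is_ascii_digit()).count()
}

/// With UTF-16 storage there is no borrowed `&str` view of the data,
/// so every call first builds a temporary `String`: one allocation
/// plus a full transcoding pass over the code units.
fn count_digits_utf16(units: &[u16]) -> usize {
    let tmp = String::from_utf16_lossy(units);
    count_digits(&tmp)
}
```

With UTF-8 storage the first function can be called directly on the existing buffer. With UTF-16 storage the wrapper's temporary `String` is paid for on every call.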

you may find https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/ interesting

That seems to suggest that Firefox does use an internal UTF-16 representation, with the exception of the optimisation they're describing, although it also suggests they have a custom regex engine which presumably also operates on UTF-16 strings.

Jan 19 '21 00:01 joshwd36

https://github.com/rylev/const-utf16 looks promising for converting Rust's string literals to UTF-16 literals in const and static contexts.

Oct 07 '21 18:10 jedel1043

For example, the string 🙂 should have length 2 as it is made up of two code units, but currently shows as having length 1

@joshwd36 or @jedel1043 are you able to explain a little more why this is happening? I would have expected the UTF-8 version to have a higher length but my knowledge in this area isn't great

Nov 02 '21 15:11 jasonwilliams

For example, the string 🙂 should have length 2 as it is made up of two code units, but currently shows as having length 1

@joshwd36 or @jedel1043 are you able to explain a little more why this is happening? I would have expected the UTF-8 version to have a higher length but my knowledge in this area isn't great

It was an old bug we had, but @joshwd36 fixed it here:

https://github.com/boa-dev/boa/commit/87d9e9cea82c1ea675082063f73296538bb3f46f#diff-796dedc2c80b4163e38e66d39288c24707abd5e32ff4151e32a561bf2b0488b7R959

Essentially, it was because UTF-8 encodes each code point as a variable-length sequence of 8, 16, 24 or 32 bits, and each such sequence is one whole "Unicode Scalar Value". "🙂" is a single scalar value, encoded in UTF-8 as one 32-bit sequence (F0 9F 99 82), hence a length of 1 when counting scalar values. However, JavaScript considers the length of a string to be the number of UTF-16 code units within the string, and "🙂" needs two 16-bit code units to be encoded in UTF-16, hence a length of 2.
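For reference, the three different "lengths" can be compared side by side in Rust. This is a sketch with a made-up helper name, using only standard library calls.

```rust
/// Returns (UTF-8 bytes, Unicode scalar values, UTF-16 code units) for `s`.
fn string_measures(s: &str) -> (usize, usize, usize) {
    (
        s.len(),                  // bytes in the UTF-8 encoding
        s.chars().count(),        // Unicode scalar values
        s.encode_utf16().count(), // UTF-16 code units (JavaScript's length)
    )
}
```

For "🙂" this gives (4, 1, 2): the UTF-8 byte length is indeed higher, but counting scalar values gives 1, while JavaScript expects the 2 from the last column.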

Nov 02 '21 15:11 jedel1043

@joshwd36 not sure if you saw but there’s now a PR for this https://github.com/boa-dev/boa/pull/1659

Oct 03 '22 11:10 jasonwilliams

This was closed in #1659

Oct 21 '22 08:10 Razican