Full UTF-16 support
Something I noticed when working on the string iterator (#704) is that strings are currently stored as ordinary Rust strings, which use UTF-8. However, the JavaScript standard specifies that strings should use UTF-16. There are a few places where this difference is noticeable. For example, the string "🙂" should have length 2 as it is made up of two code units, but currently shows as having length 1. Similarly, "🙂".charAt(0) should return some representation of the first code unit, which has a value of \ud83d and cannot be represented as a normal string.
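To make the expected behaviour concrete, here is a minimal sketch of how length and code unit lookup could be computed on top of a Rust string. The helper names `js_length` and `js_code_unit_at` are hypothetical and only for illustration; they are not existing functions in the engine.

```rust
/// Length as JavaScript defines it: the number of UTF-16 code units.
/// This walks the whole string, so it is O(n) rather than O(1).
fn js_length(s: &str) -> usize {
    s.encode_utf16().count()
}

/// The UTF-16 code unit at `index`, or `None` if `index` is out of range.
/// The result may be a lone surrogate, which a Rust `String` cannot hold.
fn js_code_unit_at(s: &str, index: usize) -> Option<u16> {
    s.encode_utf16().nth(index)
}
```

For the string "🙂" (U+1F642), `js_length` gives 2 and `js_code_unit_at(s, 0)` gives the high surrogate 0xD83D, whereas `s.chars().count()` gives 1.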
There are a few ways this could be implemented:
- Continue storing strings as Rust strings, and use str::encode_utf16(). This has the downside that certain operations, such as getting the length, now have to iterate through the whole string. It also complicates storing individual code units, such as "🙂".charAt(0).
- Use the widestring crate or similar. Strings would therefore be stored as U16String, and only converted to Rust strings on display.
- Just store arrays of u16s. This probably wouldn't be a good idea, as we'd probably end up reimplementing a lot of the functionality of the widestring crate.
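As a rough illustration of the trade-off between these options, the sketch below stores code units directly. `Utf16Str` is a hypothetical stand-in written only for this example (it is not the widestring API); it shows that a UTF-16 buffer gives a constant-time length and can hold a lone surrogate, while a Rust `String` cannot.

```rust
/// Hypothetical stand-in for a UTF-16 backed string type.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Utf16Str(Vec<u16>);

impl Utf16Str {
    /// Transcode from a Rust string (one pass, one allocation).
    fn new(s: &str) -> Self {
        Utf16Str(s.encode_utf16().collect())
    }

    /// Number of code units; O(1) because the units are stored directly.
    fn len(&self) -> usize {
        self.0.len()
    }

    /// A one-unit string for the code unit at `index`, like `charAt`.
    /// The result may be a lone surrogate.
    fn char_at(&self, index: usize) -> Option<Utf16Str> {
        self.0.get(index).map(|&unit| Utf16Str(vec![unit]))
    }

    /// Convert for display; invalid units become U+FFFD.
    fn to_string_lossy(&self) -> String {
        String::from_utf16_lossy(&self.0)
    }
}
```

With this, `Utf16Str::new("🙂").char_at(0)` holds the single unit 0xD83D. Passing that unit to `String::from_utf16` returns an error, which is why the first option has trouble representing the result of charAt.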
I had a bit of a go at this and it's going to be quite challenging, especially with performance. There are a number of things that are only implemented for str or String, such as parsing, regex, and JSON, all of which would need costly conversions between UTF-8 and UTF-16. This would suggest it might be best to just use Rust strings, but then we have the issue of storing invalid strings, which JavaScript supports.
We might need to research how other engines deal with this issue. @joshwd36, thanks for looking into it. What was the outcome with the widestring crate?
you may find https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/ interesting
What was the outcome with the widestring crate?
I used a local fork of it that I'd modified to give greater parity with String to do the investigation, replacing the inner value of RcString with it, and had a look at what errors there were. The main issue is that there is no way to have an as_str() method, so every time it uses a function that takes &str as an argument it forces an allocation to convert to a String. The only way I can see of mitigating that is using/rewriting libraries to use UTF-16 strings, which probably isn't feasible.
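The allocation problem can be sketched with only the standard library. Here `parse_number` is a hypothetical stand-in for any API that accepts only `&str`; calling it from UTF-16 storage forces a transcode into a freshly allocated `String` first, and that conversion can fail on lone surrogates.

```rust
/// Stand-in for any library function that only accepts `&str`.
fn parse_number(text: &str) -> Option<f64> {
    text.trim().parse().ok()
}

/// Calling it from UTF-16 storage: there is no borrowed `&str` view of the
/// code units, so they are transcoded into a new `String` on every call.
fn parse_number_utf16(units: &[u16]) -> Result<Option<f64>, std::string::FromUtf16Error> {
    let owned: String = String::from_utf16(units)?; // allocates here
    Ok(parse_number(&owned))
}
```

Every call site of this shape pays for one allocation and one full pass over the string, which is the cost described above.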
you may find https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/ interesting
That seems to suggest that Firefox does use an internal UTF-16 representation, with the exception of the optimisation they're describing, although it also suggests they have a custom regex engine which presumably also operates on UTF-16 strings.
https://github.com/rylev/const-utf16 looks promising for converting Rust's string literals to UTF-16 literals in const and static contexts.
For example, the string "🙂" should have length 2 as it is made up of two code units, but currently shows as having length 1
@joshwd36 or @jedel1043 are you able to explain a little more why this is happening? I would have expected the UTF-8 version to have a higher length but my knowledge in this area isn't great
It was an old bug we had, but @joshwd36 fixed it here:
https://github.com/boa-dev/boa/commit/87d9e9cea82c1ea675082063f73296538bb3f46f#diff-796dedc2c80b4163e38e66d39288c24707abd5e32ff4151e32a561bf2b0488b7R959
Essentially, it was because the old code counted Unicode scalar values. UTF-8 encodes each scalar value as a variable-length sequence of 8, 16, 24 or 32 bits, and "🙂" is encoded in UTF-8 as a single 32-bit (four-byte) sequence (F0 9F 99 82) representing one scalar value, hence a length of 1. However, JavaScript considers the length of a string to be the number of UTF-16 code units within the string, and "🙂" needs two 16-bit code units to be encoded in UTF-16, hence a length of 2.
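The three ways of counting can be compared directly with standard library calls. The helper names below are hypothetical and exist only for this sketch.

```rust
/// Returns (UTF-8 bytes, Unicode scalar values, UTF-16 code units) for `s`.
fn string_sizes(s: &str) -> (usize, usize, usize) {
    (s.len(), s.chars().count(), s.encode_utf16().count())
}

/// The surrogate pair for `c`, or `None` if `c` fits in one UTF-16 code unit.
fn surrogate_pair(c: char) -> Option<(u16, u16)> {
    let mut buf = [0u16; 2];
    match c.encode_utf16(&mut buf) {
        [high, low] => Some((*high, *low)),
        _ => None,
    }
}
```

For "🙂" (U+1F642), `string_sizes` gives 4 bytes, 1 scalar value and 2 code units, and `surrogate_pair` gives (0xD83D, 0xDE42).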
@joshwd36 not sure if you saw but there's now a PR for this https://github.com/boa-dev/boa/pull/1659
This was closed in #1659