Full UTF-16 support

Open joshwd36 opened this issue 5 years ago • 6 comments

Something I noticed when working on the string iterator (#704) is that strings are currently stored as ordinary Rust strings, which use UTF-8. However, the JavaScript standard specifies that strings should use UTF-16. There are a few places where this difference is noticeable. For example, the string 🙂 should have length 2 as it is made up of two code units, but currently shows as having length 1. Similarly, "🙂".charAt(0) should return some representation of the first code unit, which has a value of \ud83d and cannot be represented as a normal string.
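A minimal sketch of the mismatch described above, using only the Rust standard library. The helper names `js_length`, `code_unit_at` and `is_valid_utf16` are hypothetical and are not taken from boa's code.

```rust
/// Length as JavaScript defines it: the number of UTF-16 code units.
/// (Hypothetical helper, not boa's implementation.)
fn js_length(s: &str) -> usize {
    s.encode_utf16().count()
}

/// The UTF-16 code unit at `index`, if any. This is the value that
/// `charAt` or `charCodeAt` would have to expose.
fn code_unit_at(s: &str, index: usize) -> Option<u16> {
    s.encode_utf16().nth(index)
}

/// Whether a sequence of code units can be turned into a Rust `String`.
/// An unpaired surrogate makes this fail.
fn is_valid_utf16(units: &[u16]) -> bool {
    String::from_utf16(units).is_ok()
}
```

With these helpers, `js_length("\u{1F642}")` is 2, `code_unit_at("\u{1F642}", 0)` is `Some(0xD83D)`, and `is_valid_utf16(&[0xD83D])` is `false`, which is why the result of `charAt(0)` cannot be held in a `String`.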

There are a few ways this could be implemented:

  • Continue storing strings as Rust strings, and using str::encode_utf16(). This has the downside that certain operations, such as getting the length, now have to iterate through the whole string. It also complicates storing individual code units, such as "🙂".charAt(0).
  • Use the widestring crate or similar. Strings would therefore be stored as U16String, and only converted to Rust strings on display.
  • Just storing arrays of u16s. This probably wouldn't be a good idea as we'd probably end up reimplementing a lot of the functionality of the widestring crate.
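To make the trade-off between these options concrete, here is a rough sketch of a string backed by u16 code units. The type `Utf16String` and its methods are hypothetical, written for illustration only. It is neither the widestring crate's `U16String` nor boa's eventual type.

```rust
/// Hypothetical string type that stores UTF-16 code units directly.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Utf16String {
    units: Vec<u16>,
}

impl Utf16String {
    /// Converts from a Rust string once, up front.
    fn new(s: &str) -> Self {
        Self { units: s.encode_utf16().collect() }
    }

    /// Constant time, unlike `str::encode_utf16().count()`.
    fn len(&self) -> usize {
        self.units.len()
    }

    /// Returns a one-unit string, or an empty string when out of range.
    /// The result may be a lone surrogate, which this type can hold.
    fn char_at(&self, index: usize) -> Self {
        Self { units: self.units.get(index).copied().into_iter().collect() }
    }

    /// Conversion for display only. Unpaired surrogates become U+FFFD.
    fn to_string_lossy(&self) -> String {
        String::from_utf16_lossy(&self.units)
    }
}
```

Under these assumptions, `Utf16String::new("\u{1F642}").char_at(0)` holds the single unit 0xD83D and only loses information when it is displayed. The cost is that every method that `String` already provides would need an equivalent here, which is the reimplementation concern raised in the last option.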

joshwd36 avatar Sep 29 '20 18:09 joshwd36

I had a bit of a go at this and it's going to be quite challenging, especially with performance. There are a number of things that are only implemented for str or String, such as parsing, regex, and so on, all of which would need costly conversions between UTF-8 and UTF-16. This would suggest it might be best to just use Rust strings, but then we have the issue of storing invalid strings, which JavaScript supports.

joshwd36 avatar Jan 18 '21 23:01 joshwd36

We might need to research how other engines deal with this issue @joshwd36, thanks for looking into it. What was the outcome with the widestring crate?

you may find https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/ interesting

jasonwilliams avatar Jan 19 '21 00:01 jasonwilliams

What was the outcome with the widestring crate?

I used a local fork of it that I'd modified to give greater parity with String to do the investigation, replacing the inner value of RcString with it, and had a look at what errors there were. The main issue is that there is no way to have an as_str() method, so every time it uses a function that takes &str as an argument it forces an allocation to convert to a String. The only way I can see of mitigating that is using/rewriting libraries to use UTF-16 strings, which probably isn't feasible.
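The allocation problem can be illustrated with a small sketch. The function `parse_number` is hypothetical and stands in for any code path that hands a string to an API accepting only &str, here the standard library's `str::parse`.

```rust
/// Parses a number from UTF-16 code units.
/// (Hypothetical helper; it only illustrates the conversion cost.)
fn parse_number(units: &[u16]) -> Option<f64> {
    // There is no way to borrow UTF-16 data as `&str`, so a fresh
    // UTF-8 `String` has to be allocated on every call. The conversion
    // also fails if the units contain an unpaired surrogate.
    let owned = String::from_utf16(units).ok()?;
    owned.trim().parse::<f64>().ok()
}
```

Every call pays for a full transcoding pass and a heap allocation before the actual parsing starts, and the same would apply to regex matching or any other &str-only library.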

you may find https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/ interesting

That seems to suggest that Firefox does use an internal UTF-16 representation, with the exception of the optimisation they're describing, although it also suggests they have a custom regex engine which presumably also operates on UTF-16 strings.

joshwd36 avatar Jan 19 '21 00:01 joshwd36

https://github.com/rylev/const-utf16 looks promising to convert from Rust's string literals to UTF-16 literals in const and static contexts.

jedel1043 avatar Oct 07 '21 18:10 jedel1043

For example, the string 🙂 should have length 2 as it is made up of two code units, but currently shows as having length 1

@joshwd36 or @jedel1043 are you able to explain a little more why this is happening? I would have expected the UTF-8 version to have a higher length but my knowledge in this area isn't great

jasonwilliams avatar Nov 02 '21 15:11 jasonwilliams

For example, the string 🙂 should have length 2 as it is made up of two code units, but currently shows as having length 1

@joshwd36 or @jedel1043 are you able to explain a little more why this is happening? I would have expected the UTF-8 version to have a higher length but my knowledge in this area isn't great

It was an old bug we had, but @joshwd36 fixed it here:

https://github.com/boa-dev/boa/commit/87d9e9cea82c1ea675082063f73296538bb3f46f#diff-796dedc2c80b4163e38e66d39288c24707abd5e32ff4151e32a561bf2b0488b7R959

Essentially, it was because UTF-8 treats each of its variable-length sequences (8, 16, 24 or 32 bits) as a whole "Unicode Scalar Value", and "🙂" is represented in UTF-8 by a single 32-bit sequence (F0 9F 99 82), so counting scalar values gives a length of 1. However, JavaScript considers the length of a string to be the number of UTF-16 code units within the string, and "🙂" needs two 16-bit code units to be encoded in UTF-16, hence a length of 2.
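The three different "lengths" in this explanation can be compared side by side. This is a sketch with hypothetical names (`Lengths`, `lengths`), using only standard library calls.

```rust
/// The three ways of measuring one string that are discussed above.
#[derive(Debug, PartialEq, Eq)]
struct Lengths {
    /// Number of UTF-8 bytes, as returned by `str::len`.
    utf8_bytes: usize,
    /// Number of Unicode scalar values, as returned by `chars().count()`.
    scalar_values: usize,
    /// Number of UTF-16 code units, which is what JavaScript's `length` reports.
    utf16_units: usize,
}

fn lengths(s: &str) -> Lengths {
    Lengths {
        utf8_bytes: s.len(),
        scalar_values: s.chars().count(),
        utf16_units: s.encode_utf16().count(),
    }
}
```

For "🙂" (U+1F642) this gives 4 bytes, 1 scalar value and 2 code units. This also speaks to the question about UTF-8 having a higher length: the byte count is indeed higher, but a length of 1 comes from counting scalar values rather than bytes.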

jedel1043 avatar Nov 02 '21 15:11 jedel1043

@joshwd36 not sure if you saw but there's now a PR for this https://github.com/boa-dev/boa/pull/1659

jasonwilliams avatar Oct 03 '22 11:10 jasonwilliams

This was closed in #1659

Razican avatar Oct 21 '22 08:10 Razican