
Add support for [AllowResizable] inputs

WebReflection opened this issue 11 months ago · 7 comments

What is the issue with the Encoding Standard?

I replied with a (apologies, bad) link in https://github.com/whatwg/encoding/issues/172, but I also documented the issues related to SharedArrayBuffer and resizable ArrayBuffer (not even shared) here: https://gist.github.com/WebReflection/3324b5ac79768c85efbf3b725d6a9d73

Background

We are using Dynamic Workers and Atomics to simulate a blocking operation from interpreted PLs (or even JavaScript itself); the flow does the following:

  • create a SAB of length 8 (2 * Int32 bytes) and post it, together with proxied details, to the main thread
  • wait synchronously at index 0 to be sure the notify happened (working around an old Firefox issue that won't be resolved) and until the length (bounded by the max positive Int32) of the resulting binary-serialized data is known
  • we postMessage a new SAB of that length + 4 bytes (due to the Firefox notify issue when index 0 is assigned 0 and then notified)
  • the binary-serialized outcome is stored into the SAB via a view, and index 0 is set to 1 to notify that it's ready
  • the worker grabs the binary content via the "same" view, deserializes it, and moves forward
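
A minimal sketch of the worker side of this dance (hypothetical message shapes; details stands for the proxied call details, and error handling is omitted):

const details = { /* proxied call details (hypothetical) */ };
// two round-trips per operation
const header = new Int32Array(new SharedArrayBuffer(8)); // [0] = ready flag, [1] = byte length
postMessage({ type: 'invoke', header, details });
Atomics.wait(header, 0, 0);                    // block until the main thread flips index 0
const length = header[1];                      // serialized byte length, now known
const sab = new SharedArrayBuffer(length + 4); // +4 bytes for the flag (Firefox notify quirk)
postMessage({ type: 'collect', sab });
Atomics.wait(new Int32Array(sab, 0, 1), 0, 0); // block until the result is stored
const serialized = new Uint8Array(sab, 4, length);
// ...binary-deserialize `serialized` and move forward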

Issues with this approach

Mostly performance, but also memory: a tab with a worker that uses this strategy consumes a lot of RAM (at least twice what's needed) for every single operation until it completes. That's bad for mobile phones and less powerful devices, or for people with dozens of open tabs that use a similar strategy.

Ideally

I am working on refactoring that dance into this ideal flow:

  • we create a growable SharedArrayBuffer (with a maxByteLength up to the Int32 boundary, or half of it) on the worker; it can also be reused, as there's no concurrency while the worker is synchronously waiting via Atomics
  • we handle the proxy details and send that SAB right away
  • the main thread binary-serializes results directly into that SAB and notifies that it's ready
  • the worker binary-deserializes the reused SAB and keeps going
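
A sketch of the refactored worker side, assuming a growable SAB, a hypothetical deserialize helper, and an arbitrary illustrative maxByteLength:

// one reusable growable SAB, one postMessage per operation
const sab = new SharedArrayBuffer(16, { maxByteLength: 1 << 30 });
const flag = new Int32Array(sab, 0, 1); // [0] = ready flag
function invoke(details) {
  postMessage({ type: 'invoke', details, sab });
  Atomics.wait(flag, 0, 0);             // the main thread grows sab, writes, flips the flag
  const length = new Int32Array(sab, 4, 1)[0];
  const result = deserialize(new Uint8Array(sab, 8, length)); // hypothetical helper
  Atomics.store(flag, 0, 0);            // reset the flag for the next round-trip
  return result;
}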

This refactoring has the following obvious advantages:

  • there is only one SAB per worker (and usually one worker per main thread), thanks to the fact that growable SABs are widely usable these days
  • there is a single postMessage dance
  • the binary serializer never creates an unnecessary intermediate representation of the value that needs to be stored into the buffer
  • the binary deserializer never creates unnecessary intermediate buffer slices to retrieve the result of this round-trip
  • memory consumption is kept minimal, growth is predictable, and everything is faster

The current issue

I've spent way more time than I should have finding a performant way to avoid one-off creation of typed array views that bloat RAM and bother the GC, because new TextEncoder().encodeInto(str, view) does not work with SharedArrayBuffer and neither does new TextDecoder().decode(view). Most importantly, even when I fall back to a resizable (non-shared) ArrayBuffer, the "The provided Uint8Array value must not be resizable" error comes up.
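
Even the non-shared fallback fails today; a minimal repro with a resizable ArrayBuffer:

const rab = new ArrayBuffer(8, { maxByteLength: 16 }); // resizable, not shared
new TextEncoder().encodeInto("hello", new Uint8Array(rab));
// TypeError: The provided Uint8Array value must not be resizable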

To solve these issues I ended up ignoring these APIs entirely, because they are not suitable for more complex scenarios, and I started wondering what the whole purpose of encodeInto and decode is when the places that need them most, binary data that travels across realms or WASM exchanges, cannot use memory that grows and shrinks on demand.

Issue summary

  • these APIs have an extremely narrow use case, which easily results in bloated RAM because new buffers must be created all over the place by design
  • these APIs can easily be bypassed on purpose via DataView or direct view manipulation in JS code (see the sketch after this list), entirely defeating the original guards meant to help developers; in reality these limitations just get in the way when any developer who cares about RAM and performance would like to use the platform
  • it's not clear why even a resizable, singly owned ArrayBuffer cannot be used for synchronous blocking operations such as encodeInto
  • it's not clear how developers who are sure that a SAB has no concurrent access, and is safe to use, can use these APIs
  • it's clear (to me), and sad, that I should avoid these APIs instead of trusting the platform to do the right thing
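
As an illustration of the second point, nothing stops plain JS from writing text bytes straight into a view over a growable SAB; a deliberately ASCII-only sketch (a real codec must also handle multi-byte UTF-8):

function encodeAsciiInto(str, view /* Uint8Array over a growable SAB */) {
  let written = 0;
  while (written < str.length && written < view.length) {
    const code = str.charCodeAt(written);
    if (code > 0x7f) break; // multi-byte UTF-8 is out of scope for this sketch
    view[written++] = code;
  }
  return { read: written, written };
}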

I hope at least some of these concerns can be tackled or answered. Regardless, I feel that none of these issues is well documented out there, so I'll write a post with demos and benchmarks about how to bypass TextEncoder & TextDecoder, but I'm afraid that won't make developers happy, rather slightly confused by the fact that anyone can work around these limitations by ignoring native APIs and keep doing what they need to do.

Thanks for your patience in reading this and thanks in advance for any possible action around these issues that could make usage of these native APIs more appealing in the (hopefully) near future.

WebReflection · Mar 16 '25

The issue you describe is not an issue with the specification. It's a bug in Chrome and Firefox:

  • https://wpt.fyi/results/encoding/encodeInto.any.html
  • https://wpt.fyi/results/encoding/sharedarraybuffer.https.html

Right?

annevk · Apr 24 '25

Also, it would be much better to state upfront what problem you are trying to solve. That's a lot of text for "SharedArrayBuffer does not work".

annevk · Apr 24 '25

Right?

the first link doesn't explicitly tell me whether the SharedArrayBuffer or the ArrayBuffer is resizable, and the commit link, or the link beside it, gives me a 404.

the second link does not use a growable SharedArrayBuffer, so I am not sure whether those are extra bugs on top of this one.

Also, it would be much better to state upfront what problem you are trying to solve.

I want to use a resizable SharedArrayBuffer with TextEncoder.encode and TextDecoder.decode, and that is not possible. It's summarized in this gist, which contains a reference setup and tests that either pass or fail when the ArrayBuffer or SharedArrayBuffer is resizable: https://gist.github.com/WebReflection/3324b5ac79768c85efbf3b725d6a9d73

The minimal reproducible case is:

const ref = new TextEncoder().encode("hello");

const gsab = new Uint8Array(new SharedArrayBuffer(5, { maxByteLength: 16 })); // length-tracking view over a growable SAB
gsab.set(ref, 0);

// decoder failing
const decoder = new TextDecoder;
decoder.decode(gsab); // ❌
decoder.decode(gsab.subarray(0)); // ❌

// encoder failing
const encoder = new TextEncoder;
encoder.encodeInto("hello", gsab); // ❌
encoder.encodeInto("hello", gsab.subarray(0)); // ❌

If I don't use these primitives, or I manually create a typed array and then set it into the SharedArrayBuffer, it works, but that means duplicated RAM, which is not an option on constrained devices when the serialized data could also contain blobs or buffers that grow twice as large and require the GC to kick in before the memory can be reused.
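
In other words, the only working path today is the memory-doubling one:

// works, but allocates a throwaway copy of every payload
const tmp = new TextEncoder().encode("hello"); // fresh Uint8Array over a plain ArrayBuffer
gsab.set(tmp, 0);                              // second copy, into the shared view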

That is the issue: I ended up ignoring these APIs because they don't help me with SharedArrayBuffer, but it would be great if they could work with resizable buffers the way DataView or other TypedArrays already can.

WebReflection · Apr 24 '25

That's strange, neither link 404s for me.

  • I think allowing this for decode() is straightforward and something we should do. It makes a copy (that could be optimized) so it doesn't really matter if the underlying buffer can change in size.
  • I think the same applies to TextDecoderStream. "decode and enqueue a chunk" makes a copy (that could be optimized) so it doesn't really matter if the underlying buffer can change in size.
  • For encodeInto() it's a bit less clear. Should the API attempt to grow the buffer if the input doesn't fit, for instance? And can the write be atomic so it doesn't grow or shrink at the same time we are writing into it?

annevk · Apr 24 '25

Should the API attempt to grow the buffer if the input doesn't fit, for instance?

no, the returned written value is already clear and works as expected when the buffer is not shared, so I think it should do exactly what it does with every other buffer. That keeps expectations clear and involves less magic, especially because the new length might not be available if it exceeds maxByteLength. I'd keep it simple, to be honest, as long as we can work with resizable buffers.
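
For reference, the existing contract already reports partial writes:

const small = new Uint8Array(3);
const { read, written } = new TextEncoder().encodeInto("hello", small);
// read === 3, written === 3: the caller decides whether to grow and retry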

This is also because, while a SharedArrayBuffer can only grow, an ArrayBuffer can shrink, so that if the new length were silently ignored there would be no way to recover from a one-off, overly demanding encoding (i.e. a very long string when the norm is a short one).

For a SAB that's inevitable, and a new SAB might be created when needed, but those returned values help to properly orchestrate any logic around them.

can the write be atomic so it doesn't grow or shrink at the same time we are writing into it?

it needs to be atomic, hence it should not grow the buffer while writing ... let developers deal with that and properly use the returned read and written values ... keep it consistent, don't change the logic, just accept resizable buffers or SABs?
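
A sketch of the caller-driven grow-and-retry loop this implies, assuming encodeInto eventually accepts views over a resizable ArrayBuffer:

const encoder = new TextEncoder();
function encodeAll(str, rab /* resizable ArrayBuffer */) {
  let read = 0, written = 0;
  for (;;) {
    const view = new Uint8Array(rab, written); // window past what's already written
    const r = encoder.encodeInto(str.slice(read), view);
    read += r.read;
    written += r.written;
    if (read >= str.length) return written;    // done: total bytes written
    if (rab.byteLength >= rab.maxByteLength) throw new Error('out of room');
    rab.resize(Math.min(rab.byteLength * 2, rab.maxByteLength));
  }
}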

WebReflection · Apr 24 '25

it's not clear how developers who are sure that a SAB has no concurrent access, and is safe to use, can use these APIs

If we could trust the developer it would be fine for them to say "I promise you have exclusive access to this range of this SAB". But some developers are malicious, and would like to exploit any sensitivity our text codecs have to concurrent modification of the input or output.

If I don't use these primitives, or I manually create a typed array and then set it into the SharedArrayBuffer, it works, but that means duplicated RAM, which is not an option on constrained devices when the serialized data could also contain blobs or buffers that grow twice as large and require the GC to kick in before the memory can be reused.

In practice browsers have to deal with the input or output being in a SharedArrayBuffer by adding an extra copy. We could in principle implement our text codecs in wasm so that the JavaScript VM would protect us from security issues, but we don't want to ship two copies of our text codecs and we can't compromise their performance in the normal case.

In some cases it is simply better to ship your own text codecs rather than use the platform ones.

ricea · Apr 30 '25

Upon further consideration I think supporting [AllowResizable] for all of these is reasonable. The weirdest thing that could happen is that encodeInto() returns numbers that are no longer accurate because the buffer grew, but that would only happen in some kind of race with the encodeInto() call, at which point that's to be expected.

And yes, the specification will continue to define this as doing a copy that could potentially be elided with a lot of careful work.

annevk · Apr 30 '25