encoding icon indicating copy to clipboard operation
encoding copied to clipboard

Add a static decode and encode method to `TextEncoder` and `TextDecoder`

Open lucacasonato opened this issue 4 years ago • 10 comments

The large majority of users of TextEncoder and TextDecoder do not make use of streaming capabilities. Instead they just need to decode a single chunk, or encode a single chunk. We should consider adding a static utility method to TextEncoder / TextDecoder to cater to this use case.

// current
const data = new TextDecoder().decode(new Uint8Array(/** data here */));

// proposed
const data = TextDecoder.decode(new Uint8Array(/** data here */));

I propose the following API be added:

dictionary TextDecoderOptionsWithLabel : TextDecoderOptions {
  DOMString label = "utf-8";
};

interface TextDecoder {
  static USVString decode(optional [AllowShared] BufferSource input, optional TextDecoderOptionsWithLabel options = {});
}

interface TextEncoder {
  static Uint8Array encode(optional USVString input = "");
  static TextEncoderEncodeIntoResult encodeInto(USVString source, [AllowShared] Uint8Array destination);
}

The exact name of the method should be something different though - the static method and prototype method would name clash in this example which is confusing.

Internally this would behave exactly like a single use non streaming encoder.

lucacasonato avatar Jun 29 '21 19:06 lucacasonato

What about encodeInto()?

The exact name of the method should be something different though - the static method and prototype method would name clash in this example which is confusing.

Why would that be confusing?

domenic avatar Jun 29 '21 20:06 domenic

What about encodeInto()?

Added - forgot it existed.

Why would that be confusing?

Just checked spec - wouldn't just be confusing but also against webidl spec:

"The identifier of a static operation also must not be the same as the identifier of a regular operation defined on the same interface." - https://heycam.github.io/webidl/#idl-operations

As to why it would be confusing: "just call TextDecoder.decode" would have two meanings now. Also what would for example the MDN link be? https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/decode is already taken for the prototype method.

lucacasonato avatar Jun 29 '21 20:06 lucacasonato

We can fix Web IDL.

As for MDN's problem, they've had a big issue with bad documentation for prototype methods for a long time. They need to fix it. TextEncoder.decode() has always meant a static method (which hasn't existed); TextEncoder.prototype.decode() or textEncoder.decode() are how you write the instance methods.

domenic avatar Jun 29 '21 20:06 domenic

It's unclear what the benefit of the proposed change is. It doesn't save significant typing (just new); in cases where you want to avoid allocating a new instance, the encoder/decoder can be cached and used statelessly.

Given a time machine to redesign the API now that streams are a thing, this makes more sense for non-streaming encodes/decodes, but is it worth adding to the platform?

inexorabletash avatar Jun 29 '21 23:06 inexorabletash

If we add API surface that doesn't work in old browsers, I'd like to know if users of the potential API surface care about non-UTF-8 encodings. If not, I'd prefer adding new API surface to JavaScript String itself instead of TextDecoder, since the WebIDL layer is a perf bottle neck.

  • Constructing a String from Uint8Array assuming UTF-8 semantics.
  • Constructing a String from Uint16Array assuming WTF-16 semantics.
  • Method on String that writes UTF-8 into a Uint8Array with the encodeInto semantics but returns an array of numbers instead of a WebIDL type.
  • Method on String that writes WTF-16 into a Uint16Array.

hsivonen avatar Jul 05 '21 09:07 hsivonen

cc @bakkot @littledan

annevk avatar Jul 19 '21 09:07 annevk

https://github.com/whatwg/webidl/pull/1098 should technically unblock this. Are any browser vendors objected to this? If not, I'd like to start work on this.

lucacasonato avatar Mar 04 '22 00:03 lucacasonato

since the WebIDL layer is a perf bottle neck

@hsivonen How so? In Deno we have an optimized path in our WebIDL bindings for this specific API that bypasses the slow USVString conversion completely. This should be possible in other implementations too.

During encode you can convert the first argument to a DOMString (which is fast), and then just perform a lenient UTF-8 encoding on that string, which will already perform lone surrogate conversion. https://github.com/denoland/deno/blob/main/ext/web/08_text_encoding.js#L166-L175

lucacasonato avatar Mar 04 '22 00:03 lucacasonato

since the WebIDL layer is a perf bottle neck

@hsivonen How so? In Deno we have an optimized path in our WebIDL bindings for this specific API that bypasses the slow USVString conversion completely. This should be possible in other implementations too.

Gecko doesn't actually use USVString in TextEncoder WebIDL, either. However, the WebIDL layer still has enough overhead that e.g. wasm-bindgen conditionally avoids going through the WebIDL layer.

hsivonen avatar May 23 '22 11:05 hsivonen

Gecko doesn't actually use USVString in TextEncoder WebIDL, either. However, the WebIDL layer still has enough overhead that e.g. wasm-bindgen conditionally avoids going through the WebIDL layer.

Mh, that seems very weird to me. I very much doubt that WebIDL inherently needs to be a slowdown here. Currently the overhead likely stems from most engines slowing down when calling from JS to native code. This is a performance issue that can be fixed. In V8 based runtimes, TextEncoder.encodeInto can use V8's "fast API" for better JIT inlined performance for example.

The only WebIDL conversion taking place here is the DOMString conversion, which really is just a call to ToString which is a 0 cost op if the call is already a string. A native API in ES would also need to do some runtime type checking (likely the exact same checks), so I really don't see why performance should be a reason to have a native JS API.

TLDR: I don't think spec decisions should be driven by current performance issues in engines, as I don't think that there are any technical reasons for why TextEncoder#encodeInto must be slower than String#encodeIntoUTF8. It's just a matter of optimizing the codepaths as far as I can tell.

Please correct me if I missed something obvious here.

lucacasonato avatar May 23 '22 11:05 lucacasonato