node icon indicating copy to clipboard operation
node copied to clipboard

TextDecoder does not error incorrectly for legacy byte sequences

Open domenic opened this issue 4 years ago • 9 comments

Version

v16.9.1

Platform

Microsoft Windows NT 10.0.19043.0 x64

Subsystem

encoding

What steps will reproduce the bug?

Enter the following in the REPL:

new TextDecoder("Big5").decode(new Uint8Array([0x83, 0x5C])).charCodeAt(0).toString(16)

as well as

new TextDecoder("Big5").decode(new Uint8Array([0x83, 0x5C])).charCodeAt(1).toString(16)

How often does it reproduce? Is there a required condition?

Every time

What is the expected behavior?

fffd for the first, and 5c for the second (as in Firefox and Chrome, and per the WHATWG Encoding Standard)

What do you see instead?

f00e and NaN

Additional information

I suspect this has to do with you using ICU as-is, instead of properly patching it to match the Encoding Standard. There are probably more bugs like this.

@inexorabletash may be able to point to where in the Chromium source tree we keep our ICU encoding patches.

domenic avatar Sep 12 '21 20:09 domenic

In ICU, it's important to use the "HTML" tag when selecting encodings, which should select the web-compatible encodngs based on Encoding. That started off as Chromium-specific but was upstreamed.

For Chromium's remaining customization, https://source.chromium.org/chromium/chromium/src/+/main:third_party/icu/README.chromium is the right place to start, and it references a patch for gb18030/windows-936

I should note that Chrome differs from the Encoding standard for a few encodings (e.g. GBK / GB18030 have differences)

inexorabletash avatar Sep 13 '21 16:09 inexorabletash

This problem remains on Node.js v21.1.0.

Deno 1.38.0 does not have this problem, at least for the minimal test case in the OP. I suspect Bun does not either since it uses WebKit's TextEncoder/TextDecoder implementation.

Relevant web platform tests are https://wpt.fyi/results/encoding?label=experimental&label=master&aligned .

domenic avatar Nov 07 '23 13:11 domenic

Hey can I take up this issue?

agilan11 avatar Jan 15 '25 16:01 agilan11

@domenic Hey Can I take up this issue ?

adityaroshanpatro avatar Jan 27 '25 07:01 adityaroshanpatro

Can I work on this

Atikrg avatar May 24 '25 19:05 Atikrg

Image I did get the expected value on first try. why i am unable to reproduce the bug

Atikrg avatar May 24 '25 19:05 Atikrg

I took a look into this issue, but it doesn't seem that the HTML-encodings have been upstreamed to ICU. I can see in Chromium source code that @inexorabletash referenced, there is an HTML tag in convrtrs.txt as well as several -html.ucm files, but they don't exist in unicode-org/icu repo, nor unicode-org/icu-data.

I tried to build Node.js with Chromium's in-tree ICU, and without changing any code in Node.js side, it gives the expected behavior here. Actually, if I just build the data file icudt77l.dat from Chromium's in-tree ICU, and put it into deps/icu-tmp, the behavior would also be correct.

Given that, I guess the fix for this issue would really be to change the process of obtaining ICU, or at least the ICU data file. The easiest way I've found so far to produce the ICU data file from Chromium ICU is

git clone --depth=1 https://chromium.googlesource.com/chromium/deps/icu chromium-icu
cd chromium-icu/source
./runConfigureICU Linux
make -j

then the data file can be found in source/data/out/tmp/icudtXXl.dat.

upsuper avatar Nov 13 '25 06:11 upsuper

This is also incorrect:

> new TextDecoder('Big5').decode(Uint8Array.of(0x80)) // should be '\ufffd'
'\x80'
> new TextDecoder('Big5', { fatal: true }).decode(Uint8Array.of(0x80)) // should throw
'\x80'

ChALkeR avatar Dec 09 '25 17:12 ChALkeR

While both the original issue and the previous comment deal with invalid input, the issue is worse

Encodings that should be ASCII supersets fail at being ASCII supersets

> new TextDecoder('ibm866').decode(Uint8Array.of(0x1A))
'\x1C'
> new TextDecoder('shift_jis').decode(Uint8Array.of(0x1A))
'\x1C'

> new TextDecoder('ibm866').decode(Uint8Array.of(0x1C))
'\x7F'
> new TextDecoder('shift_jis').decode(Uint8Array.of(0x1C))
'\x7F'

> new TextDecoder('ibm866').decode(Uint8Array.of(0x7F))
'\x1A'
> new TextDecoder('shift_jis').decode(Uint8Array.of(0x7F))
'\x1A'

ChALkeR avatar Dec 09 '25 19:12 ChALkeR

Welp, big5 is also wrong in Chrome: https://issues.chromium.org/issues/467727340

ChALkeR avatar Dec 11 '25 13:12 ChALkeR

Filed https://github.com/nodejs/node/issues/61041 with analysis

ChALkeR avatar Dec 13 '25 05:12 ChALkeR