purescript-strings

genUnicodeString generates invalid unicode

Open martyall opened this issue 2 years ago • 4 comments

The Unicode Char generator picks a random code point in the BMP. The Unicode string generator just generates an arbitrary array of such code points and turns it into a string. It turns out that this can generate invalid Unicode via unpaired surrogates: https://unicode.org/faq/utf_bom.html#utf16-7
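
To make the failure concrete, here is a minimal sketch. It is written in Python rather than PureScript (an assumption made so the snippet is easy to run), and `naive_bmp_string`, `round_trips`, and `first_failure` are hypothetical names, not the library's code. One caveat: Python strings are sequences of code points, so every surrogate counts as unpaired here, whereas in a UTF-16 string a high surrogate followed by a low one would be valid.

```python
import random

SURROGATES = range(0xD800, 0xE000)  # U+D800..U+DFFF, reserved for UTF-16 pairs


def naive_bmp_string(rng: random.Random, length: int) -> str:
    """Hypothetical stand-in: pick each code point uniformly from the whole BMP."""
    return "".join(chr(rng.randint(0x0000, 0xFFFF)) for _ in range(length))


def round_trips(s: str) -> bool:
    """True if s survives a strict UTF-8 encode/decode round trip."""
    try:
        return s.encode("utf-8").decode("utf-8") == s
    except UnicodeEncodeError:
        # Python's strict UTF-8 codec refuses to encode a lone surrogate.
        return False


def first_failure(rng: random.Random, attempts: int = 1000, length: int = 20):
    """Return the first generated string that fails the round trip, or None."""
    for _ in range(attempts):
        s = naive_bmp_string(rng, length)
        if not round_trips(s):
            return s
    return None
```

Roughly 1 in 32 BMP code points is a surrogate (2048 of 65536), so with 20-character strings a failing case should turn up within the first few attempts.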

One solution here would be to restrict the code points to avoid such cases. Another would be to figure out a more complicated but correct way to generate Unicode, one that cannot work CodePoint by CodePoint.
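
The first option could look roughly like the sketch below, again in Python as a stand-in for PureScript. `SAFE_RANGES`, `gen_bmp_scalar`, and `gen_unicode_string` are hypothetical names chosen for illustration. It only covers the BMP; generating characters above U+FFFF would need the second, pair-aware approach.

```python
import random

# The two BMP ranges that contain no surrogates (U+D800..U+DFFF is skipped).
SAFE_RANGES = [(0x0000, 0xD7FF), (0xE000, 0xFFFF)]


def gen_bmp_scalar(rng: random.Random) -> int:
    """Pick a BMP code point uniformly, never a surrogate."""
    sizes = [hi - lo + 1 for lo, hi in SAFE_RANGES]
    # Weight each range by its size so every allowed code point is equally likely.
    lo, hi = rng.choices(SAFE_RANGES, weights=sizes, k=1)[0]
    return rng.randint(lo, hi)


def gen_unicode_string(rng: random.Random, max_len: int = 20) -> str:
    """Build a string of up to max_len surrogate-free BMP characters."""
    n = rng.randint(0, max_len)
    return "".join(chr(gen_bmp_scalar(rng)) for _ in range(n))
```

Because no surrogate can appear, every generated string should survive a strict UTF-8 encode/decode round trip.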

For context, I discovered this while trying to write a QuickCheck test for UTF-8 encoding/decoding; you can see the failing test here

martyall avatar Sep 13 '23 16:09 martyall

On a related note, it appears that the Unicode Char generator includes the code point 65536 (chooseInt is inclusive?). Shouldn't this be 65535? (FYI, I've verified that this is not causing my error.)

martyall avatar Sep 13 '23 16:09 martyall

> On a related note, it appears that the Unicode Char generator includes the code point 65536 (chooseInt is inclusive?). Shouldn't this be 65535? (FYI, I've verified that this is not causing my error.)

It's wrong as a bound, but 65536 is just getting turned into 65535 via toEnumWithDefaults, which is effectively clamp.
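
A small sketch of what that clamping implies, in Python for ease of running. `clamp_to_char` is a hypothetical analogue of the clamping behaviour, not the `toEnumWithDefaults` signature, and it assumes the draw is inclusive on both ends.

```python
from collections import Counter

CHAR_MIN, CHAR_MAX = 0x0000, 0xFFFF  # 0..65535


def clamp_to_char(n: int) -> int:
    """Snap an out-of-range integer to the nearest valid Char code."""
    return max(CHAR_MIN, min(CHAR_MAX, n))


# Count how many inputs from an inclusive 0..65536 draw land on each output.
outputs = Counter(clamp_to_char(n) for n in range(0, 65536 + 1))

print(clamp_to_char(65536))  # 65535
print(outputs[65535])        # 2
print(outputs[65534])        # 1
```

So the off-by-one bound cannot produce an out-of-range Char; under these assumptions its only effect is that 65535 is reachable from two inputs instead of one.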

natefaubion avatar Sep 13 '23 16:09 natefaubion

Does something like this seem right?

https://github.com/f-o-a-m/purescript-bytestrings/pull/1/files#diff-71732b478b4808898d86c8591ad7ab46d8122c1e4facec4a9151ac49efba905dR107-R137

At least it passes the UTF-8 round trip.

martyall avatar Sep 13 '23 17:09 martyall

Seems reasonable to me :+1:

garyb avatar Sep 24 '23 10:09 garyb