Add a utf8() mode that allows byte/UTF-8 strings as input & output.

Open • FGasper opened this issue 3 years ago • 1 comment

MAINTAINER: See what you think of this. I’ll add documentation updates if you’re amenable to the change itself.

JSON::PP has a number of options that indicate a desire to facilitate different applications’ nonstandard needs. For example, latin1() caters to applications that use Latin-1 encoding rather than UTF-8, even though doing so violates the JSON specification.

Some nontrivial Perl applications forgo character decoding. Their authors/maintainers may not know “perlunitut”’s recommended workflow, or the application may simply not care about Unicode. Either way, in such applications it’s ideal for a JSON encoder & decoder to forgo the usual UTF-8 decode/encode steps.

utf8(0) almost achieves this. It falls over, though, if the JSON document contains a Unicode character escape (e.g., "\u00e9"), which JSON::PP decodes as Perl "\xe9". This causes an inconsistency in the decode logic: "é" in UTF-8 will yield a different result from "\u00e9".
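A minimal sketch of that inconsistency, assuming a stock JSON::PP. The strings are wrapped in a JSON array only so that the example does not depend on the allow_nonref setting:

```perl
use strict;
use warnings;
use JSON::PP;

# utf8(0): the decoder treats its input as characters, not bytes.
my $json = JSON::PP->new->utf8(0);

# "é" supplied as its two raw UTF-8 bytes.
my $from_bytes = $json->decode(qq<["\xc3\xa9"]>)->[0];

# "é" supplied as a JSON Unicode escape (single-quoted, so Perl leaves \u alone).
my $from_escape = $json->decode(q<["\u00e9"]>)->[0];

# %vX prints the code point of each character in hex, joined by dots.
printf "raw bytes: %vX (length %d)\n", $from_bytes,  length $from_bytes;
printf "escape:    %vX (length %d)\n", $from_escape, length $from_escape;
print  "same string? ", ($from_bytes eq $from_escape ? "yes" : "no"), "\n";
```

The raw-byte form should come back as the two characters "\xc3\xa9", while the escaped form comes back as the single character "\xe9", so the two spellings of the same letter no longer compare equal.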

Ordinarily it works to do encode_utf8( JSON::PP->new->utf8->decode(..) ), but that falls over if applications need to allow non-UTF-8 sequences in JSON inputs.
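A rough sketch of that workaround and where it stops working. The helper name decode_to_bytes is hypothetical, used here only for illustration, and the inputs are wrapped in a JSON array to sidestep allow_nonref:

```perl
use strict;
use warnings;
use JSON::PP;
use Encode qw(encode_utf8);

my $json = JSON::PP->new->utf8;

# Hypothetical helper: decode to characters, then re-encode to UTF-8 bytes.
sub decode_to_bytes {
    my ($json_text) = @_;
    return encode_utf8( $json->decode($json_text)->[0] );
}

# Both spellings of "é" now end up as the same byte string.
my $from_bytes  = decode_to_bytes(qq<["\xc3\xa9"]>);
my $from_escape = decode_to_bytes(q<["\u00e9"]>);
print "consistent\n" if $from_bytes eq $from_escape;

# A lone \xff byte is not valid UTF-8, so the utf8-mode decoder rejects it.
my $ok = eval { decode_to_bytes(qq<["\xff"]>); 1 };
print "invalid input rejected: $@" unless $ok;
```

The round trip gives consistent results for well-formed UTF-8, but any input containing arbitrary bytes fails at the decode step before encode_utf8 ever runs.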

In short, a need exists for this Perl string:

qq<"\xff\xc3\xa9\xc3\xa9\\u00e9">

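… to decode to "\xff\xc3\xa9\xc3\xa9\xc3\xa9": the raw bytes pass through untouched, and the \u00e9 escape becomes the UTF-8 bytes "\xc3\xa9".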

This changeset adds a solution to this problem by changing utf8() from a simple flag to an enum: the existing chars-in-chars-out (0) and bytes-in-chars-out (1) options, plus a new bytes-in-bytes-out option. Named constants are added to avoid “magic numbers”.

Sep 07 '22 16:09 FGasper

I understand your point, but if the change breaks compatibility with JSON::XS, it's unacceptable, because JSON::PP is basically a fallback module for it. I am also reluctant to add a new mode if it's for JSON::PP only. Could you discuss this with the JSON::XS maintainer first?

Sep 07 '22 22:09 charsbar