Add support for non UTF-8 json input

Open wbprime opened this issue 1 year ago • 5 comments

Is your feature request related to a problem? Please describe.

sonic-rs fails if the input bytes contain non-UTF-8 characters, even with the pub fn from_slice<'a, T>(json: &'a [u8]) function. However, there are cases where JSON bytes in a non-UTF-8 encoding need serialize/deserialize support, typically GBK/GB18030 in China.
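To make the failure mode concrete, here is a minimal sketch using only the Rust standard library. It does not call sonic-rs; it only shows that a JSON document whose string value is GBK-encoded is not well-formed UTF-8, which is the condition a UTF-8-only parser rejects. The helper names `is_valid_utf8` and `gbk_json` are hypothetical and exist only for this illustration.

```rust
/// Returns true when `bytes` is well-formed UTF-8, the precondition that a
/// UTF-8-only JSON parser relies on.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Builds the JSON document {"lang":"中文"} with the string value encoded
/// as GBK. In GBK, "中" is 0xD6 0xD0 and "文" is 0xCE 0xC4.
fn gbk_json() -> Vec<u8> {
    let mut json = b"{\"lang\":\"".to_vec();
    json.extend_from_slice(&[0xD6, 0xD0, 0xCE, 0xC4]);
    json.extend_from_slice(b"\"}");
    json
}
```

The same document encoded as UTF-8 passes the check, while the GBK bytes fail it because 0xD6 starts a two-byte UTF-8 sequence and 0xD0 is not a valid continuation byte.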

Describe the solution you'd like

  • add support for non-UTF-8 encoded JSON bytes in the from_slice function
  • or drop the from_slice function

Describe alternatives you've considered

N/A.

Additional context

N/A.

wbprime avatar Apr 19 '24 05:04 wbprime

Hello, according to the JSON RFC, Unicode encoding is required.

image

Furthermore, do other JSON libraries such as serde_json or simd_json support non-UTF-8 input?

PureWhiteWu avatar Apr 19 '24 06:04 PureWhiteWu

@PureWhiteWu Sorry for the late reply.

serde_json can deserialize non-UTF-8 bytes. I have not tested simd_json.

I am aware of your design principle of adhering to the JSON standard. However, UTF-8 is not the only encoding of Unicode. If, say, UTF-16 support is on your roadmap, support for other non-Unicode encodings could perhaps be achieved with little extra effort.
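For the UTF-16 case, one possible approach is to transcode to UTF-8 before handing the bytes to a UTF-8-only parser. The sketch below uses only the standard library and assumes little-endian input without a byte order mark; `utf16le_to_utf8` is a hypothetical helper, not part of sonic-rs.

```rust
/// Transcodes UTF-16LE bytes into a UTF-8 `String`.
/// Returns `None` for odd-length input or unpaired surrogates.
fn utf16le_to_utf8(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    // Reassemble each pair of bytes into one little-endian UTF-16 code unit.
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}
```

The resulting `String` can then be passed as bytes (`as_bytes()`) to any parser that expects UTF-8. A non-Unicode encoding such as GBK would need a mapping table in place of `String::from_utf16`.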

Moreover, the JSON standard suggests supporting non-UTF-8 encodings as an implementation extension.

One last point: GBK/GB18030, much like UTF-8, stays compatible with ASCII, which makes it easy to support.

Thanks

wbprime avatar Apr 24 '24 11:04 wbprime

Thanks, could you give a test case with code? As far as I know, serde_json only avoids failing on invalid UTF-8 when deserializing into bytes.

liuq19 avatar Apr 25 '24 05:04 liuq19

@liuq19 See this repository for your convenience.

wbprime avatar Apr 26 '24 07:04 wbprime

Thanks, we will investigate it.

liuq19 avatar Apr 26 '24 08:04 liuq19