Certain Chinese characters are encoded with \U... prefix

Open tg-m opened this issue 4 years ago • 1 comments

Hi,

I was switching from a Python-based YAML implementation to libyaml (for faster _load and _dump), and it seems that some Chinese characters are not emitted/dumped correctly (or maybe it is a feature?). This happens regardless of how the input string is written (single-quoted, double-quoted, or as a |- block scalar). I have enclosed the terminal output below.

I am aware that these characters are above 0xFFFF. (And that they are character components rather than full characters, in case someone wants to point out that they are not in current/wide use.)

libyaml version: acd6f6f014c25e46363e718381e0b35205df2d83 (HEAD of master as of 2021.07.01)

𠂉 is changed to \U00020089, and 𠂤 to \U000200A4.

Also, when 𠂤 or 𠂉 appears in the input, the whole string is put into (double) quotes, even if it was originally a single-quoted or |- block scalar.

$ ./run-emitter -u /tmp/in.yaml 
[1] Parsing, emitting, and parsing again '/tmp/in.yaml': PASSED (length: 255)
Hanzi: |-
  (卌) (𠂉) (夕㐄) (舞)  [𠂤阜]  (灬) (卌) (𠂉) (無)  (夕㐄)
Inline: "(灬) (卌) (𠂉) (無) [𠂤阜]"
OneQuote: '(灬) (卌) (𠂉) (無) [𠂤阜]'
WontQuote: |-
  (卌) (夕㐄) (舞)
#### (length: 216)
OUTPUT:
Hanzi: "(卌) (\U00020089) (夕㐄) (舞)  [\U000200A4阜]  (灬) (卌) (\U00020089) (無)  (夕㐄)"
Inline: "(灬) (卌) (\U00020089) (無) [\U000200A4阜]"
OneQuote: "(灬) (卌) (\U00020089) (無) [\U000200A4阜]"
WontQuote: |-
  (卌) (夕㐄) (舞)
#### (length: 255)
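
For readers who want to check which characters are affected, here is a minimal Python sketch that mimics the observed output. It is not libyaml's code; the helper name `escape_non_bmp` is hypothetical, and the rule it applies (escape every code point above U+FFFF as `\UXXXXXXXX`) is only an assumption inferred from the output above.

```python
def escape_non_bmp(text: str) -> str:
    """Replace each code point above U+FFFF with a \\UXXXXXXXX escape.

    Assumption: this only imitates the emitter output shown in the issue;
    it is not how libyaml itself is implemented.
    """
    return "".join(
        f"\\U{ord(ch):08X}" if ord(ch) > 0xFFFF else ch
        for ch in text
    )


# The two affected characters, written as escapes so the code points are explicit.
sample = "(卌) (\U00020089) [\U000200A4阜]"

# List the characters that lie outside the Basic Multilingual Plane.
for ch in sample:
    if ord(ch) > 0xFFFF:
        print(f"U+{ord(ch):05X} is above 0xFFFF")

print(escape_non_bmp(sample))
# prints: (卌) (\U00020089) [\U000200A4阜]
```

Characters such as 卌 or 阜 stay as they are because they fall inside the Basic Multilingual Plane, which matches the WontQuote entry being emitted unchanged.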

tg-m, Jul 01 '21 17:07

Hi, would you please provide in.yaml so we can reproduce it?

ziyangc97, Sep 02 '22 07:09