Messy code when returning Chinese characters

Open Juice007 opened this issue 1 year ago • 4 comments

What version of protobuf and what language are you using? Version: main/v3.6.0/v3.5.0 Language: Go, Objective-C

What operating system (Linux, Windows, ...) and version? iOS

What runtime / compiler are you using (e.g., python version or gcc version)

What did you do? Steps to reproduce the behavior: 1. When the request returns Chinese text, iOS gets Unicode escape codes in the response header: [image] Decoding them into Chinese produces garbled text.

\U00e7\U0094\U00a8\U00e6\U0088\U00b7\U00e4\U00b8\U008d\U00e5\U00ad\U0098\U00e5\U009c\U00a8

Garbled text:

用户不存在

Expected Chinese: 用户不存在 ("User does not exist")

Juice007 avatar Apr 29 '24 11:04 Juice007

It looks like this has taken UTF-8 encoded text and turned each encoded byte into an individual code point, instead of decoding the UTF-8 into code points and then encoding those code points as \U escapes. When the \U codes are converted back into individual bytes, the proper Chinese text is produced.

puellanivis avatar May 02 '24 12:05 puellanivis

It looks like this has taken UTF-8 encoded text and turned the encoded bytes into individual code points. Instead of decoding to UTF-8 and then encoding those codepoints into \U codes. When converting this to the individual bytes, the proper Chinese text is produced.

Sorry. I'm still a little confused about what you mean.

Juice007 avatar May 06 '24 02:05 Juice007

@puellanivis Can you tell me in detail what I should do?

Juice007 avatar May 06 '24 13:05 Juice007

Sorry. I'm still a little confused about what you mean.

Removing all the \U00, and then hex decoding yields the intended Chinese text: https://go.dev/play/p/KG54AtomS5p
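The step described above can be sketched in a few lines of Go. This is only an illustration, not the code behind the playground link; the helper name decodeEscaped is hypothetical, and it assumes every escape in the input has the form \U00xx, as in the string from the issue.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// decodeEscaped is a hypothetical helper: it strips the `\U00` prefixes from
// the escaped text and hex-decodes what remains into raw UTF-8 bytes.
// Assumption: every escape has the form \U00xx (one byte per escape).
func decodeEscaped(escaped string) (string, error) {
	hexDigits := strings.ReplaceAll(escaped, `\U00`, "")
	raw, err := hex.DecodeString(hexDigits)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

func main() {
	escaped := `\U00e7\U0094\U00a8\U00e6\U0088\U00b7\U00e4\U00b8\U008d\U00e5\U00ad\U0098\U00e5\U009c\U00a8`
	s, err := decodeEscaped(escaped)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(s) // prints the recovered text
}
```

Each \U00xx escape carries exactly one byte of the original UTF-8 sequence, which is why dropping the prefix and hex-decoding the remainder is enough to recover the text.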

As an alternative to the Go playground link, thanks to the %-encoding of URIs, this can also be seen with a simple data URI: data:,%e7%94%a8%e6%88%b7%e4%b8%8d%e5%ad%98%e5%9c%a8 (Chrome shows me the same garbled nonsense on the page, but the URI shows the correct Chinese.)

Somehow the text seems to have been converted from UTF-8 bytes directly into Unicode code points without proper decoding, à la:

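import (
	"fmt"
	"strings"
)

// f reproduces the bug: every UTF-8 byte is written out as its own code point.
func f(correctString string) string {
	buf := new(strings.Builder)
	for _, b := range []byte(correctString) {
		fmt.Fprintf(buf, "%c", b) // %c treats the byte value as a code point (U+0000 to U+00FF)
	}
	return buf.String()
}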

https://go.dev/play/p/IPBEQzpuDce
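If the garbled string has already reached the client, the transformation above can in principle be reversed. The following is a minimal sketch, not something from the thread: the name unmangle is hypothetical, and it assumes each rune in the garbled string stands for exactly one original byte (U+0000 to U+00FF), which matches the \U00xx codes in the issue but would not hold if the text had passed through another code page along the way.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// unmangle is a hypothetical inverse of f: it maps each rune back to a single
// byte and checks that the recovered bytes form valid UTF-8.
// Assumption: every rune in the input is in the range U+0000..U+00FF.
func unmangle(garbled string) (string, error) {
	raw := make([]byte, 0, len(garbled))
	for _, r := range garbled {
		if r > 0xFF {
			return "", fmt.Errorf("rune %U does not fit in one byte", r)
		}
		raw = append(raw, byte(r))
	}
	if !utf8.Valid(raw) {
		return "", errors.New("recovered bytes are not valid UTF-8")
	}
	return string(raw), nil
}

func main() {
	// The code points from the issue, written as Go escapes.
	garbled := "\u00e7\u0094\u00a8\u00e6\u0088\u00b7\u00e4\u00b8\u008d\u00e5\u00ad\u0098\u00e5\u009c\u00a8"
	fixed, err := unmangle(garbled)
	if err != nil {
		fmt.Println("repair failed:", err)
		return
	}
	fmt.Println(fixed)
}
```

Repairing on the client only hides the symptom; fixing whichever side produces the wrong encoding is the more robust option.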

Without any further code, I can’t really help much beyond pointing out that it’s the correct text, just encoded wrong (see https://en.wikipedia.org/wiki/Mojibake). I will note that the Originmsg also appears to be incorrectly encoded, and is the likely source of the problem with the Returnmsg. Is the Returnmsg simply repeating whatever it got from the Originmsg? In that case, we’re not doing anything wrong at all: the client is encoding the Originmsg wrong.

puellanivis avatar May 06 '24 16:05 puellanivis

Thanks @puellanivis! Our final solution is for the response to return the URL-encoded Chinese string and for the client to URL-decode it, which solves the problem.

URL-encoded string: %E7%94%A8%E6%88%B7%E4%B8%8D%E5%AD%98%E5%9C%A8%0A

URL-decoded string: 用户不存在
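For readers who want to try the same workaround, here is a rough sketch of the round trip using Go's net/url package. The function names encodeHeaderValue and decodeHeaderValue are hypothetical, and the actual client in this thread is an iOS app, so its decoding step would use the platform's own percent-decoding rather than Go.

```go
package main

import (
	"fmt"
	"net/url"
)

// encodeHeaderValue is a hypothetical server-side helper: it percent-encodes
// the UTF-8 bytes of msg so the value sent to the client is plain ASCII.
func encodeHeaderValue(msg string) string {
	return url.QueryEscape(msg)
}

// decodeHeaderValue is the matching hypothetical client-side helper.
func decodeHeaderValue(v string) (string, error) {
	return url.QueryUnescape(v)
}

func main() {
	msg := "用户不存在"

	encoded := encodeHeaderValue(msg)
	fmt.Println(encoded) // %E7%94%A8%E6%88%B7%E4%B8%8D%E5%AD%98%E5%9C%A8

	decoded, err := decodeHeaderValue(encoded)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(decoded == msg) // true
}
```

Two details to watch: url.QueryEscape encodes a space as "+", so url.PathEscape (which emits %20) may be the safer choice if the client's decoder does not treat "+" as a space; and the encoded string quoted above ends in %0A, a percent-encoded newline, so the client will probably want to trim it after decoding.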

Juice007 avatar May 14 '24 12:05 Juice007