
Use correct encoding

Open flying-sheep opened this issue 9 years ago • 14 comments

on windows, to correctly parse text, you need to do:

parse(text = code, encoding = Encoding(code))

if the code has set an encoding, it gets returned by Encoding. Else, Encoding returns 'unknown', the default value for encoding in parse.

ideally, parse_all should internally do the above instead of an encoding-less parse call.

an alternative would be to pass down an encoding parameter through evaluate → parse_all → parse
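A minimal sketch of the first option, assuming a wrapper named `parse_with_encoding` (a hypothetical name, not a function in evaluate). It only forwards the encoding mark to `parse()`; it does not re-encode the input, so whether it changes anything on a given Windows setup still has to be checked there.

```r
# Hypothetical helper (not part of evaluate): forward the declared
# encoding of `code` to parse() instead of relying on the default.
parse_with_encoding <- function(code) {
  # Encoding() is vectorised, so reduce it to a single declared value.
  enc <- setdiff(unique(Encoding(code)), "unknown")
  # parse() only understands "latin1" and "UTF-8" marks; fall back to
  # "unknown" (its default) for mixed, "bytes", or undeclared input.
  if (length(enc) != 1L || !enc %in% c("latin1", "UTF-8")) enc <- "unknown"
  parse(text = code, encoding = enc, keep.source = FALSE)
}

# Usage: the string is explicitly marked as UTF-8 before parsing.
code <- enc2utf8("x <- '\u6cd5'; nchar(x)")
exprs <- parse_with_encoding(code)
for (e in exprs) eval(e)
```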

flying-sheep avatar Apr 11 '16 08:04 flying-sheep

This isn't the problem here: it seems that encoding makes no difference at all (in RStudio on Windows 7, 64-bit, R 3.2.x):

> s <- '"法 \\u8FDB"'
> parse(text=s, encoding = Encoding(s))
expression("<U+6CD5> \u8FDB")
> s <- '"法 \\u8FDB"'
> Encoding(s)
[1] "UTF-8"
> parse(text=s, encoding = Encoding(s))
expression("<U+6CD5> \u8FDB")

Just to show the original problem:

library(evaluate)

code <- "
x = '法'
y = '\\u8FDB'
print(nchar(x))
print(nchar(y))
print(x)
print(y)
"


l = list()
txt <- function(o, type) {
  t <- paste(o, collapse = '\n')
  l[length(l)+1] <<- t
}
oh <- new_output_handler(source = identity, 
                         text = function(o) txt(o, "text"), 
                         graphics = identity,
                         message = identity, 
                         warning = identity, 
                         error = identity, 
                         value = identity)


x <- evaluate(code, output_handler = oh)
l
> l
[[1]]
[1] "[1] 8\n" # -> the unicode char is already wrong when it gets executed

[[2]]
[1] "[1] 1\n"

[[3]]
[1] "[1] \"<U+6CD5>\"\n"

[[4]]
[1] "[1] \"<U+8FDB>\"\n" # -> And even "good" unicode chars get mangled on the way out.

jankatins avatar Apr 11 '16 09:04 jankatins

OK. code is obviously good UTF-8 since it comes from user input, right?

doesn’t that mean that using parse(text = code, encoding = Encoding(code)) creates the right object, and only the printing gets messed up?

if so, will our capture.output(print(obj)) introduce the error?

flying-sheep avatar Apr 11 '16 11:04 flying-sheep

important printing stuff (i think)

WinUTF8out and EncodeString

flying-sheep avatar Apr 11 '16 11:04 flying-sheep

> doesn’t that mean that using parse(text = code, encoding = Encoding(code)) creates the right object, and only the printing gets messed up?

No: it's already messed up because nchar(x) in the evaluate call returns 8, which happens when it sees a string like <U+6CD5>.
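To make the length argument concrete, here is a small check (the variable names are only for illustration): the escaped form is an eight-character ASCII string, while the real character is one character stored as three UTF-8 bytes.

```r
# The escape sequence R substitutes for an unsupported character
# is itself plain text: '<', 'U', '+', four hex digits, '>'.
mangled <- "<U+6CD5>"
# The actual character, written with a \u escape so the source stays ASCII.
real <- "\u6cd5"

nchar(mangled)                # 8
nchar(real, type = "chars")   # 1
nchar(real, type = "bytes")   # 3
```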

But the 4th case in the evaluate might be because of bad printing behaviour...

> if so, will our capture.output(print(obj)) introduce the error?

Seems like it at least explains the second error (the [4]): e.g. executing '\u8FDB';print('\u8FDB') in irkernel, which is equivalent to

code <- "
y = '\\u8FDB'
y
print(y)
"

and then doing capture.output(print(value)) in the value output handler

leads to this:

options(jupyter.rich_display = FALSE) # to only get print and not html
'\u8FDB'
print('\u8FDB')
[1] "<U+8FDB>"
[1] "<U+8FDB>"

-> Both are mangled, whether the print happens inside evaluate or outside it. So the problem seems to be that sink()ed connections can't handle UTF-8 on non-UTF-8 systems? But that still doesn't explain the problem "coming in".

jankatins avatar Apr 11 '16 11:04 jankatins

Knitr has the same problem:

# code cell marker broken to not confuse github...
` ``{r} 
x = '法'
y = '\u8FDB'
print(nchar(x))
print(nchar(y))
print(x)
print(y)
x
y
` ``

Outputs:

## [1] 8 -> already 8 chars when executed -> broken coming in
## [1] 1 -> escaping works for coming in
## [1] "<U+6CD5>" # -> just the broken 8 char string printed again
## [1] "<U+8FDB>" # -> printed in evaluate -> broken
## [1] "<U+6CD5>" # -> again the broken 8 char string
## [1] "<U+8FDB>" # -> printed by knit_print and broken there...

Another mention of this problem: https://stat.ethz.ch/pipermail/r-help//2014-April/373558.html (without an answer... :-()

jankatins avatar Apr 11 '16 11:04 jankatins

And this also doesn't make a difference (in RStudio, win7):

> f <- textConnection("rval2", "w", local=TRUE, encoding = "UTF-8")
> sink(f)
> print('法')
> print('\u8FDB')
> sink()
> print(rval2)
[1] "[1] \"<U+6CD5>\"" "[1] \"<U+8FDB>\""

jankatins avatar Apr 11 '16 12:04 jankatins

I once had a problem with encoding when the locale of my RStudio was not set. You can check it with Sys.getlocale(). I'm not sure if this applies here though.

expectopatronum avatar Apr 11 '16 12:04 expectopatronum

> Sys.getlocale()
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"

-> Looks ok to me (this is a plain RStudio session on win7)

The problem is that the executed code is UTF-8 (e.g. read from a file) and the output should also be UTF-8 (e.g. written to a file), so the locale should not matter?

jankatins avatar Apr 11 '16 13:04 jankatins

I think there is something fundamentally broken (related to sink()) in base R on Windows. See #59. Basically if the characters are not supported by the system native encoding, they will be silently converted to <U+XXXX> sequences.
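As an illustration of what those sequences look like on the way out, the sketch below turns `<U+XXXX>` escapes in captured output back into characters. `unescape_unicode` is a hypothetical helper, not something evaluate or knitr provides, and it is lossy: text that legitimately contains a literal `<U+6CD5>` would be rewritten too. It also does nothing for the input side of the problem.

```r
# Hypothetical post-processing helper: replace <U+XXXX> escapes in
# already-captured output with the code points they stand for.
unescape_unicode <- function(x) {
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4,8}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    # Strip the leading "<U+" and the trailing ">" to get the hex digits.
    hex <- substr(esc, 4, nchar(esc) - 1)
    vapply(hex, function(h) intToUtf8(strtoi(h, 16L)),
           character(1), USE.NAMES = FALSE)
  })
  x
}

# Usage: applied to lines as they would come back from capture.output().
fixed <- unescape_unicode(c("[1] \"<U+6CD5>\"", "[1] \"<U+8FDB>\""))
```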

yihui avatar Apr 11 '16 20:04 yihui

Any idea how to get out of this problem? This is basically a big problem in the current R kernel, because the "in" side of this problem prevents unicode code input on windows, resulting in wrong computations (e.g. probably wrong comparisons if one side comes from a file and one from a string defined in the notebook).

jankatins avatar Apr 11 '16 21:04 jankatins

No, I think this requires one to dig deep into the C code in base R, and I don't really understand that level of details.

This is a problem, but probably not as big as you imagined. This only becomes a problem when a Windows user has characters in the document that his/her Windows native character encoding does not support. I think this is relatively rare. In the above examples, as long as your Windows supports the Chinese locale, you should be fine (Sys.setlocale(, 'Chinese')). As you mentioned, knitr suffers from the same problem, but over the four years, this issue has bitten users at most three times as far as I can remember.

yihui avatar Apr 11 '16 21:04 yihui

Thank you for all your work on this. This comment is coming from someone still trying to understand the issue but I believe this is a problem for those of us who work with under-resourced languages, and linguists working with IPA data. If I've understood correctly, this issue is a problem for about half of the world's languages.

> This is a problem, but probably not as big as you imagined.

speechchemistry avatar Jun 03 '20 13:06 speechchemistry

@speechchemistry We had to wait for Windows to support UTF-8; see #59. R core has been making an effort: https://developer.r-project.org/Blog/public/2020/05/02/utf-8-support-on-windows/ Eventually we should be able to forget about character encodings on Windows. Trust me: my native language is Chinese, and I have felt the enormous pain for many years, but still have to wait.

yihui avatar Jun 04 '20 15:06 yihui

Hi, since R 4.2, R supports Unicode (UTF-8) on Windows!
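A small sketch for checking this at run time rather than by version number alone, since the session's native encoding is what matters here. The function names are hypothetical; `l10n_info()` and `.Platform` are base R.

```r
# TRUE when the current session's native encoding is UTF-8.
native_utf8 <- function() isTRUE(l10n_info()[["UTF-8"]])

# Hypothetical guard: the mangling discussed in this thread is only
# expected on Windows sessions that are not running in UTF-8.
may_mangle_unicode <- function() {
  .Platform$OS.type == "windows" && !native_utf8()
}

native_utf8()
may_mangle_unicode()
```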

flying-sheep avatar Jun 27 '22 08:06 flying-sheep