Use correct encoding
On Windows, to parse text correctly, you need to do:
parse(text = code, encoding = Encoding(code))
If the code has an encoding set, Encoding returns it. Otherwise, Encoding returns 'unknown', which is the default value for encoding in parse.
Ideally, parse_all should do the above internally instead of calling parse without an encoding.
An alternative would be to pass an encoding parameter down through evaluate → parse_all → parse.
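A minimal sketch of what the first option amounts to. The helper name `parse_with_encoding` is made up for illustration and is not part of evaluate. It collapses the per-element result of Encoding() to a single value because parse() takes one encoding for the whole input:

```r
# Hypothetical helper (not part of evaluate): parse code using the encoding
# its strings are marked with, falling back to "unknown", parse()'s default.
parse_with_encoding <- function(code) {
  # Only "latin1" and "UTF-8" are meaningful marks for parse().
  marks <- intersect(unique(Encoding(code)), c("latin1", "UTF-8"))
  # Use the mark only when it is unambiguous; ASCII-only input has no mark.
  enc <- if (length(marks) == 1L) marks else "unknown"
  parse(text = code, encoding = enc)
}

exprs <- parse_with_encoding(c("a <- 1", "b <- 'text'"))
length(exprs)
```

Whether this changes anything on a given system depends on how the input strings are marked, so treat it as a starting point for experiments rather than a fix.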
This isn't the problem here: it seems that encoding makes no difference at all (in RStudio on Windows 7, 64-bit, R 3.2.x):
> s <- '"法 \\u8FDB"'
> parse(text=s, encoding = Encoding(s))
expression("<U+6CD5> \u8FDB")
> s <- '"法 \\u8FDB"'
> Encoding(s)
[1] "UTF-8"
> parse(text=s, encoding = Encoding(s))
expression("<U+6CD5> \u8FDB")
Just to show the original problem:
library(evaluate)
code <- "
x = '法'
y = '\\u8FDB'
print(nchar(x))
print(nchar(y))
print(x)
print(y)
"
l = list()
txt <- function(o, type) {
  t <- paste(o, collapse = '\n')
  l[length(l)+1] <<- t
}
oh <- new_output_handler(source = identity,
                         text = function(o) txt(o, "text"),
                         graphics = identity,
                         message = identity,
                         warning = identity,
                         error = identity,
                         value = identity)
x <- evaluate(code, output_handler = oh)
l
> l
[[1]]
[1] "[1] 8\n" # -> the unicode char is already wrong when it gets executed
[[2]]
[1] "[1] 1\n"
[[3]]
[1] "[1] \"<U+6CD5>\"\n"
[[4]]
[1] "[1] \"<U+8FDB>\"\n" # -> And even "good" unicode chars get mangled on the way out.
OK. The code is obviously good UTF-8 since it comes from user input, right?
Doesn’t that mean that using parse(text = code, encoding = Encoding(code)) creates the right object, and only the printing gets messed up?
If so, will our capture.output(print(obj)) introduce the error?
> Doesn’t that mean that using parse(text = code, encoding = Encoding(code)) creates the right object, and only the printing gets messed up?
No: it's already messed up because nchar(x) in the evaluate call returns 8, which happens when it sees a string like <U+6CD5>.
But the 4th case in the evaluate might be because of bad printing behaviour...
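To make the "8" concrete: the substituted escape sequence is itself eight characters long, while the intended string is a single character. A small check, written with a \u escape so it does not depend on the encoding of this page:

```r
# The escape sequence that replaces the unsupported character
mangled <- "<U+6CD5>"
# The character that was actually typed, written as a Unicode escape
intended <- "\u6cd5"

nchar(mangled)   # 8
nchar(intended)  # 1
```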
> If so, will our capture.output(print(obj)) introduce the error?
It seems to at least explain the second error (the [[4]]): e.g. executing '\u8FDB';print('\u8FDB') in IRkernel, which is equivalent to
code <- "
y = '\\u8FDB'
y
print(y)
"
and then doing capture.output(print(value)) in the value output handler
leads to this:
options(jupyter.rich_display = FALSE) # to only get print and not html
'\u8FDB'
print('\u8FDB')
[1] "<U+8FDB>"
[1] "<U+8FDB>"
-> Both are mangled, no matter whether the print happens within evaluate or outside. So the problem seems to be that sink()ed connections can't handle UTF-8 on non-UTF-8 systems? But that still doesn't explain the problem "coming in".
Knitr has the same problem:
# code cell marker broken to not confuse github...
` ``{r}
x = '法'
y = '\u8FDB'
print(nchar(x))
print(nchar(y))
print(x)
print(y)
x
y
` ``
Outputs:
## [1] 8 -> already 8 chars when executed -> broken coming in
## [1] 1 -> escaping works for coming in
## [1] "<U+6CD5>" # -> just the broken 8 char string printed again
## [1] "<U+8FDB>" # -> printed in evaluate -> broken
## [1] "<U+6CD5>" # -> again the broken 8 char string
## [1] "<U+8FDB>" # -> printed by knit_print and broken there...
Another mention of this problem: https://stat.ethz.ch/pipermail/r-help//2014-April/373558.html (without an answer... :-()
And this also doesn't make a difference (in RStudio, win7):
> f <- textConnection("rval2", "w", local=TRUE, encoding = "UTF-8")
> sink(f)
> print('法')
> print('\u8FDB')
> sink()
> print(rval2)
[1] "[1] \"<U+6CD5>\"" "[1] \"<U+8FDB>\""
I once had an encoding problem when the locale of my RStudio was not set. You can check it with Sys.getlocale().
I'm not sure whether this applies here, though.
> Sys.getlocale()
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
-> Looks ok to me (this is a plain RStudio session on win7)
The problem is that the executed code is UTF-8 (e.g. read from a file) and the output should also be UTF-8 (e.g. written to a file), so the locale should not matter?
I think there is something fundamentally broken (related to sink()) in base R on Windows. See #59. Basically if the characters are not supported by the system native encoding, they will be silently converted to <U+XXXX> sequences.
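As a stopgap on the output side only, the `<U+XXXX>` sequences in captured text can be turned back into characters after the fact. This is a sketch with a made-up helper name, `restore_unicode_escapes`, not something evaluate, knitr or IRkernel is known to do from this thread. It is a heuristic: it cannot tell a substituted character from a literal `<U+6CD5>` that the user really printed, and it does nothing for the input side (nchar() still sees 8 characters).

```r
# Hypothetical helper: replace <U+XXXX> escapes in captured output with the
# characters they stand for. Heuristic and output-only.
restore_unicode_escapes <- function(x) {
  pattern <- "<U\\+([0-9A-Fa-f]{4,8})>"
  m <- gregexpr(pattern, x, perl = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    # Drop the leading "<U+" and the trailing ">" to get the hex code point
    hex <- substring(tokens, 4L, nchar(tokens) - 1L)
    vapply(strtoi(hex, base = 16L), intToUtf8, character(1))
  })
  x
}

captured <- c("[1] \"<U+6CD5>\"", "[1] \"<U+8FDB>\"", "[1] 1")
restored <- restore_unicode_escapes(captured)
nchar(restored)
```

Lines without an escape pass through unchanged, so this could be applied to every chunk of text coming out of a text or value handler.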
Any idea how to get out of this problem? This is a big problem in the current R kernel, because the "in" side of it prevents Unicode code input on Windows, resulting in wrong computations (e.g. probably wrong comparisons if one side comes from a file and the other from a string defined in the notebook).
No, I think this requires one to dig deep into the C code in base R, and I don't really understand that level of detail.
This is a problem, but probably not as big as you imagined. It only becomes a problem when a Windows user has characters in the document that his/her Windows native character encoding does not support. I think this is relatively rare. In the above examples, as long as your Windows supports the Chinese locale, you should be fine (Sys.setlocale(, 'Chinese')). As you mentioned, knitr suffers from the same problem, but over the past four years, this issue has bitten users at most three times as far as I can remember.
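One way to find out in advance whether a given document is affected is to test whether its strings survive conversion to the native encoding. A rough sketch, with a hypothetical helper name `native_can_represent`; it assumes the platform's iconv reports a failed conversion as NA rather than silently substituting a character, which may not hold everywhere:

```r
# Hypothetical helper: TRUE where a string can be converted to the session's
# native encoding, FALSE where the conversion fails.
native_can_represent <- function(x) {
  # to = "" means the encoding of the current locale; with sub = NA (the
  # default) iconv() returns NA for input it cannot convert.
  !is.na(iconv(enc2utf8(x), from = "UTF-8", to = "", sub = NA))
}

native_can_represent(c("abc", "\u6cd5"))
```

The result for the second element depends on the locale: it should be TRUE in a UTF-8 or Chinese locale and FALSE in something like German_Germany.1252.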
Thank you for all your work on this. This comment is coming from someone still trying to understand the issue but I believe this is a problem for those of us who work with under-resourced languages, and linguists working with IPA data. If I've understood correctly, this issue is a problem for about half of the world's languages.
> This is a problem, but probably not as big as you imagined.
@speechchemistry We had to wait for Windows to support UTF-8; see #59. R core has been making an effort: https://developer.r-project.org/Blog/public/2020/05/02/utf-8-support-on-windows/ Eventually we should be able to forget about character encodings on Windows. Trust me: my native language is Chinese, and I have felt the enormous pain for many years, but we still have to wait.
Hi, since R 4.2, R supports Unicode on Windows!
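To see whether a particular session benefits from this, one can ask R whether its native encoding is UTF-8. A small sketch; the function names are made up for illustration:

```r
# TRUE when the session's native encoding is UTF-8
native_is_utf8 <- function() isTRUE(l10n_info()[["UTF-8"]])

# TRUE only on Windows sessions that still use a non-UTF-8 native encoding,
# i.e. the setups where the <U+XXXX> substitution described above can occur
may_mangle_unicode <- function() {
  .Platform$OS.type == "windows" && !native_is_utf8()
}

native_is_utf8()
may_mangle_unicode()
```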