`echo` does not print emoji given escape sequence `\xf0\x9f\x98\x82`
How to reproduce
cargo run -p uu_echo -- -e '\xf0\x9f\x98\x82'
gives ð.
Expected behavior
Under Ubuntu 22.04, with /bin/echo --version:
echo (GNU coreutils) 8.32
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Brian Fox and Chet Ramey.
the output of /bin/echo -e '\xf0\x9f\x98\x82' is 😂.
I'd be glad to submit a PR.
Proposal
I believe the issue is that the parse_code function returns a char:
https://github.com/uutils/coreutils/blob/a0d258d3f29cbe6b6714b4758554dba0e84264c8/src/uu/echo/src/echo.rs#L42
It should return a u8 instead, because a single Unicode character can be encoded as several bytes. One possible solution is to maintain a 4-byte buffer and, each time a byte is read from the input, check whether the buffer holds a valid UTF-8 character using String::from_utf8:
use std::io::{self, Write};

/// A buffer used to interpret bytes as Unicode characters.
struct TryUnicodeBuffer {
    bytes: [u8; 4],
    len: usize,
}

impl TryUnicodeBuffer {
    /// Push a byte and attempt to convert the buffer into Unicode characters,
    /// which are written to `output`. Panics if the buffer is already full,
    /// which shouldn't happen normally. After `push`, it's guaranteed that the
    /// remaining bytes do not make up a valid UTF-8 character.
    fn push(&mut self, i: u8, mut output: impl Write) -> io::Result<()> { todo!() }

    /// Try to interpret the bytes starting at position `start` as a Unicode
    /// character.
    fn to_char(&self, start: usize) -> Option<char> { todo!() }

    /// Clear the remaining (invalid) bytes, replacing them with replacement
    /// characters if the buffer is not empty.
    fn clear(&mut self, mut output: impl Write) -> io::Result<()> { todo!() }

    /// Clear, then push something that can be interpreted as a Unicode
    /// character.
    fn clear_push(&mut self, i: impl Into<char>, mut output: impl Write) -> io::Result<()> { todo!() }
}
Only the print_escaped function needs to be modified.
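For illustration, the check that to_char would need can be sketched with std::str::from_utf8, a non-allocating relative of the String::from_utf8 mentioned above. The name decode_prefix and its return type are hypothetical and not part of the uutils code base; this is only a sketch of the idea, not the proposed implementation.

```rust
use std::str;

/// Hypothetical helper (not from uutils): decode one UTF-8 character from the
/// start of `bytes`. Returns the character and the number of bytes it
/// occupies, or `None` if no prefix of `bytes` is a complete, valid character.
fn decode_prefix(bytes: &[u8]) -> Option<(char, usize)> {
    // A UTF-8 encoded character is between 1 and 4 bytes long.
    for len in 1..=bytes.len().min(4) {
        if let Ok(s) = str::from_utf8(&bytes[..len]) {
            // `s` is non-empty here, so `next()` yields the decoded character.
            let c = s.chars().next()?;
            return Some((c, len));
        }
    }
    None
}
```

With this sketch, decode_prefix(&[0xf0, 0x9f, 0x98, 0x82]) gives Some(('😂', 4)), while decode_prefix(&[0xf0, 0x41]) gives None because 0x41 cannot continue the sequence started by 0xf0.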
Test cases
- The MRE of this issue: echo -e '\xf0\x9f\x98\x82' should yield 😂.
- ASCII and emoji: echo -e '\x41\xf0\x9f\x98\x82\x42' should yield A😂B.
- The emoji broken by an ASCII: echo -e '\xf0\x41\x9f\x98\x82' should yield �A���.
- Tests involving letter escape character; e.g. echo -e '\x41\xf0\c\x9f\x98\x82' should yield A� (no newline).
https://github.com/uutils/coreutils/pull/6803 should fix this. I went with a simpler fix. Since everything is being printed to stdout, which is obviously not restricted to UTF-8 data, the escape codes can just be printed out byte by byte, without trying to keep track of whether the output is valid UTF-8.
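A minimal sketch of the byte-by-byte idea, for readers following along. This is not the code from the PR: write_hex_escapes is a hypothetical helper that handles only \xHH escapes (one or two hex digits) and passes everything else through unchanged. The point is that each escape becomes exactly one raw byte on the output, with no UTF-8 validation in between.

```rust
use std::io::{self, Write};

/// Hypothetical helper (not the PR's code): expand `\xHH` escapes in `input`
/// and write the resulting raw bytes to `out`. Other escapes are not handled.
fn write_hex_escapes(input: &str, mut out: impl Write) -> io::Result<()> {
    let bytes = input.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        // Look for `\x` followed by one or two hexadecimal digits.
        if bytes[i] == b'\\' && bytes.get(i + 1) == Some(&b'x') {
            let mut value: u8 = 0;
            let mut digits = 0;
            while digits < 2 {
                let digit = bytes
                    .get(i + 2 + digits)
                    .and_then(|&b| (b as char).to_digit(16));
                match digit {
                    Some(d) => {
                        value = value * 16 + d as u8;
                        digits += 1;
                    }
                    None => break,
                }
            }
            if digits > 0 {
                // Emit the byte as-is; it need not be valid UTF-8 on its own.
                out.write_all(&[value])?;
                i += 2 + digits;
                continue;
            }
        }
        // Anything else is copied through unchanged.
        out.write_all(&[bytes[i]])?;
        i += 1;
    }
    Ok(())
}
```

Calling write_hex_escapes(r"\xf0\x9f\x98\x82", io::stdout().lock()) would write the four bytes F0 9F 98 82, which a UTF-8 terminal displays as 😂.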
Yeah, it's definitely a better fix.
It also occurred to me that, on Ubuntu 22.04, /bin/echo -e '\xf0', /bin/echo -e '\xf0\x9f', and /bin/echo -e '\xf0\x9f\x98' all display the same � (the Unicode replacement character \u{FFFD}), which at first seems to break your code. However, /bin/echo -n -e '\xf0' | wc -c, /bin/echo -n -e '\xf0\x9f' | wc -c, and /bin/echo -n -e '\xf0\x9f\x98' | wc -c print 1, 2, and 3 respectively, whereas in zsh, builtin echo -n -e '\uFFFD' | wc -c prints 3. This shows that the displayed � comes from the terminal's rendering rather than from echo itself: iTerm2.app renders the incomplete byte sequences as �, and Terminal.app renders them as ?.
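The same distinction can be reproduced in Rust. The sketch below uses String::from_utf8_lossy as a stand-in for what a terminal might do with invalid input; that is an assumption, since each terminal chooses its own rendering. The helper name rendered is hypothetical.

```rust
/// Hypothetical helper: approximate what a UTF-8 terminal might display for
/// raw bytes by replacing each invalid sequence with U+FFFD. Real terminals
/// may render invalid input differently.
fn rendered(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

Here rendered(b"\xf0"), rendered(b"\xf0\x9f"), and rendered(b"\xf0\x9f\x98") all give "\u{FFFD}", even though the inputs are 1, 2, and 3 bytes long, while "\u{FFFD}".len() is always 3. The raw bytes written by echo and the glyph shown on screen are therefore different things.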
Great! I'll close this issue.
A nice way to debug these issues is to use bat -A:
❯ echo -e '\xf0\x41\x9f\x98\x82' | bat -A
───────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ STDIN
───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ \xF0A\x9F\x98\x82␊
───────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
https://github.com/sharkdp/bat
Otherwise, yes, you can't really tell what's going on when you're dealing with weird/non-UTF-8 output.
Technically I don't think this issue should be closed until a PR resolving the bug has been merged, but I'll be checking in on my PR periodically until it's merged, so it shouldn't matter much.
Fixed in https://github.com/uutils/coreutils/pull/6803
@kkew3 thanks for reporting!