`echo` does not print emoji given escape sequence `\xf0\x9f\x98\x82`

Open kkew3 opened this issue 1 year ago • 4 comments

How to reproduce

cargo run -p uu_echo -- -e '\xf0\x9f\x98\x82'

gives 😂.

Expected behavior

Under Ubuntu 22.04, with /bin/echo --version:

echo (GNU coreutils) 8.32
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Brian Fox and Chet Ramey.

the output of /bin/echo -e '\xf0\x9f\x98\x82' is 😂.

kkew3 avatar Sep 26 '24 07:09 kkew3

I'm glad to submit a PR.

Proposal

I traced the issue to the parse_code function returning a char:

https://github.com/uutils/coreutils/blob/a0d258d3f29cbe6b6714b4758554dba0e84264c8/src/uu/echo/src/echo.rs#L42

A u8 should be returned instead, since a single Unicode char can be made up of multiple bytes. A possible solution is to maintain a 4-byte buffer and, each time a character is read from the input, check whether the buffer now holds a valid UTF-8 character, using String::from_utf8:

use std::io::{self, Write};

/// A buffer used to interpret bytes as Unicode characters.
struct TryUnicodeBuffer {
    bytes: [u8; 4],
    len: usize,
}

// Signatures only: the bodies are left as `todo!()` placeholders.
impl TryUnicodeBuffer {
    /// Push and attempt to convert the buffer into Unicode characters, which
    /// are written to `output`. Panics if the buffer is already full, which
    /// shouldn't happen normally. After `push`, it's guaranteed that the
    /// remaining bytes do not make up a valid UTF-8 character.
    fn push(&mut self, i: u8, mut output: impl Write) -> io::Result<()> { todo!() }

    /// Try to interpret the bytes starting at position `start` as a Unicode
    /// character.
    fn to_char(&self, start: usize) -> Option<char> { todo!() }

    /// Clear the remaining (invalid) bytes, writing a replacement character
    /// for them if the buffer is not empty.
    fn clear(&mut self, mut output: impl Write) -> io::Result<()> { todo!() }

    /// Clear and push something that can be interpreted as a Unicode
    /// character.
    fn clear_push(&mut self, i: impl Into<char>, mut output: impl Write) -> io::Result<()> { todo!() }
}

Only the print_escaped function needs to be modified.

Test cases

  • The MRE of this issue: echo -e '\xf0\x9f\x98\x82' should yield 😂.
  • ASCII and emoji: echo -e '\x41\xf0\x9f\x98\x82\x42' should yield A😂B.
  • The emoji broken by an ASCII character: echo -e '\xf0\x41\x9f\x98\x82' should yield �A���.
  • Tests involving a letter escape sequence, e.g. echo -e '\x41\xf0\c\x9f\x98\x82' should yield A� (no newline).
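The expected strings above can be cross-checked with a small Rust sketch. It assumes the � in the expected output is what a lossy UTF-8 decoder shows for each invalid byte sequence; `displayed` is a hypothetical helper written for this illustration, not part of uutils, and a real terminal may render invalid bytes differently.

```rust
/// Hypothetical helper: decode raw bytes lossily, replacing each invalid
/// UTF-8 sequence with U+FFFD (roughly what many terminals display).
fn displayed(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // The MRE: four bytes that form one emoji.
    assert_eq!(displayed(b"\xf0\x9f\x98\x82"), "😂");
    // ASCII on both sides of the emoji.
    assert_eq!(displayed(b"\x41\xf0\x9f\x98\x82\x42"), "A😂B");
    // The emoji broken by an ASCII byte: one U+FFFD for the lone lead byte,
    // then 'A', then one U+FFFD per stray continuation byte.
    assert_eq!(
        displayed(b"\xf0\x41\x9f\x98\x82"),
        "\u{FFFD}A\u{FFFD}\u{FFFD}\u{FFFD}"
    );
    // Output cut short by \c after the lead byte.
    assert_eq!(displayed(b"\x41\xf0"), "A\u{FFFD}");
    println!("all expected strings match");
}
```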

kkew3 avatar Sep 26 '24 14:09 kkew3

https://github.com/uutils/coreutils/pull/6803 should fix this. I went with a simpler fix. Since everything is being printed to stdout, which is obviously not restricted to UTF-8 data, the escape codes can just be printed out byte by byte, without trying to keep track of whether the output is valid UTF-8.
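As a rough illustration of the byte-by-byte idea (not the code from the PR), the sketch below expands only `\xHH` escapes and writes the resulting raw bytes to any writer without checking for valid UTF-8. The name `write_hex_escapes` is hypothetical, and the other escapes that echo -e supports are deliberately left out.

```rust
use std::io::{self, Write};

/// Hypothetical helper: expand `\xHH` escapes in `input` and write the raw
/// bytes to `out`. No UTF-8 validation is performed on the output.
fn write_hex_escapes(input: &[u8], mut out: impl Write) -> io::Result<()> {
    let mut i = 0;
    while i < input.len() {
        // Look for a backslash followed by 'x'.
        if input[i] == b'\\' && input.get(i + 1) == Some(&b'x') {
            // Collect up to two hex digits after "\x".
            let mut value: u8 = 0;
            let mut digits = 0;
            while digits < 2 {
                let digit = input
                    .get(i + 2 + digits)
                    .and_then(|b| (*b as char).to_digit(16));
                match digit {
                    Some(d) => {
                        value = value * 16 + d as u8;
                        digits += 1;
                    }
                    None => break,
                }
            }
            if digits > 0 {
                // Emit the single byte described by the escape.
                out.write_all(&[value])?;
                i += 2 + digits;
                continue;
            }
        }
        // Anything else is copied through unchanged.
        out.write_all(&[input[i]])?;
        i += 1;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut buf = Vec::new();
    write_hex_escapes(br"\xf0\x9f\x98\x82", &mut buf)?;
    // The four escaped bytes are exactly the UTF-8 encoding of the emoji.
    assert_eq!(buf, "😂".as_bytes());
    io::stdout().write_all(&buf)?;
    Ok(())
}
```

Because the writer receives plain bytes, a sequence such as `\xf0\x41` passes through untouched, and how it looks is left to whatever consumes the output.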

andrewliebenow avatar Oct 20 '24 16:10 andrewliebenow

Yeah, it's definitely a better fix.

It also occurs to me that all of these, tested on Ubuntu 22.04, /bin/echo -e '\xf0', /bin/echo -e '\xf0\x9f', /bin/echo -e '\xf0\x9f\x98', should yield the same thing (the Unicode replacement character \u{FFFD}), which at first seems to break your code. However, /bin/echo -n -e '\xf0' | wc -c, /bin/echo -n -e '\xf0\x9f' | wc -c, and /bin/echo -n -e '\xf0\x9f\x98' | wc -c print 1, 2, and 3 respectively. In contrast, (zsh) builtin echo -n -e '\uFFFD' | wc -c prints 3. This shows that the printed � could originate from the terminal's rendering: iTerm2.app renders the byte sequences as �, and Terminal.app as ?.
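The same observation can be reproduced in a small Rust sketch, assuming the terminal behaves like a lossy UTF-8 decoder. `summary` is a hypothetical helper for this illustration: it reports how many bytes were actually written and what a lossy decoder would show for them.

```rust
/// Hypothetical helper: return the raw byte count and the lossily decoded
/// text for a byte sequence.
fn summary(bytes: &[u8]) -> (usize, String) {
    (bytes.len(), String::from_utf8_lossy(bytes).into_owned())
}

fn main() {
    // Truncated prefixes of the four-byte emoji encoding.
    let prefixes: [&[u8]; 3] = [b"\xf0", b"\xf0\x9f", b"\xf0\x9f\x98"];
    for bytes in prefixes {
        let (count, shown) = summary(bytes);
        // Each truncated prefix decodes to a single U+FFFD, yet the byte
        // counts differ, matching the `wc -c` results of 1, 2 and 3.
        assert_eq!(shown, "\u{FFFD}");
        println!("{count} byte(s) shown as {shown:?}");
    }
    // A real U+FFFD character is always three bytes in UTF-8.
    assert_eq!("\u{FFFD}".len(), 3);
}
```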

Great! I'll close this issue.

kkew3 avatar Oct 21 '24 03:10 kkew3

A nice way to debug these issues is to use bat -A:

❯ echo -e '\xf0\x41\x9f\x98\x82' | bat -A  
───────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       │ STDIN
───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1   │ \xF0A\x9F\x98\x82␊
───────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

https://github.com/sharkdp/bat

Otherwise, yes, you can't really tell what's going on when you're dealing with weird/non-UTF-8 output.

Technically I don't think this issue should be closed until a PR resolving the bug has been merged, but I'll be checking in on my PR periodically until it's merged, so it shouldn't matter much.

andrewliebenow avatar Oct 21 '24 05:10 andrewliebenow

Fixed in https://github.com/uutils/coreutils/pull/6803

@kkew3 thanks for reporting!

cakebaker avatar Oct 22 '24 09:10 cakebaker