snet-cli icon indicating copy to clipboard operation
snet-cli copied to clipboard

Encoding issues

Open arturgontijo opened this issue 7 years ago • 10 comments

If I run snet in a Docker container I get an encoding error:

# snet organization info ramonduraes
Error: 'ascii' codec can't encode character '\xe3' in position 32: ordinal not in range(128)
If you want to see full Traceback then run:
snet --print-traceback [parameters]

# locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

I can set locale in Dockerfile and also force it within python:

# PYTHONIOENCODING=utf-8 snet organization info ramonduraes

Organization Name:
 - Ramon Durães

Organization Id:
 - ramonduraes

Owner:
 - 0xCEb196e0236C5B4EE62d5C87692284FBd52fCBD0

I'd like to suggest an UnicodeEncodeError handler.

One approach is using stdout.buffer, with it snet will become console's encoding independent. Example commands.py#L39:

    def _printout(self, message):
        if self.out_f is not None:
                try:
                        print(message, file=self.out_f)
                except UnicodeEncodeError:
                        sys.stdout.buffer.write(message.encode("utf-8"))

arturgontijo avatar Jan 22 '19 10:01 arturgontijo

Looks fine for me... But it should be sys.stdout.buffer.write((message + "\n").encode("utf-8")) or something like this (we need add \n)..

astroseger avatar Jan 22 '19 10:01 astroseger

was fixed by #184

astroseger avatar Jan 22 '19 16:01 astroseger

I am concerned we are introducing a problem for people. If the console doesn't support UTF-8, we shouldn't force it. I think the correct option is to configure the terminal correctly.

One way to avoid error would be to use message.encode('ascii', 'ignore'), while not forcing utf-8 encoding on a terminal that doesn't support it. This throws away characters that can't be represented.

A better way to avoid error is to use unidecode to replace undisplayable unicode characters with similar ascii characters.

ferrouswheel avatar Jan 22 '19 21:01 ferrouswheel

I am concerned we are introducing a problem for people. If the console doesn't support UTF-8, we shouldn't force it. I think the correct option is to configure the terminal correctly.

Like pointing this out for users in the README?

One way to avoid error would be to use message.encode('ascii', 'ignore'), while not forcing utf-8 encoding on a terminal that doesn't support it. This throws away characters that can't be represented.

Using "ignore" is fine but I think we also should try "utf-8" before that. Like happened to me in my Docker container, I've got "utf-8" with the new PR. With "ignore" it won't break but, in my case, I'll get empty chars while my console does support "utf-8". What do you think?

A better way to avoid error is to use unidecode to replace undisplayable unicode characters with similar ascii characters.

This should be more elegant but I think we must try to use "utf-8" before that too.

arturgontijo avatar Jan 22 '19 22:01 arturgontijo

@arturgontijo I think we should try to print utf-8 encoding by default, but in the exception handling choose one of:

  • use ignore
  • use unidecode
  • print a message saying they should set their terminal to UTF-8, and then exit
  • on the first exception, print a message saying they should set their terminal to UTF-8, then for future exceptions use ignore/unidecode.

A complete solution would be to inspect the terminal environment to decide the correct encoding, but I'm not sure how easy/complex those settings are to get right.

ferrouswheel avatar Jan 22 '19 22:01 ferrouswheel

yes. It seems we hurried with the decision. I would vote for the following solution: Intercept "UnicodeEncodeError" and do the following:

  • print a warning that terminal do not support utf-8 (in stderr with # prefix)
  • print message.encode('ascii', 'ignore'))

astroseger avatar Jan 23 '19 10:01 astroseger

What will happen if we force utf-8 characters on terminal which do not support utf-8? Maybe it was not a bad decision after all? We could force utf-8 but print a warning..

astroseger avatar Jan 23 '19 10:01 astroseger

I still think that we must "force" the utf-8 too. A lot of modern python tools rely on that premise (user's console supports it). If we dive into this the code will have multiple exceptions, example:

Console 1 (does support utf-8 but it is not set) Console 2 (doesn't support utf-8 at all)

    @staticmethod
    def _print(message, fd):
        message = str(message) + "\n"
        try:
            fd.write(message)
        except UnicodeEncodeError: # <--- Will rise for 1 and 2
            if hasattr(fd, "buffer"):
                fd.buffer.write(message.encode("utf-8")) # <---- Will trow another exception just for 2, right?
            else:
                raise

What will happen if we force utf-8 characters on terminal which do not support utf-8?

I'm kind of lost here, because I can't figure out how to simulate it.

arturgontijo avatar Jan 23 '19 11:01 arturgontijo

What will happen if we force utf-8 characters on terminal which do not support utf-8?

I'm kind of lost here, because I can't figure out how to simulate it.

I imagine that you'd get some unexpected characters or junk. But I'm unsure.

I was thinking of something like:

    utf8_warning_printed = False
    @staticmethod
    def _print(message, fd):
        try:
            fd.buffer.write(str(message).encode("utf-8") + "\n")
        except UnicodeEncodeError:
            if not utf8_warning_printed:
                fd.buffer.write("Error attempting to encode unicode! Please ensure your locale supports UTF-8 if you want to see correct character representations.")
                utf8_warning_printed = True
            fd.write(str(message).encode("ascii", "ignore") + "\n")

For "Console 1 (does support utf-8 but it is not set)" - I'm pretty sure we can't detect or fix this if the user has the wrong setting. So I think we should just warn them...

If we start trying to be clever I think we'll make things much harder to debug for ourselves (and for our users trying to configure things!).

ferrouswheel avatar Jan 23 '19 20:01 ferrouswheel

Ok, I think I've got the idea.

    @staticmethod
    def _print(message, fd):
        message = str(message) + "\n"
        try:
            fd.write(message)
        except UnicodeEncodeError:
            fd.write("Error attempting to encode unicode! Please ensure your locale supports UTF-8 if you want to see correct character representations.")
            message = message.encode("ascii", "ignore")
            fd.write(message.decode())

I'm not using utf8_warning_printed because snet-cli runs and exits at once.

About the warning, maybe we can show: "Error attempting to encode unicode! Please ensure your locale is set to support UTF-8 if you want to see correct character representations."

arturgontijo avatar Jan 24 '19 11:01 arturgontijo