Encoding issues
If I run snet in a Docker container I get an encoding error:
# snet organization info ramonduraes
Error: 'ascii' codec can't encode character '\xe3' in position 32: ordinal not in range(128)
If you want to see full Traceback then run:
snet --print-traceback [parameters]
# locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
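That empty LANG/LC_ALL is the culprit: under the POSIX locale, Python picks an ASCII codec for its output stream, so any non-ASCII character fails to encode. A minimal sketch reproducing the same class of error, using the '\xe3' ('ã') from the traceback (the name is taken from the organization output below):

```python
# Under the POSIX/C locale, snet's output stream falls back to the ascii
# codec, so a name such as "Ramon Durães" cannot be encoded.
name = "Ramon Dur\xe3es"  # '\xe3' is the character from the traceback

try:
    name.encode("ascii")
except UnicodeEncodeError as e:
    # Same failure mode: 'ascii' codec can't encode character '\xe3' ...
    print("Reproduced:", e)
```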
I can set the locale in the Dockerfile and also force it within Python:
# PYTHONIOENCODING=utf-8 snet organization info ramonduraes
Organization Name:
- Ramon Durães
Organization Id:
- ramonduraes
Owner:
- 0xCEb196e0236C5B4EE62d5C87692284FBd52fCBD0
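PYTHONIOENCODING works because it swaps the codec of the text layer over stdout; the same can be done from inside Python. A sketch using io.TextIOWrapper over an in-memory byte stream standing in for sys.stdout.buffer (on Python 3.7+, sys.stdout.reconfigure(encoding="utf-8") achieves the same on the real stream):

```python
import io

# A byte stream stands in for sys.stdout.buffer; the UTF-8 text layer on
# top of it encodes correctly regardless of what the locale says.
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="utf-8")
out.write("Ramon Dur\xe3es\n")  # no UnicodeEncodeError under any locale
out.flush()
print(raw.getvalue())  # b'Ramon Dur\xc3\xa3es\n'
```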
I'd like to suggest a UnicodeEncodeError handler. One approach is to use stdout.buffer, which makes snet independent of the console's encoding.
Example (commands.py#L39):

def _printout(self, message):
    if self.out_f is not None:
        try:
            print(message, file=self.out_f)
        except UnicodeEncodeError:
            sys.stdout.buffer.write(message.encode("utf-8"))
Looks fine to me... but it should be sys.stdout.buffer.write((message + "\n").encode("utf-8")) or something like that, since print() appends the newline and buffer.write() does not.
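Folding the newline fix in, a runnable sketch of that handler (the Printer class and its out_f attribute are hypothetical stand-ins for the real commands.py context):

```python
import sys

class Printer:
    """Hypothetical stand-in for the snet-cli command class."""

    def __init__(self, out_f=sys.stdout):
        self.out_f = out_f

    def _printout(self, message):
        if self.out_f is not None:
            try:
                print(message, file=self.out_f)
            except UnicodeEncodeError:
                # print() appends the newline itself; buffer.write() does not,
                # so it has to be added back before encoding.
                sys.stdout.buffer.write((str(message) + "\n").encode("utf-8"))
```

The except branch only triggers when out_f's codec cannot represent the message; on the happy path output goes through print() unchanged.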
was fixed by #184
I am concerned we are introducing a problem for people. If the console doesn't support UTF-8, we shouldn't force it. I think the correct option is to configure the terminal correctly.
One way to avoid the error would be to use message.encode('ascii', 'ignore'), while not forcing utf-8 encoding on a terminal that doesn't support it. This throws away characters that can't be represented.
A better way to avoid the error is to use unidecode to replace undisplayable unicode characters with similar ascii characters.
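For concreteness, the two fallbacks behave differently on the name from the example above (unidecode is a third-party package, so it is treated as optional here):

```python
name = "Ramon Dur\xe3es"

# 'ignore' silently drops the undisplayable character:
print(name.encode("ascii", "ignore").decode("ascii"))  # Ramon Dures

# unidecode transliterates it to a similar ASCII character instead
# (requires `pip install Unidecode`, hence the guard):
try:
    from unidecode import unidecode
    print(unidecode(name))  # Ramon Duraes
except ImportError:
    pass  # unidecode not installed; nothing extra to show
```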
> I am concerned we are introducing a problem for people. If the console doesn't support UTF-8, we shouldn't force it. I think the correct option is to configure the terminal correctly.
Like pointing this out for users in the README?
> One way to avoid the error would be to use message.encode('ascii', 'ignore'), while not forcing utf-8 encoding on a terminal that doesn't support it. This throws away characters that can't be represented.
Using "ignore" is fine, but I think we should also try "utf-8" before that.
As happened in my Docker container: with the new PR I got proper "utf-8" output.
With "ignore" it won't break, but in my case I'd get missing characters even though my console does support "utf-8".
What do you think?
> A better way to avoid the error is to use unidecode to replace undisplayable unicode characters with similar ascii characters.

This would be more elegant, but I think we should try "utf-8" first here too.
@arturgontijo I think we should try to print utf-8 encoding by default, but in the exception handling choose one of:
- use ignore
- use unidecode
- print a message saying they should set their terminal to UTF-8, and then exit
- on the first exception, print a message saying they should set their terminal to UTF-8, then for future exceptions use ignore/unidecode.
A complete solution would be to inspect the terminal environment to decide the correct encoding, but I'm not sure how easy/complex those settings are to get right.
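Python already exposes most of what the environment advertises, so the inspection itself is cheap; whether the advertised values are trustworthy is the open question. A small sketch (stdout_is_utf8 is a hypothetical helper, not part of snet-cli):

```python
import locale
import sys

# What Python believes the environment supports; under LANG=/LC_ALL= (POSIX)
# these typically report an ASCII codec such as 'ANSI_X3.4-1968'.
print(getattr(sys.stdout, "encoding", None))  # encoding of the attached stream
print(locale.getpreferredencoding())          # encoding derived from the locale

def stdout_is_utf8():
    # Normalize spellings like "UTF-8" / "utf8" before comparing.
    enc = (getattr(sys.stdout, "encoding", None) or "").replace("-", "").lower()
    return enc == "utf8"
```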
Yes, it seems we hurried with the decision. I would vote for the following solution: intercept UnicodeEncodeError and then:
- print a warning that the terminal does not support utf-8 (to stderr, with a "#" prefix)
- print message.encode('ascii', 'ignore')
What will happen if we force utf-8 characters on a terminal which does not support utf-8? Maybe it was not a bad decision after all? We could force utf-8 but print a warning.
I still think that we should "force" utf-8 too.
A lot of modern Python tools rely on that premise (that the user's console supports it).
If we dive into this, the code will have multiple cases to handle. Example:

Console 1 (does support utf-8, but it is not set)
Console 2 (doesn't support utf-8 at all)

@staticmethod
def _print(message, fd):
    message = str(message) + "\n"
    try:
        fd.write(message)
    except UnicodeEncodeError:  # <--- Will be raised for 1 and 2
        if hasattr(fd, "buffer"):
            fd.buffer.write(message.encode("utf-8"))  # <---- Will throw another exception just for 2, right?
        else:
            raise
> What will happen if we force utf-8 characters on a terminal which does not support utf-8?
I'm kind of lost here, because I can't figure out how to simulate it.
> What will happen if we force utf-8 characters on a terminal which does not support utf-8?
> I'm kind of lost here, because I can't figure out how to simulate it.
I imagine you'd get some unexpected characters or junk, but I'm unsure.
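One way to simulate "Console 2" without touching the real terminal is to build an ASCII-only text stream in memory: writing non-ASCII text to it raises the same UnicodeEncodeError a strict terminal would, and errors="replace" approximates the junk a lenient one would display. A sketch:

```python
import io

# An ASCII-only stream stands in for a terminal with no UTF-8 support;
# TextIOWrapper encodes on write(), so the error is raised immediately.
strict = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    strict.write("Ramon Dur\xe3es\n")
except UnicodeEncodeError as e:
    print("strict terminal fails:", e)

# errors="replace" approximates a lenient terminal printing junk ('?').
raw = io.BytesIO()
lenient = io.TextIOWrapper(raw, encoding="ascii", errors="replace")
lenient.write("Ramon Dur\xe3es\n")
lenient.flush()
print(raw.getvalue())  # b'Ramon Dur?es\n'
```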
I was thinking of something like:

utf8_warning_printed = False

@staticmethod
def _print(message, fd):
    global utf8_warning_printed
    try:
        fd.buffer.write(str(message).encode("utf-8") + b"\n")
    except UnicodeEncodeError:
        if not utf8_warning_printed:
            fd.write("Error attempting to encode unicode! Please ensure your locale supports UTF-8 if you want to see correct character representations.\n")
            utf8_warning_printed = True
        fd.buffer.write(str(message).encode("ascii", "ignore") + b"\n")
For "Console 1 (does support utf-8 but it is not set)" - I'm pretty sure we can't detect or fix this if the user has the wrong setting. So I think we should just warn them...
If we start trying to be clever I think we'll make things much harder to debug for ourselves (and for our users trying to configure things!).
Ok, I think I've got the idea.
@staticmethod
def _print(message, fd):
    message = str(message) + "\n"
    try:
        fd.write(message)
    except UnicodeEncodeError:
        fd.write("Error attempting to encode unicode! Please ensure your locale supports UTF-8 if you want to see correct character representations.\n")
        message = message.encode("ascii", "ignore")
        fd.write(message.decode())
I'm not using utf8_warning_printed because snet-cli runs and exits at once.
About the warning, maybe we can show:
"Error attempting to encode unicode! Please ensure your locale is set to support UTF-8 if you want to see correct character representations."