python-ftfy

Any idea which encoding failure could cause "beëindiging" to be printed in a letter as "beï¿œindiging"?

Open jpluimers opened this issue 3 years ago • 1 comments

Currently ftfy cannot explain this:

>>> ftfy.fix_and_explain("beï¿œindiging")
ExplainedText(text='beï¿œindiging', explanation=[])

It is a misrepresentation of "beëindiging" as "beï¿œindiging".

A long time ago I tried to find the cause myself, but couldn't: [NL] encoding blijft moeilijk, waarom toch? (dit keer in een brief van @xs4all)

Any idea?

jpluimers avatar Apr 24 '22 17:04 jpluimers

The fact that one letter is replaced by more than two characters suggests that this was not a simple case of UTF-8 data decoded incorrectly in a single pass with a single-byte character encoding.

Either:

  1. Two passes of decoding errors, one of which introduces non-printing control characters (e.g. ë -> ï¿œ <control>).
  2. One pass with a character encoding that uses 3 bytes for "ë".
  3. Something else.

I didn't look into option 2, but in your linked initial analysis you mentioned that the email received was encoded with ISO-8859-15. Looking at the mojibake characters through the lens of ISO-8859-15:

  • ï => 0xef
  • ¿ => 0xbf
  • œ => 0xbd

(In your linked analysis, "ï" is incorrectly mapped to its uppercase variant, "Ï" = 0xcf.)
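The byte values above can be double-checked with Python's standard codecs. This is only a small sketch; the variable names are illustrative, and the comparison with ISO-8859-1 is added here to show why the "œ" is consistent with ISO-8859-15 in particular.

```python
# Byte values of the three mojibake characters in ISO-8859-15 (Latin-9).
chars = ["\u00ef", "\u00bf", "\u0153"]  # ï, ¿, œ
latin9_bytes = {ch: ch.encode("iso8859-15").hex() for ch in chars}
print(latin9_bytes)

# The uppercase variant "Ï" maps to a different byte.
upper_byte = "\u00cf".encode("iso8859-15").hex()
print(upper_byte)

# In ISO-8859-1, byte 0xbd is "½" rather than "œ",
# so the same three bytes would show up as "ï¿½" there.
latin1_view = bytes.fromhex("efbfbd").decode("iso8859-1")
print(latin1_view)
```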

The byte sequence 0xef 0xbf 0xbd is the UTF-8 encoding of the replacement character (�, U+FFFD), which suggests that this mojibake was created by a pipeline similar to:

$ echo -n "ë" |
  iconv -t cp1252 | # Which encoding was used in this step is pure speculation; it could be one of several single-byte encodings, such as iso-8859-1
  uconv --from-callback substitute -f utf8 |
  iconv -f iso885915
ï¿œ

Note: the above example requires a terminal that uses UTF-8.
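For readers without iconv or uconv at hand, the same assumed pipeline can be sketched in Python with only the built-in codecs. As in the shell version, cp1252 for the first step is a guess, and `errors="replace"` stands in for uconv's substitute callback.

```python
original = "beëindiging"

# Step 1 (speculative): the text is stored in a single-byte encoding.
raw = original.encode("cp1252")

# Step 2: the bytes are wrongly decoded as UTF-8; the invalid byte 0xeb
# is substituted with the replacement character U+FFFD.
lossy = raw.decode("utf-8", errors="replace")

# Step 3: the UTF-8 bytes of that text are then read as ISO-8859-15.
mojibake = lossy.encode("utf-8").decode("iso8859-15")

print(repr(lossy))
print(mojibake)
```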

ftfy won't be able to recover such a string, because the original data is lost when the replacement character overwrites the original bytes.
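A minimal sketch of why the loss is irreversible, again assuming the cp1252 -> UTF-8 -> ISO-8859-15 chain described above (`garble` is a hypothetical helper, not part of ftfy): different source letters collapse into the same mojibake, and undoing the last step only gets back to U+FFFD.

```python
def garble(text: str) -> str:
    """Hypothetical helper replaying the assumed pipeline."""
    lossy = text.encode("cp1252").decode("utf-8", errors="replace")
    return lossy.encode("utf-8").decode("iso8859-15")

# Two different inputs end up as the same garbled output.
a = garble("beëindiging")
b = garble("beéindiging")
print(a == b)

# Reversing the final ISO-8859-15 step is possible, but it only
# yields the replacement character, not the original letter.
undone = a.encode("iso8859-15").decode("utf-8")
print(repr(undone))
```

The best any repair tool could do here is the last step: turning "ï¿œ" back into "�".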

Perdjesk avatar Feb 10 '23 16:02 Perdjesk