Any idea which encoding failure could cause "beëindiging" to be printed in a letter as "beï¿œindiging"?

jpluimers opened this issue 3 years ago • 1 comment

Currently ftfy cannot explain this:

>>> ftfy.fix_and_explain("beï¿œindiging")
ExplainedText(text='beï¿œindiging', explanation=[])

It is a misrepresentation of "beëindiging" as "beï¿œindiging".

A long time ago I tried to find the cause myself, but couldn't: [NL] encoding blijft moeilijk, waarom toch? (dit keer in een brief van @xs4all) (Dutch for "encoding remains difficult, but why? (this time in a letter from @xs4all)")

Any idea?

Apr 24 '22 17:04 jpluimers

The fact that one letter is replaced by more than two characters suggests that this was not a simple case of UTF-8 data decoded incorrectly in a single pass with a single-byte character encoding: that would turn the two UTF-8 bytes of "ë" into exactly two characters.
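As a quick check of that reasoning, here is a minimal Python sketch. The three single-byte codecs are picked only for illustration; they are an assumption, not something known about the pipeline that produced the letter.

```python
# "ë" (U+00EB) is encoded as two bytes in UTF-8: C3 AB.
utf8_bytes = "ë".encode("utf-8")

# Decode those two bytes with a few common single-byte encodings.
# The choice of codecs is illustrative only.
single_pass = {
    codec: utf8_bytes.decode(codec)
    for codec in ("cp1252", "latin-1", "iso8859_15")
}

for codec, text in single_pass.items():
    # Each result has exactly two characters, never three.
    print(codec, repr(text), len(text))
```

Since a single pass always yields two characters here, the three-character "ï¿œ" needs a different explanation.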

Either:

1. Two passes of decoding errors, with one introducing non-printed control characters (e.g. ë -> ï¿œ <control>).
2. One pass with a character encoding using 3 bytes for "ë".
3. Something else.

I didn't look into option 2, but in the initial analysis you linked, you mentioned that the email you received was encoded with ISO-8859-15. Looking at the mojibake characters through the lens of ISO-8859-15:

ï => 0xef
¿ => 0xbf
œ => 0xbd

(in your linked analysis "ï" is incorrectly mapped to its upper case variant "Ï"=0xcf).
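The mapping can be verified with Python's built-in codecs. This is only a sketch to confirm the byte values; `iso8859_15` is Python's name for ISO-8859-15, and the variable names are mine.

```python
mojibake = "ï¿œ"

# Encode the three visible characters back to bytes using ISO-8859-15.
raw = mojibake.encode("iso8859_15")
print(raw.hex())  # efbfbd

# Read those same bytes as UTF-8: they form a single code point.
decoded = raw.decode("utf-8")
print(f"U+{ord(decoded):04X}")  # U+FFFD

# The upper-case variant maps to a different byte.
upper = "Ï".encode("iso8859_15")
print(upper.hex())  # cf
```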

The byte sequence 0xEF 0xBF 0xBD is the UTF-8 encoding of the replacement character U+FFFD (�), which suggests that this mojibake was created by a pipeline similar to:

$ echo -n "ë" | 
  iconv -t cp1252 | # This step is pure speculation on which encoding was used, it could be several single byte encodings such as iso-8859-1
  uconv --from-callback substitute -f utf8 | 
  iconv -f iso885915
ï¿œ

Note: the example above requires a system that uses UTF-8 for the terminal.
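The same pipeline can be sketched in Python, which avoids the dependency on the terminal encoding. The function name `garble` and its `legacy` parameter are hypothetical, and cp1252 as the first step is as speculative here as it is in the shell version.

```python
def garble(text, legacy="cp1252"):
    """Reproduce the suspected pipeline (a sketch, not a confirmed cause)."""
    # Step 1: the text is stored in a single-byte encoding (speculative choice).
    legacy_bytes = text.encode(legacy)
    # Step 2: those bytes are wrongly read as UTF-8; invalid bytes become U+FFFD.
    replaced = legacy_bytes.decode("utf-8", errors="replace")
    # Step 3: the result is written out as UTF-8 (U+FFFD becomes EF BF BD) ...
    utf8_bytes = replaced.encode("utf-8")
    # Step 4: ... and finally displayed as if it were ISO-8859-15.
    return utf8_bytes.decode("iso8859_15")


print(garble("beëindiging"))  # beï¿œindiging
```

Passing `legacy="latin-1"` gives the same result, which is why the first step cannot be pinned down from the output alone.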

ftfy won't be able to recover such a string, because the original data is lost when the replacement character overwrites the original bytes.
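To illustrate why the loss is irreversible, the sketch below runs several different accented letters through the same substitution step. The list of letters and the cp1252 starting point are assumptions chosen for the demonstration.

```python
# Several distinct letters, each a single byte in cp1252 (assumed source encoding).
candidates = ["ë", "é", "ü", "ï"]

# Misreading each byte as UTF-8 with substitution discards the byte entirely.
results = {
    letter.encode("cp1252").decode("utf-8", errors="replace")
    for letter in candidates
}

# Every letter collapses to the same replacement character.
print(results == {"\ufffd"})  # True
```

Because all of these letters end up as the same U+FFFD, no tool can tell from "ï¿œ" alone which one was there originally; only the surrounding word lets a human guess.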

Feb 10 '23 16:02 Perdjesk