Any idea which encoding failure could cause "beëindiging" to be printed in a letter as "beï¿œindiging"?
Currently ftfy cannot explain this:
>>> ftfy.fix_and_explain("beï¿œindiging")
ExplainedText(text='beï¿œindiging', explanation=[])
It is a misrepresentation of "beëindiging" as "beï¿œindiging".
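A long time ago I tried to find the cause myself, but couldn't: "[NL] encoding blijft moeilijk, waarom toch? (dit keer in een brief van @xs4all)" (a Dutch post; the title translates to "encoding remains difficult, but why? (this time in a letter from @xs4all)").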
Any idea?
The fact that one letter is replaced by more than two suggests that this was not a simple case of UTF-8 data decoded incorrectly in a single pass with a single-byte character encoding.
Either:
- Two passes of decoding errors, one of which introduces non-printing control characters (e.g. ë -> ï¿œ <control>).
- One pass with a character encoding that uses 3 bytes for "ë".
- Something else.
I didn't look into option 2, but in the initial analysis you linked, you mentioned that the email received was encoded with ISO-8859-15. Looking at the mojibake characters through the lens of ISO-8859-15:
- ï => 0xef
- ¿ => 0xbf
- œ => 0xbd
(in your linked analysis "ï" is incorrectly mapped to its upper case variant "Ï"=0xcf).
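The byte mapping above can be checked with Python's built-in codecs. This is only a quick sanity check, assuming the three characters in the letter were exactly ï, ¿ and œ; the variable names are made up for illustration.

```python
# The three characters seen in the letter, written as escapes: ï ¿ œ
mojibake = "\u00ef\u00bf\u0153"

# Re-encode them with the charset the email was reported to use.
raw = mojibake.encode("iso-8859-15")
print(raw.hex())  # efbfbd

# Read the same bytes as UTF-8 instead.
as_utf8 = raw.decode("utf-8")
print(f"U+{ord(as_utf8):04X}")  # U+FFFD
```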
The byte sequence 0xEF 0xBF 0xBD is the UTF-8 encoding of the replacement character U+FFFD (�), which suggests that this mojibake was created by a pipeline similar to:
$ echo -n "ë" |
iconv -t cp1252 | # Pure speculation: any single-byte encoding that contains "ë", such as ISO-8859-1, would give the same result
uconv --from-callback substitute -f utf8 |
iconv -f ISO-8859-15
ï¿œ
Note: the example above requires a system that uses UTF-8 for the terminal.
ftfy won't be able to recover such a string, because the original data is lost when the replacement character overwrites the original bytes.
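To illustrate why the damage is not reversible, here is a rough Python equivalent of the speculated pipeline. The `simulate` helper is hypothetical, and cp1252 is an assumption, as in the shell example; the real system may have used a different single-byte encoding.

```python
def simulate(text: str) -> str:
    """Hypothetical reconstruction of the speculated pipeline."""
    # Step 1: the text is stored in a single-byte encoding (assumed: cp1252).
    raw = text.encode("cp1252")
    # Step 2: the bytes are misread as UTF-8; invalid bytes become U+FFFD.
    misread = raw.decode("utf-8", errors="replace")
    # Step 3: the result is written as UTF-8 but displayed as ISO-8859-15.
    return misread.encode("utf-8").decode("iso-8859-15")


print(simulate("beëindiging"))

# Different accented letters all collapse to the same three characters,
# so nothing in the output says which letter was there originally.
results = {ch: simulate(ch) for ch in "ëéü"}
for ch, out in results.items():
    print(ch, "->", out)
```

Since "ë", "é" and "ü" all end up as the same output, a tool looking only at the damaged string can at best guess the original letter from context.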