How to handle undefined conversions?
I was wondering how to substitute a replacement character for characters that have no mapping in the destination encoding, e.g. when I try to convert the euro sign (€) to Shift JIS.
In Ruby, we can do this:
"xx€xx".encode('SHIFT_JIS', 'UTF-8', undef: :replace)
=> "xx?xx"
The €, which cannot be converted, is replaced by a "?" character. This matters for text comparison; see https://unicode.org/reports/tr36/#Text_Comparison, which says:
"When converting charsets, never simply omit characters that cannot be converted; at least substitute U+FFFD (when converting to Unicode) or 0x1A (when converting to bytes) to reduce security problems."
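For reference, here is a minimal Ruby sketch of what the 0x1A recommendation could look like, using the `replace:` option of `String#encode` alongside `undef: :replace`. The helper name `to_shift_jis_sub` and the `SUB` constant are made up for illustration; this only shows the behaviour I'd like to reproduce, not a recommendation for how it must be done.

```ruby
# Sketch: substitute 0x1A (the ASCII SUB control character) instead of
# Ruby's default "?" for characters with no Shift JIS mapping.
# SUB and to_shift_jis_sub are hypothetical names used for illustration.
SUB = "\x1A"

def to_shift_jis_sub(text)
  # undef: :replace swaps in the replacement for unmappable characters;
  # replace: sets which string is used as that replacement.
  text.encode('SHIFT_JIS', 'UTF-8', undef: :replace, replace: SUB)
end

converted = to_shift_jis_sub("xx€xx")
p converted.bytes # the € position holds 26 (0x1A) instead of 63 ("?")
```

Inspecting the bytes rather than the string makes the substitution visible, since 0x1A is a non-printing control character.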
Can we do this using the iconv library in Elixir/Erlang? Currently the undefined character is simply omitted. I guess I could do the conversion character by character and check whether it returns an empty string, but I was hoping there is a more elegant way.
If anyone stumbles upon this, I'm using the following to handle the case above, though it does call :iconv.convert for every character.
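# Converts UTF-8 input to Shift JIS one code point at a time,
# substituting "?" for any character iconv cannot convert.
defp to_shift_jis(input) do
  convert = fn x ->
    case :iconv.convert("utf-8", "shift-jis", <<x::utf8>>) do
      # An empty result means the character was dropped (no mapping)
      "" -> "?"
      c -> c
    end
  end

  for <<c::utf8 <- input>>, do: convert.(c), into: ""
end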