iconv

How to handle undefined conversions?

Open • krepflap opened this issue 6 years ago • 1 comment

I was wondering how to replace characters that have no mapping in the destination encoding with a substitute character, e.g. when I try to convert the euro sign (€) to Shift JIS.

In Ruby, we can do this:

"xx€xx".encode('SHIFT_JIS', 'UTF-8', undef: :replace)
=> "xx?xx"

The €, which cannot be converted, is replaced by a "?" character. This is important when doing text comparison; see https://unicode.org/reports/tr36/#Text_Comparison, which says:

"When converting charsets, never simply omit characters that cannot be converted; at least substitute U+FFFD (when converting to Unicode) or 0x1A (when converting to bytes) to reduce security problems."
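
For comparison, Ruby's `String#encode` also takes a `replace:` option, so the substitute can be the 0x1A byte that TR36 suggests rather than the default "?". A minimal sketch of that; the helper name `to_shift_jis` and its `substitute:` parameter are made up here for illustration:

```ruby
# Hypothetical helper (name and keyword are illustrative, not from any library).
# Converts UTF-8 text to Shift JIS, substituting a chosen string for every
# character that has no mapping in Shift JIS instead of dropping it.
def to_shift_jis(text, substitute: "\x1A")
  text.encode('Shift_JIS', 'UTF-8', undef: :replace, replace: substitute)
end

# Inspect the raw bytes: the unconvertible € becomes the substitute byte.
p to_shift_jis("xx€xx").bytes

# Passing "?" gives the same result as Ruby's default replacement.
p to_shift_jis("xx€xx", substitute: "?")
```

Using 0x1A keeps the substitution distinguishable from a literal "?" in the input, which is the case TR36 is concerned about when comparing text.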

Can we do this using the iconv library in Elixir/Erlang? Currently the undefined character is omitted. I guess I could do the conversion character by character and check whether it returns an empty string, but I was hoping there is something more elegant.

krepflap, Jun 18 '19 07:06

If anyone stumbles upon this, I'm using the following to handle the case above, though it does call :iconv.convert for every character.

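  # Converts a UTF-8 binary to Shift JIS one code point at a time.
  defp to_shift_jis(input) do
    convert = fn x ->
      case :iconv.convert("utf-8", "shift-jis", <<x::utf8>>) do
        # An empty result means the character was omitted, so substitute "?"
        "" -> "?"
        c -> c
      end
    end

    # Collect the converted pieces into a single binary
    for <<c::utf8 <- input>>, do: convert.(c), into: ""
  end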

krepflap, Jun 28 '19 16:06