ip4s icon indicating copy to clipboard operation
ip4s copied to clipboard

Pure Scala IDNA implementation

Open armanbilge opened this issue 3 years ago • 11 comments

The forthcoming Scala Native cross-build is currently using icu4c for IDNA. Although this works, it introduces various build complexities (see https://github.com/scala-native/scala-native/issues/2778) and requires downstreams to have icu4c installed if they (indirectly) call IDNA related methods.

This motivates a pure Scala IDNA implementation.

  1. Does it make sense to do it in ip4s, or an external dependency (hosted where)? There is plenty of prior art to mimick. They all seem to use some form of source generator.

    • https://github.com/servo/rust-url/tree/master/idna
    • https://github.com/golang/net/tree/master/idna
    • https://github.com/kjd/idna
    • https://github.com/hexonet/idna-uts46
  2. Should we swap out the JVM and JS implementations as well?

    • Critically, the current JS implementation is broken/wrong, because it uses punycode.js without nameprep.

      The RFC specifically says:

      Implementations of IDNA MUST fully implement Nameprep and Punycode; neither Nameprep nor Punycode are optional.

      https://www.ietf.org/rfc/rfc3490.txt

      See also https://github.com/mathiasbynens/punycode.js/issues/40.

      In theory we could shell out to an npm package but dealing with npm from Scala.js is terrible. https://www.npmjs.com/package/idna-uts46

    • Meanwhile, the JVM implementation is using JDK APIs that are based on IDNA 2003 which IIUC is deprecated in favor of IDNA 2008. We could add a dependency to icu4j.

Thoughts? Since nobody is complaining about the existing broken/deprecated implementations, do we care enough to do this right?

armanbilge avatar Aug 15 '22 15:08 armanbilge

I'd love a pure Scala implementation that's used by JVM/JS/Native - preferably without any dependencies. We could do this either inline here or as a standalone zero dependency project under typelevel.

mpilquist avatar Aug 15 '22 16:08 mpilquist

http://www.unicode.org/reports/tr46/ is a good reference, including the mapping table and instructions for how to implement the conversions.

mpilquist avatar Aug 20 '22 13:08 mpilquist

I completely forgot that @isomarcte was also working on this issue for cats-uri. Did you get anywhere interesting with it? Here are some messages from Discord.

I'm starting to prototype some of the IDNA2003/IDNA2008/UTS 46 stuff on a cats-uri branch. We'll see where that goes.

Punting where ip4s integrates for a moment, I think we may want to pull out IDNA/UTS 46 code into a new micro library. It could be shared between cats-uri and ip4s.

If we want to support IDNA2008, then we might have to have a pure implementation 😢 .

We could probably target IDNA2003 and platform out, but big DNS implementations like bind are already supporting it, and not supporting it can have both correctness and security issues...(according to UTS 46). I mean IDNA2008 (which is from 2010 🤦‍♂️) is already 12 years old at this point

The security issues are very real. For example see https://github.com/libuv/libuv/pull/2046.


FWIW I have a branch that AFAIU implements "non-strict" IDNA 2008. Where by non-strict, I mean it doesn't reject invalid IDNs, but it does encode valid ones correctly to the best of my knowledge/understanding. This is fairly easy because IDNA 2008 doesn't specify a mapping step so it's basically just a Punycode implementation.

https://github.com/armanbilge/ip4s/tree/feature/idna

To make it strict we basically just need to load in this table of valid/invalid code points which would not be hard at all. https://www.unicode.org/Public/idna/idna2008derived/Idna2008-14.0.0.txt

But note that IDNA 2008 rejects all sorts of stuff. Consider:

Example.com
comcast。com
i❤.ws
☃.net

All four of those are valid under IDNA 2003 and UTS46 but rejected by IDNA 2008.

https://util.unicode.org/UnicodeJsps/idna.jsp?a=Example.com%0D%0Acomcast%E3%80%82com%0D%0Ai%E2%9D%A4.ws%0D%0A%E2%98%83.net

That's because IDNA 2008 rejects capital letters, emojis, and the (I think). Hell, there's even a real website at https://☃.net aka https://xn--n3h.net/. I have no idea what happens to it in an IDNA 2008 world.

So even though IDNA 2008 is 12 years old it just seems impractical to implement it without the UTS46 mapping step. At the very least it would be a breaking change compared to the IDNA 2003 currently used on the JVM. Moreover getting IDNs rejected for innocent stuff like capital letters seems extremely annoying (although IDNA 2008 has good reasons).


As @mpilquist points out implementing UTS46 is more-or-less straightforward algorithmically, it just requires loading a bunch of codepoints and mappings and following the steps. This requires some cleverness for both the source-generator (so as not to generate huge bytecode / JS sources) and for the in-memory data structure (so that its memory/time efficient).

I spent a stupid amount of time last week studying the UTS46 code-generators (all written in Python) for the various languages I linked above and trying to prototype my own. FTR If we just tweak one of those Python scripts to make Scala instead of JavaScript or whatever it's probably not too bad.

I got a bit tangled up trying to rewrite the generator in Scala (so it can be part of the sbt build) and trying to exorcise all the mutation from it. I also became frustrated because still all those languages rely on additional unicode data and normalizing algorithms that we currently don't have on Scala Native.

armanbilge avatar Aug 23 '22 23:08 armanbilge

Ah, this seems like a relevant branch: https://github.com/isomarcte/cats-uri/tree/bootstring

armanbilge avatar Aug 24 '22 00:08 armanbilge

:wave: Yeah, after bootstring it seems pretty simple to get to IDNA2003/UTS46/IDNA2008.

The bootstring implementation on that branch works, but needs a cleanup pass and some benchmarks. I was waiting to discuss this until I had a full IDNA2003/UTS46/IDNA2008 implementation done, but I don't think bootstring/IDNA should live in cats-uri. I was thinking either, two repos, one for bootstring one for IDNA/UTS46 or just one repo, maybe typelevel/idna and we could have bootstring and idna modules. Of the two, I like the latter option of just having a single new repo that both cats-uri and ip4vs can use. Technically bootstring is a general purpose algorithm, but I don't know why anyone would use it for anything other than IDNA (I'm personally aware of no other uses.).

FWIW I have a branch that AFAIU implements "non-strict" IDNA 2008. Where by non-strict, I mean it doesn't reject invalid IDNs, but it does encode valid ones correctly to the best of my knowledge/understanding. This is fairly easy because IDNA 2008 doesn't specify a mapping step so it's basically just a Punycode implementation.

https://github.com/armanbilge/ip4s/tree/feature/idna

@armanbilge I don't know how far along this is or how it compares to what I have on the cats-uri branch. How do you want to proceed here?

FWIW, cats-uri is reasonably far along, but blocked by issues in case-insensitive. I think we've got those mostly fixed, I've just not been sure what to do about the glob matcher. I'm not certain a correct implementation of this can exist, as it assumes that any two caseless unicode strings have the same UTF-16 char length, which is not true (big old quagmire in there). If we were okay dropping the globbing, I could get case-insensitive 2.x.x wrapped pretty quick.

Thoughts?

isomarcte avatar Aug 26 '22 15:08 isomarcte

@isomarcte thank you so much for chiming in! That's fantastic news :)

I don't know how far along this is or how it compares to what I have on the cats-uri branch. How do you want to proceed here?

I just did a dumb Java -> Scala translation of the Punycode implementation in icu4j. What you have is better ... much better 😆

With that said, I think creating a typelevel/idna repo that can be shared makes sense to me. I can set this up and migrate your bootstring work, how does that sound?

I was waiting to discuss this until I had a full IDNA2003/UTS46/IDNA2008 implementation done

I don't think we need IDNA 2003 really. UTS46+IDNA 2008 are good though. So sounds like you are planning to work on this? No rush, just happy someone has a plan for it :)


FWIW, cats-uri is reasonably far along

This is great to hear, thanks for all your work on that project :) regarding the glob issues, would you mind opening an issue on case-insensitive about it?

armanbilge avatar Aug 26 '22 15:08 armanbilge

I like the way the plan is shaping up! If we have a typelevel/idna repo, would we need anything in ip4s to integrate with it? Or would we just deprecate com.comcast.ip4s.IDN in favor of some new type in typelevel/idna?

mpilquist avatar Aug 26 '22 16:08 mpilquist

I was imagining that typelevel/idna wouldn't introduce any new datatypes but provide implementations of the toAscii and toUnicode methods. In which case we would keep the IDN type here, just swap out the implementations here.

https://github.com/Comcast/ip4s/blob/d84a66250630cbc4d99906462fadc967bc9165ad/jvm/src/main/scala/com/comcast/ip4s/IDNCompanionPlatform.scala#L22-L26

But maybe @isomarcte has something else in mind? :)

armanbilge avatar Aug 26 '22 16:08 armanbilge

Oh I see, okay!

mpilquist avatar Aug 26 '22 16:08 mpilquist

My general approach has been to have both pure functions on the underlying type, as you have above @armanbilge, and also to have newtypes, e.g.

sealed abstract class IDNA2008 {
  def asUnicode: String
  def asAscii: String
}

object IDNA2008 {
  def fromAscii(value: String): Either[String, IDNA2008] = decode(value)
  def fromUnicode(value: String): Either[String IDNA2008] = encode(value)
  def toAscii(unicode: String): Either[String, String] = decodeRaw(value)
  def toUnicode(ascii: String): Either[String, String] = encodeRaw(value)
}

Something like that, and then let the user's decide how deep they want to get into newtypes. So I think it should be easy to integrate it.

isomarcte avatar Aug 26 '22 16:08 isomarcte

I extracted a new repository from @isomarcte's bootstring branch and got it publishing snapshots.

https://github.com/typelevel/idna4s (apologies the "4s" just slipped out 😜 )

Everything subject to change/bikeshed of course :) I wanted to get something up so that we have some options for cross-building ip4s to Scala Native.

I think as a short-term solution we could "implement" IDN on Native just Punycode ... this is what the JS implementation is currently doing here /shrug. I'm no expert but I don't think there are security issues: IIUC without the mapping step, some otherwise valid IDNs would be transformed to ASCII that are invalid/impossible domains. Bad UX though.

Furthermore, there aren't real bincompat concerns, considering:

  1. low-adoption of Native
  2. unlikeliness of idna4s to be involved in a triangle at this early stage
  3. that the Native ecosystem will break in the near future anyway with the 0.5.x release

armanbilge avatar Aug 26 '22 23:08 armanbilge