reboot icon indicating copy to clipboard operation
reboot copied to clipboard

Emoji character in domains not recognized correctly (punycode)

Open redsk opened this issue 5 years ago • 8 comments

val u = "https://i❤.ws/"
url(u).url

leads to

java.net.URISyntaxException: Expected authority at index 7: http://
  java.net.URI$Parser.fail(URI.java:2848)
  java.net.URI$Parser.failExpecting(URI.java:2854)
  java.net.URI$Parser.parseHierarchical(URI.java:3102)
  java.net.URI$Parser.parse(URI.java:3053)
  java.net.URI.<init>(URI.java:588)
  dispatch.RawUri.toUri(uri.scala:23)
  dispatch.RawUri.toString(uri.scala:24)
  dispatch.url$.$anonfun$apply$3(requests.scala:102)
  dispatch.Req.requestBuilder$1(requests.scala:48)
  dispatch.Req.toRequestBuilder(requests.scala:53)
  dispatch.Req.toRequest(requests.scala:61)
  dispatch.UrlVerbs.url(requests.scala:125)
  dispatch.UrlVerbs.url$(requests.scala:125)
  dispatch.Req.url(requests.scala:17)

Its punycode [1] conversion [2] (https://xn--i-7iq.ws/) works as expected.

https://i❤.ws/ points to a domain registrar and is SFW.

"https://i❤.ws/" visualization when not between backticks.

Opening a new issue because I cannot reopen an issue closed by a collaborator.

[1] https://en.wikipedia.org/wiki/Punycode [2] used converter: https://www.punycoder.com/

redsk avatar May 04 '20 14:05 redsk

In case you're wondering, java.net.IDN.toASCII(u) fails with

java.lang.IllegalArgumentException: java.text.ParseException: An unassigned code point was found in the input

as it only supports IDNA2003. For emoji IDNA2008 is needed and the icu4j library [1] supports it:

import com.ibm.icu.text.IDNA
val uts46 = IDNA.getUTS46Instance(IDNA.DEFAULT)

val u = "i❤.ws/"
val punycodedDomain = uts46.nameToASCII(u, new java.lang.StringBuilder(), new IDNA.Info()).toString
// punycodedDomain == "xn--i-7iq.ws/"

Apparently, nameToASCII works with domain names and paths but not with protocol, that has to be stripped off before the conversion.

[1] https://mvnrepository.com/artifact/com.ibm.icu/icu4j

redsk avatar May 04 '20 16:05 redsk

I've given a more thorough explanation of the same issue here along with a PR. Maybe you'd like to take the same approach.

redsk avatar May 06 '20 02:05 redsk

icu4j is using the ICU license. This seems to be a tweaked version of the X11 License which is compatible with our own LGPL.

farmdawgnation avatar May 25 '20 16:05 farmdawgnation

Yeah as you wrote, they use the ICU license which is deemed compatible with GPL. So it should be ok, no?

redsk avatar May 25 '20 16:05 redsk

Yeah, that's good. I was just making a note here as I worked through figuring out how I want to integrate it. No action needed.

farmdawgnation avatar May 25 '20 16:05 farmdawgnation

Ok, this is a lot more complex than I originally thought looking at the bug report. TL;DR there's no good way for us to support this type of encoding from the url helper because we lean heavily on the URL parsing provided by the JVM by default. That, in turn, is based on a regex that doesn't properly support this type of encoding.

The good news is that there's an existing API which will handle this correctly: host:

@ host("i❤.ws").url
res38: String = "http://xn--i-7iq.ws/"

This API works a bit differently because it doesn't accept a full URL, but that's the same thing that makes the IDN conversion work as expected. For example, to get https you would use:

@ host("i❤.ws").secure.url
res40: String = "https://xn--i-7iq.ws/"

In order to support this from the url API we'd need to re-implement breaking up URLs into their component parts for parsing. I'm willing to investigate that, but it's a much larger project because we've got to be sure we don't incidentally break something else.

Does this unblock your use case for the time being?

farmdawgnation avatar May 25 '20 17:05 farmdawgnation

Well, if I understand correctly, the host API would work only for hosts and not for full URLs which of course would be rather inconvenient as the parsing would be needed to be done by the caller.

My use case is not blocked as I use a different library to detect URLs in strings and the library can do URL normalization, including using ICU (I did the PR), so I simply pass dispatch the normalized URL.

I created this issue because I believe that other users of Dispatch might have the same problem and because I think this library should be able to handle URL normalization properly (not only emojis, also other characters, as illustrated here), which the current url API does not.

redsk avatar May 27 '20 19:05 redsk

Well, if I understand correctly, the host API would work only for hosts and not for full URLs which of course would be rather inconvenient as the parsing would be needed to be done by the caller.

How inconvenient this is largely depends on the application.

I created this issue because I believe that other users of Dispatch might have the same problem and because I think this library should be able to handle URL normalization properly (not only emojis, also other characters, as illustrated here), which the current url API does not.

Yep, I hear you. I'm not closing the issue, just pointing out that this isn't going to be a quick, drop-in fix like I originally thought. This will take some time to get right.

farmdawgnation avatar May 27 '20 19:05 farmdawgnation