Replace the form decoder with URLCodec
-
This addresses the decoding part of the following issue: https://github.com/ring-clojure/ring/issues/269. It would fix my problem, but it does not address the other parts mentioned there (the encoder and a custom percent encoder).
-
Would appreciate feedback on whether I should address the other concerns, or whether this change could be merged in on its own.
Hm, it looks like the codec library encodes/decodes UTF-16 differently than the Java one. (Long live testing!) Perhaps this is a side effect to consider, and we should instead allow an optional encoder/decoder function as an input parameter. What do you think?
```
FAIL in (test-form-decode) (codec.clj:61)
expected: (= (form-decode "a=foo%FE%FF%00%2Fbar" "UTF-16") {"a" "foo/bar"})
  actual: (not (= {"�" "景濾⽢慲"} {"a" "foo/bar"}))
```
Yes, I noticed that as well when I tried to fix it, and it's not just UTF-16 that it encodes differently.
Java treats every non-percent-encoded character as literal, while Commons Codec treats them all as a series of bytes. So for instance, take the previous Shift-JIS encoded example you used:
"%83%82%83W%83o%83P%83R%83%8F%83C"
Notice that there's a "W", "o", "P", "R", and "C" in there. Commons Codec says "treat them as ASCII bytes", whereas Java says "treat them as literal characters". Now, that makes me think Java is correct and Commons Codec is wrong, but you mentioned that browsers encode the same way as Commons Codec?
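To make the difference concrete, here is a small REPL sketch comparing the two decoders on that string (assuming Commons Codec is on the classpath; the Java result is garbled, so it's only described in a comment rather than asserted):

```clojure
(import '[org.apache.commons.codec.net URLCodec])

;; Commons Codec treats the literal "W", "o", "P", "R", "C" as raw bytes
;; in the same stream as the %XX escapes, so the two-byte Shift-JIS
;; sequences survive intact:
(.decode (URLCodec.) "%83%82%83W%83o%83P%83R%83%8F%83C" "Shift-JIS")
;;=> "モジバケコワイ"

;; java.net.URLDecoder only turns consecutive %XX runs into bytes; the
;; literal characters pass through as characters, which splits the
;; two-byte Shift-JIS pairs apart and garbles the result:
(java.net.URLDecoder/decode "%83%82%83W%83o%83P%83R%83%8F%83C" "Shift-JIS")
```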
Is there a site I can check the Shift-JIS encoding at? Otherwise I guess I'll build a small handler with the right encoding.
Found one! http://d.hatena.ne.jp/keywordmobile/
With the example under discussion. http://d.hatena.ne.jp/keywordsearchmobile?word=%83%82%83W%83o%83P%83R%83%8F%83C
My perception tells me that browsers encode the Commons Codec way (tested with Chrome on a Windows machine). Let me know if you want more Japanese text.
And as far as the "correctness" of Java vs. Codec goes... I heard that the relevant RFC does not cover this, which is why there is room for multi-byte strings to be garbled depending on the implementation.
Reading the percent-encoding section makes me think so too, but on the other hand my grasp of the spec is limited...
https://tools.ietf.org/html/rfc3986#page-12
https://github.com/nwjs/chromium.src/blob/df7f8c8582b9a78c806a7fa1e9d3f3ba51f7a698/third_party/WebKit/Source/platform/weborigin/KURL.cpp#L628
https://github.com/nwjs/chromium.src/blob/df7f8c8582b9a78c806a7fa1e9d3f3ba51f7a698/url/url_util.cc#L646
The Chromium encode flow is:
- convert the string to a UTF-8 string
- if a char is ASCII, append it to the output
- if a char is non-ASCII, convert each of its bytes to %XX hex and append them to the output

So Shift-JIS and UTF-16 encodings should not be an issue, since the string should be converted to a UTF-8 string before encoding?
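Roughly, as a hypothetical Clojure sketch (my own helper, not Chromium's or Ring's actual code):

```clojure
(defn chromium-style-encode
  "Sketch of the encode flow above: re-encode the string to UTF-8,
  append ASCII bytes literally, and %XX-encode every other byte.
  (A real encoder would also escape reserved ASCII chars; omitted.)"
  [^String s]
  (apply str
         (for [b (.getBytes s "UTF-8")
               :let [b (bit-and b 0xff)]]  ; JVM bytes are signed
           (if (< b 0x80)
             (char b)                      ; ASCII: pass through
             (format "%%%02X" b)))))      ; non-ASCII: percent-encode

;; (chromium-style-encode "モジバケコワイ") would yield UTF-8 %XX escapes,
;; regardless of how the source text was originally encoded.
```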
I guess the decode flow should be just the reverse?
- assume the string is an ASCII string
- if a char is not %, append it to the output
- if a char is %, capture the next two ASCII chars as one hex byte and append that byte to the output
- convert the output from a UTF-8 byte stream to a UTF-16 string (a Java string)
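Something like this hypothetical sketch (names are mine, just to pin the idea down):

```clojure
(defn chromium-style-decode
  "Sketch of the reverse flow: fold literal chars and %XX escapes into
  one byte stream, then decode it as UTF-8 into a Java (UTF-16) string."
  [^String s]
  (let [out (java.io.ByteArrayOutputStream.)]
    (loop [i 0]
      (when (< i (.length s))
        (if (= (.charAt s i) \%)
          ;; capture the next two hex chars as one byte
          (do (.write out (Integer/parseInt (subs s (inc i) (+ i 3)) 16))
              (recur (+ i 3)))
          (do (.write out (int (.charAt s i)))
              (recur (inc i))))))
    (String. (.toByteArray out) "UTF-8")))
```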
But if that's the case we could just decode with UTF-8. The string "%83%82%83W%83o%83P%83R%83%8F%83C" represents a series of bytes encoded in Shift-JIS, not UTF-8.
```clojure
user=> (.decode (URLCodec.) "%83%82%83W%83o%83P%83R%83%8F%83C" "Shift-JIS")
"モジバケコワイ"
user=> (.decode (URLCodec.) "%83%82%83W%83o%83P%83R%83%8F%83C" "UTF-8")
"���W�o�P�R���C"
```
@weavejester
I think adding support for a pluggable decoder is the best way to go, for these reasons:
- It can be done without changing the existing decode behaviour; changing that behaviour could cause unexpected breakage somewhere on the planet with some exotic charset.
- It avoids the debate over which decoder/encoder implementation is correct, by letting the user make the choice.
Here is what I have cooked up as an idea: https://github.com/iku000888/alt-params-middleware/. I am thinking of testing it in my production app to make sure it works as expected.
Would greatly appreciate your feedback on this idea. If you like it, it would involve a patch to ring.util.codec and another to ring.middleware.params.
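For concreteness, the shape I have in mind is roughly this (hypothetical names; a sketch, not the actual alt-params-middleware code):

```clojure
(require '[clojure.string :as str])

(defn form-decode-with
  "Hypothetical variant of ring.util.codec/form-decode that takes a
  pluggable decode fn of [s encoding] instead of hard-coding one."
  [body encoding decode]
  (into {}
        (for [pair (str/split body #"&")
              :let [[k v] (str/split pair #"=" 2)]]
          [(decode k encoding) (decode (or v "") encoding)])))

;; Callers choose the implementation:
;; (form-decode-with "a=foo%2Fbar" "UTF-8"
;;                   #(java.net.URLDecoder/decode %1 %2))
;; (form-decode-with "a=foo%2Fbar" "UTF-8"
;;                   #(.decode (org.apache.commons.codec.net.URLCodec.) %1 %2))
```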
I don't think the encoder should be pluggable; otherwise we might as well just use two libraries instead. If Chrome and other browsers decode in the same way as Commons Codec does, we should adopt that form.
@weavejester Alright, then! I have updated the failing UTF-16 decoding test case data to conform to how Apache Commons Codec encodes "a" and "foo/bar". Am I correct that I should also go ahead and replace the encoder fn? Is there anything else to update?
The url-encode and url-decode functions need to be entirely rewritten as well.
@weavejester With the commons implementation?
@weavejester So I just went ahead and replaced the implementations of url-encode and url-decode using URLCodec. Please let me know if this is not what was asked.
The tests are failing in Travis, but they run and pass just fine on my machine with JDK 8. Would you know how to reconcile this, assuming I implemented everything correctly?
> With the commons implementation?
No, they need to be rewritten to handle encoding in the same way. We still need to differentiate between form encoding and URL encoding, so I don't think we can use Commons Codec.
In practice it means we'll probably need to set up a manual loop and recur, and percent-encode and url-encode will wind up looking very similar.
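As a rough illustration of what that loop-and-recur shape could look like (my own sketch, not the committed implementation; it ignores surrogate pairs for brevity):

```clojure
(defn percent-encode-sketch
  "Walks the string with loop/recur, passing safe chars through and
  %XX-encoding every byte of unsafe chars in the given encoding."
  [^String s ^String encoding unsafe?]
  (let [sb (StringBuilder.)]
    (loop [i 0]
      (if (< i (.length s))
        (let [c (.charAt s i)]
          (if (unsafe? c)
            (doseq [b (.getBytes (str c) encoding)]
              (.append sb (format "%%%02X" (bit-and b 0xff))))
            (.append sb c))
          (recur (inc i)))
        (str sb)))))

;; url-encode and form-encode would then differ mostly in which
;; `unsafe?` predicate they pass in.
```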
When I have time I'll take a look at this and write an updated implementation. If you want to you can have a bash at it yourself, but it'll be more complex than just substituting in the Commons Codec.
Thanks for the feedback; it totally makes sense. I might slowly tackle it over the next couple of weeks...