Replace the form decoder with URLCodec
-
This addresses the decoding part of the following issue: https://github.com/ring-clojure/ring/issues/269. It would fix my problem, but it does not address the other parts mentioned there (the encoder and a custom percent encoder).
-
Would appreciate feedback on whether I should address the other concerns, or whether this change could be merged in on its own.
Hm, it looks like the codec library encodes/decodes UTF-16 differently than the Java one. (Long live testing!) Perhaps this is a side effect to consider, and we should instead allow an optional encoder/decoder function as an input parameter. What do you think?
```
FAIL in (test-form-decode) (codec.clj:61)
expected: (= (form-decode "a=foo%FE%FF%00%2Fbar" "UTF-16") {"a" "foo/bar"})
  actual: (not (= {"�" "景濾⽢慲"} {"a" "foo/bar"}))
```
Yes, I noticed that as well when I tried to fix it, and it's not just UTF-16 that it encodes differently.
Java treats every non-percent-encoded character as literal, while Commons Codec treats them all as a series of bytes. So for instance, take the previous Shift-JIS encoded example you used:
"%83%82%83W%83o%83P%83R%83%8F%83C"
Notice that there's a "W", "o", "P", "R", and "C" in there. Commons Codec says "treat them as ASCII bytes", whereas Java says "treat them as literal characters". Now, that makes me think Java is correct and Commons Codec is wrong, but you mentioned that browsers encode the same way as Commons Codec?
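To make the difference concrete, here is a small REPL sketch comparing the two decoders on that string (assuming Commons Codec is on the classpath; the Java result is garbled, so it's only described in a comment rather than asserted):

```clojure
(import '[org.apache.commons.codec.net URLCodec])

;; Commons Codec treats the literal "W", "o", "P", "R", "C" as raw bytes
;; in the same stream as the %XX escapes, so the two-byte Shift-JIS
;; sequences survive intact:
(.decode (URLCodec.) "%83%82%83W%83o%83P%83R%83%8F%83C" "Shift-JIS")
;;=> "モジバケコワイ"

;; java.net.URLDecoder only turns consecutive %XX runs into bytes; the
;; literal characters pass through as characters, which splits the
;; two-byte Shift-JIS pairs apart and garbles the result:
(java.net.URLDecoder/decode "%83%82%83W%83o%83P%83R%83%8F%83C" "Shift-JIS")
```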
Is there a site I can check the Shift-JIS encoding at? Otherwise I guess I'll build a small handler with the right encoding.
Found one! http://d.hatena.ne.jp/keywordmobile/
With the example under discussion. http://d.hatena.ne.jp/keywordsearchmobile?word=%83%82%83W%83o%83P%83R%83%8F%83C
My perception tells me that browsers encode the Commons Codec way (tested with Chrome on a Windows machine). Let me know if you want more Japanese text.
And as far as the "correctness" of Java vs. Codec goes... I heard that the relevant RFC does not cover this, which is why there is room for multi-byte strings to be garbled depending on the implementation.
Reading the percent-encoding section makes me think so too, but on the other hand my grasp of the spec is limited...
https://tools.ietf.org/html/rfc3986#page-12
https://github.com/nwjs/chromium.src/blob/df7f8c8582b9a78c806a7fa1e9d3f3ba51f7a698/third_party/WebKit/Source/platform/weborigin/KURL.cpp#L628
https://github.com/nwjs/chromium.src/blob/df7f8c8582b9a78c806a7fa1e9d3f3ba51f7a698/url/url_util.cc#L646
The Chromium encode flow is:
- convert the string to a UTF-8 string
- if a char is ASCII, append it to the output
- if a char is non-ASCII, convert each of its bytes to %XX hex and append them to the output

So Shift-JIS and UTF-16 encodings should not be an issue, since the string should be converted to a UTF-8 string before encoding?
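Roughly, as a hypothetical Clojure sketch (my own helper, not Chromium's or Ring's actual code):

```clojure
(defn chromium-style-encode
  "Sketch of the encode flow above: re-encode the string to UTF-8,
  append ASCII bytes literally, and %XX-encode every other byte.
  (A real encoder would also escape reserved ASCII chars; omitted.)"
  [^String s]
  (apply str
         (for [b (.getBytes s "UTF-8")
               :let [b (bit-and b 0xff)]]  ; JVM bytes are signed
           (if (< b 0x80)
             (char b)                      ; ASCII: pass through
             (format "%%%02X" b)))))      ; non-ASCII: percent-encode

;; (chromium-style-encode "モジバケコワイ") would yield UTF-8 %XX escapes,
;; regardless of how the source text was originally encoded.
```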
I guess the decode flow should be just the reverse?
- assume the string is an ASCII string
- if a char is not %, append it to the output
- if a char is %, capture the next two ASCII chars as one hex byte and append that byte to the output
- convert the output from a UTF-8 byte stream to a UTF-16 string (a Java string)
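Something like this hypothetical sketch (names are mine, just to pin the idea down):

```clojure
(defn chromium-style-decode
  "Sketch of the reverse flow: fold literal chars and %XX escapes into
  one byte stream, then decode it as UTF-8 into a Java (UTF-16) string."
  [^String s]
  (let [out (java.io.ByteArrayOutputStream.)]
    (loop [i 0]
      (when (< i (.length s))
        (if (= (.charAt s i) \%)
          ;; capture the next two hex chars as one byte
          (do (.write out (Integer/parseInt (subs s (inc i) (+ i 3)) 16))
              (recur (+ i 3)))
          (do (.write out (int (.charAt s i)))
              (recur (inc i))))))
    (String. (.toByteArray out) "UTF-8")))
```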
But if that's the case we could just decode with UTF-8. The string "%83%82%83W%83o%83P%83R%83%8F%83C" represents a series of bytes encoded in Shift-JIS, not UTF-8.
```clojure
user=> (.decode (URLCodec.) "%83%82%83W%83o%83P%83R%83%8F%83C" "Shift-JIS")
"モジバケコワイ"
user=> (.decode (URLCodec.) "%83%82%83W%83o%83P%83R%83%8F%83C" "UTF-8")
"���W�o�P�R���C"
```
@weavejester
I think adding support for a pluggable decoder is the best way to go, for these reasons:
- It can be done without changing the existing decode behaviour; changing that behaviour could cause unexpected breakage somewhere on the planet with some exotic charset.
- It avoids the debate over which decoder/encoder implementation is correct, by letting the user make the choice.
Here is what I have cooked up as an idea: https://github.com/iku000888/alt-params-middleware/. I am thinking of testing it in my production app to make sure it works as expected.
Would greatly appreciate your feedback on this idea. If you like it, it would involve a patch to ring.util.codec and another to ring.middleware.params.
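For concreteness, the shape I have in mind is roughly this (hypothetical names; a sketch, not the actual alt-params-middleware code):

```clojure
(require '[clojure.string :as str])

(defn form-decode-with
  "Hypothetical variant of ring.util.codec/form-decode that takes a
  pluggable decode fn of [s encoding] instead of hard-coding one."
  [body encoding decode]
  (into {}
        (for [pair (str/split body #"&")
              :let [[k v] (str/split pair #"=" 2)]]
          [(decode k encoding) (decode (or v "") encoding)])))

;; Callers choose the implementation:
;; (form-decode-with "a=foo%2Fbar" "UTF-8"
;;                   #(java.net.URLDecoder/decode %1 %2))
;; (form-decode-with "a=foo%2Fbar" "UTF-8"
;;                   #(.decode (org.apache.commons.codec.net.URLCodec.) %1 %2))
```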
I don't think the encoder should be pluggable; otherwise we might as well just use two libraries instead. If Chrome and other browsers decode in the same way as Commons Codec does, we should adopt that form.
@weavejester Alright, then! I have updated the failing UTF-16 decoding test case data to conform to how Apache Commons Codec encodes "a" and "foo/bar". Am I correct that I should also go ahead and replace the encoder fn? Is there anything else to update?
The url-encode and url-decode functions need to be entirely rewritten as well.
@weavejester With the commons implementation?
@weavejester So I just went ahead and replaced the implementations of url-encode and url-decode using URLCodec. Please let me know if this is not what was asked.
The tests are failing in Travis, but they run and pass just fine on my machine with JDK 8. Would you know how to reconcile this, assuming I implemented everything correctly?
> With the commons implementation?
No, they need to be rewritten to handle encoding in the same way. We still need to differentiate between form encoding and URL encoding, so I don't think we can use Commons Codec.
In practice it means we'll probably need to set up a manual loop and recur, and percent-encode and url-encode will wind up looking very similar.
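As a rough illustration of what that loop-and-recur shape could look like (my own sketch, not the committed implementation; it ignores surrogate pairs for brevity):

```clojure
(defn percent-encode-sketch
  "Walks the string with loop/recur, passing safe chars through and
  %XX-encoding every byte of unsafe chars in the given encoding."
  [^String s ^String encoding unsafe?]
  (let [sb (StringBuilder.)]
    (loop [i 0]
      (if (< i (.length s))
        (let [c (.charAt s i)]
          (if (unsafe? c)
            (doseq [b (.getBytes (str c) encoding)]
              (.append sb (format "%%%02X" (bit-and b 0xff))))
            (.append sb c))
          (recur (inc i)))
        (str sb)))))

;; url-encode and form-encode would then differ mostly in which
;; `unsafe?` predicate they pass in.
```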
When I have time I'll take a look at this and write an updated implementation. If you want to you can have a bash at it yourself, but it'll be more complex than just substituting in the Commons Codec.
Thanks for the feedback; it totally makes sense. I might slowly tackle it over the next couple of weeks...