Strip spaces in canonicalize_url
Fixes #132
This definitely works for blank spaces. However, in other cases, such as a URL that starts with a colon, we interpret it as a relative path, whereas other tools, like cURL, report an error: curl: (3) Bad URL, colon is first character (since you may simply be missing the scheme). What do you think @Gallaecio? Should this be addressed or not?
If it does not work in a web browser (and I don’t think it does), I don’t think we need to aim to support it either.
Codecov Report
Merging #136 (849bae1) into master (4ba3539) will increase coverage by 0.01%.
The diff coverage is 100.00%.
@@ Coverage Diff @@
## master #136 +/- ##
==========================================
+ Coverage 95.96% 95.98% +0.01%
==========================================
Files 6 6
Lines 471 473 +2
Branches 90 91 +1
==========================================
+ Hits 452 454 +2
Misses 9 9
Partials 10 10
| Impacted Files | Coverage Δ | |
|---|---|---|
| w3lib/url.py | 98.63% <100.00%> (+0.01%) | :arrow_up: |
Hey! Could you please elaborate on the fix? canonicalize_url's main use is to generate URL fingerprints, i.e. to compare 2 URLs. So, after this change, a URL with whitespace in front and one without it would be considered the same. Is that the idea behind the change? What's the motivation for this?
If we decide that URL-related functions should be stripping whitespace in URLs, shouldn't we be doing it in functions like safe_download_url, etc.?
by the way, a cool commit hash @wRAR :) 1dddddbbdbc...
Is that the idea behind the change? What's the motivation for this?
To me it is to emulate web browser behavior, i.e. entering those 2 URLs in a browser would lead the browser to load the same URL, so it makes sense to me that they are canonicalized the same way.
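For illustration, a minimal sketch of the equality this change introduces (it assumes the stripping from this PR is applied; the exact canonical form otherwise depends on canonicalize_url defaults):

from w3lib.url import canonicalize_url

# With the stripping from this PR, a leading space no longer changes the
# canonical form, so both calls should return the same fingerprint.
with_space = canonicalize_url("  http://www.example.com/page?a=1")
without_space = canonicalize_url("http://www.example.com/page?a=1")
assert with_space == without_space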
If we decide that URL-related functions should be stripping whitespace in URLs, shouldn't we be doing it in functions like safe_download_url, etc.?
Makes sense to me.
To me it is to emulate web browser behavior, i.e. entering those 2 URLs in a browser would lead the browser to load the same URL, so it makes sense to me that they are canonicalized the same way.
Our canonicalization is different though; it doesn't even guarantee that the URL will load the same afterwards. For example, the order of query params may change, and empty params are removed by default, which may affect the result. I think it's an anti-pattern (or a bug) to send canonicalize_url output for downloading; that is what the other w3lib.url functions are for.
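For example (a small sketch; exact output may vary across w3lib versions and options):

from w3lib.url import canonicalize_url

# Query arguments are sorted, so the canonical form may differ from what
# the server was originally asked for.
print(canonicalize_url("http://www.example.com/do?b=2&a=1"))
# http://www.example.com/do?a=1&b=2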
The thing that bothers me is that currently, in master, the stripping behavior of canonicalize_url is not compatible with safe_download_url and friends. E.g. someone makes an error and sends a request with a space in front, and it fails (because safe_download_url doesn't strip it); then the request is fixed (without a space), but the canonicalize_url result is the same, so the second request is filtered out.
If query parameters use a different order (an example of what canonicalize_url handles), it's very likely that the download result would be the same (not guaranteed, but very likely), and that the difference is spurious, so it makes total sense to return the same fingerprint. If there is a space in front of the URL, it's almost guaranteed that the download result would differ with and without stripping, so we should be returning different fingerprints, but after this change we return the same fingerprint.
Because of that, I think we should only release this change together with changes in the other functions, and probably revert the change for now (to keep master releasable, unless the other changes land soon enough).
@kmike Could you elaborate a little on which other changes would allow this to stay? I can try wrapping the edges depending on how complex this seems to be.
@felipeboffnunes it could be as easy as calling url.strip() in safe_download_url and safe_url_string, or maybe in some other functions in w3lib.url. The challenge would be to figure out what's the effect of doing so, and what's the right thing to do.
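As a rough sketch of that direction (a hypothetical wrapper, not actual w3lib code; the name is made up for illustration):

from w3lib.url import safe_url_string

def safe_url_string_stripped(url, *args, **kwargs):
    # Hypothetical helper: strip surrounding whitespace before delegating
    # to the existing safe_url_string, mirroring what canonicalize_url now
    # does. Which stripping primitive is right is discussed below.
    return safe_url_string(url.strip(), *args, **kwargs)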
Sorry for that @Gallaecio @felipeboffnunes, it seems I keep complicating things, in this and some other PRs, trying to figure out what's the right thing to do :)
Speaking of the right thing to do, I wonder if calling url.strip is actually right, or if we should use w3lib.html.strip_html5_whitespace instead, or maybe the right thing to do is something else entirely. What is whitespace, basically; what should we strip? E.g. does .strip remove Unicode whitespace or not, and which behavior is right? Another question: it seems there are some references to RFCs discussed here about whitespace not being allowed in front of a URL. Is it allowed at the end of a URL? If it is, should we use lstrip instead of strip? Or maybe strip is fine? Are the rules the same for whitespace at the beginning and at the end?
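To make the difference concrete, a small sketch (whether strip_html5_whitespace is the right primitive is exactly the open question; the snippet only shows how the two differ):

from w3lib.html import strip_html5_whitespace

url = "\u00a0 http://www.example.com/ \n"

# str.strip() with no argument removes any Unicode whitespace, including
# NO-BREAK SPACE:
print(repr(url.strip()))                  # 'http://www.example.com/'

# strip_html5_whitespace() removes only space, tab, LF, FF and CR, so the
# leading NO-BREAK SPACE survives:
print(repr(strip_html5_whitespace(url)))  # '\xa0 http://www.example.com/'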
@kmike This may seem like a joke, but I believe the right path would be to go over the current discussions, check whether any of the information provided so far explicitly invalidates the url.strip() approach (or w3lib.html.strip_html5_whitespace) and, if we don't find any assertions about how it may break things, just do the basic thing. This works because, if it breaks things, people will come and call us out here ¯\_(ツ)_/¯
According to the URL living standard, the right way to handle string stripping before URL parsing is:
_ASCII_TAB_OR_NEWLINE = "\t\n\r"
_C0_CONTROL = "".join(chr(n) for n in range(32))
_C0_CONTROL_OR_SPACE = _C0_CONTROL + " "
_ASCII_TAB_OR_NEWLINE_TRANSLATION_TABLE = {
    ord(char): None for char in _ASCII_TAB_OR_NEWLINE
}
input = input.strip(_C0_CONTROL_OR_SPACE)
input = input.translate(_ASCII_TAB_OR_NEWLINE_TRANSLATION_TABLE)
i.e. remove leading and trailing ASCII characters from 0x00 (null) to 0x20 (space), and, in the case of tabs, newlines and carriage returns, also remove them from the middle of the URL if found.
(extract from a Python implementation of the URL parsing and serialization algorithm from the URL living standard that I am hoping to finish by the beginning of next week).
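Applied to a concrete input, that preprocessing would behave as below (a quick sketch that just repeats the constants above; not w3lib code):

_ASCII_TAB_OR_NEWLINE = "\t\n\r"
_C0_CONTROL = "".join(chr(n) for n in range(32))
_C0_CONTROL_OR_SPACE = _C0_CONTROL + " "
_ASCII_TAB_OR_NEWLINE_TRANSLATION_TABLE = {
    ord(char): None for char in _ASCII_TAB_OR_NEWLINE
}

def preprocess_url_input(url):
    # Strip leading/trailing C0 controls and spaces, then drop tabs,
    # newlines and carriage returns wherever they appear.
    url = url.strip(_C0_CONTROL_OR_SPACE)
    return url.translate(_ASCII_TAB_OR_NEWLINE_TRANSLATION_TABLE)

print(preprocess_url_input("  http://www.exa\tmple.com/\n  "))
# http://www.example.com/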