
Strip spaces in canonicalize_url

Open Gallaecio opened this issue 6 years ago • 3 comments

Fixes #132

Gallaecio avatar Sep 17 '19 12:09 Gallaecio

This definitely works for blank spaces, but in other cases, like a URL that starts with a colon, we will interpret it as a relative path, while other tools, like cURL, report an error: curl: (3) Bad URL, colon is first character (since it could be that you are missing the scheme). What do you think @Gallaecio? Should this be addressed or not?

noviluni avatar Feb 07 '20 17:02 noviluni

If it does not work in a web browser (and I don’t think it does), I don’t think we need to aim to support it either.

Gallaecio avatar Feb 11 '20 17:02 Gallaecio

Incoming branch is 101 commits behind

yozachar avatar Jul 20 '22 06:07 yozachar

Codecov Report

Merging #136 (849bae1) into master (4ba3539) will increase coverage by 0.01%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #136      +/-   ##
==========================================
+ Coverage   95.96%   95.98%   +0.01%     
==========================================
  Files           6        6              
  Lines         471      473       +2     
  Branches       90       91       +1     
==========================================
+ Hits          452      454       +2     
  Misses          9        9              
  Partials       10       10              
Impacted Files Coverage Δ
w3lib/url.py 98.63% <100.00%> (+0.01%) :arrow_up:

codecov[bot] avatar Oct 28 '22 16:10 codecov[bot]

Hey! Could you please elaborate on the fix? canonicalize_url's main use is to generate URL fingerprints, i.e. to compare two URLs. So, after this change, a URL with leading whitespace and one without it would be considered the same. Is that the idea behind the change? What's the motivation for it?

If we decide that URL-related functions should be stripping whitespace in URLs, shouldn't we be doing it in functions like safe_download_url, etc.?

kmike avatar Oct 29 '22 17:10 kmike
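To make the fingerprint concern above concrete, here is a minimal plain-Python sketch (not the actual w3lib code; fingerprint is a hypothetical stand-in for hashing canonicalize_url output): if canonicalization strips surrounding whitespace before hashing, the two inputs collide into one fingerprint.

```python
import hashlib

def fingerprint(url: str) -> str:
    # Hypothetical canonicalization: strip surrounding whitespace,
    # then hash the result (a stand-in for hashing canonicalize_url output).
    return hashlib.sha1(url.strip().encode()).hexdigest()

# A URL with leading whitespace and one without now map to the same
# fingerprint, so a deduplicating crawler would treat them as one request:
print(fingerprint("  https://example.com/page") == fingerprint("https://example.com/page"))  # True
```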

by the way, a cool commit hash @wRAR :) 1dddddbbdbc...

kmike avatar Oct 29 '22 17:10 kmike

Is that the idea behind the change? What's the motivation for it?

To me it is to emulate web browser behavior, i.e. entering those 2 URLs in a browser would lead the browser to load the same URL, so it makes sense to me that they are canonicalized the same way.

If we decide that URL-related functions should be stripping whitespace in URLs, shouldn't we be doing it in functions like safe_download_url, etc.?

Makes sense to me.

Gallaecio avatar Oct 29 '22 18:10 Gallaecio

To me it is to emulate web browser behavior, i.e. entering those 2 URLs in a browser would lead the browser to load the same URL, so it makes sense to me that they are canonicalized the same way.

Our canonicalization is different though; it doesn't even guarantee that the URL will load the same afterwards. For example, the order of query params may change, and empty params are removed by default, which may affect the result. I think it's an anti-pattern (or a bug) to send canonicalize_url output for downloading; that's what the other w3lib.url functions are for.

kmike avatar Oct 29 '22 21:10 kmike
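The kind of canonicalization described above can be sketched with the standard library alone (a toy function, not w3lib's implementation): sorting query parameters and dropping empty ones makes equivalent URLs compare equal, but the output is not necessarily a URL you should send for downloading.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonical_sketch(url: str) -> str:
    # Toy canonicalization in the spirit described above:
    # sort query parameters and drop empty ones, so equivalent
    # URLs compare equal even though the result may differ from
    # the exact URL the server expects.
    scheme, netloc, path, query, fragment = urlsplit(url)
    params = sorted((k, v) for k, v in parse_qsl(query, keep_blank_values=True) if v)
    return urlunsplit((scheme, netloc, path, urlencode(params), fragment))

print(canonical_sketch("https://example.com/?b=2&a=1&empty="))  # https://example.com/?a=1&b=2
```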

The thing that bothers me is that currently, in master, the stripping behavior of canonicalize_url is not compatible with safe_download_url and friends. E.g. someone makes an error and sends a request with a leading space; it fails (because safe_download_url doesn't strip it). Then the request is fixed (without the space), but the canonicalize_url result is the same, so the second request is filtered out.

If query parameters use a different order (an example of what canonicalize_url handles), it's very likely that the download result would be the same (not guaranteed, but very likely) and that the difference is spurious, so it makes total sense to return the same fingerprint. If there is a space in front of the URL, it's almost guaranteed that the download result would be different with and without stripping, so we should be returning different fingerprints; but after this change we return the same fingerprint.

kmike avatar Nov 04 '22 15:11 kmike

Because of that, I think we should only release this change together with the changes to the other functions, and probably revert it for now (to keep master releasable, unless the other changes land soon enough).

kmike avatar Nov 04 '22 15:11 kmike

@kmike Could you elaborate a little on which other changes would allow this to stay? I can try wrapping the edges depending on how complex this seems to be.

felipeboffnunes avatar Nov 04 '22 15:11 felipeboffnunes

@felipeboffnunes it could be as easy as calling url.strip() in safe_download_url and safe_url_string, or maybe in some other functions in w3lib.url. The challenge would be to figure out what's the effect of doing so, and what's the right thing to do.

Sorry for that @Gallaecio @felipeboffnunes; it seems I keep complicating things, in this and some other PRs, while trying to figure out what the right thing to do is :)

Speaking of the right thing to do, I wonder if calling url.strip is actually right, whether we should use w3lib.html.strip_html5_whitespace instead, or whether the right approach is something else entirely. What counts as whitespace, basically; what should we strip? E.g. does .strip remove Unicode whitespace or not, and which behavior is right? Another question: some RFCs referenced in this discussion disallow whitespace in front of the URL. Is whitespace allowed at the end of a URL? If it is, should we use lstrip instead of strip, or is strip fine? Are the rules the same for whitespace at the beginning and at the end?

kmike avatar Nov 04 '22 15:11 kmike
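On the .strip question above: Python's str.strip() with no argument removes any character for which str.isspace() is true, which includes Unicode whitespace such as the non-breaking space, while an explicit ASCII-only character set leaves Unicode whitespace in place. A quick illustration:

```python
# str.strip() with no argument removes any Unicode whitespace,
# including the non-breaking space (U+00A0):
print("\u00a0 https://example.com \t".strip())  # https://example.com

# An explicit ASCII-only character set leaves Unicode whitespace alone:
print("\u00a0https://example.com".strip(" \t\n\r"))  # \xa0https://example.com
```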

@kmike This may seem like a joke, but I believe the right path is to review the current discussions, check whether any of the information provided explicitly invalidates the url.strip() approach (or w3lib.html.strip_html5_whitespace), and, if we don't find any such assertions, just do the basic thing. This works because, if it breaks stuff, people will come here and call us out ¯\_(ツ)_/¯

felipeboffnunes avatar Nov 04 '22 16:11 felipeboffnunes

According to the URL living standard, the right way to handle string stripping before URL parsing is:

    # characters the URL standard strips or removes before parsing
    _ASCII_TAB_OR_NEWLINE = "\t\n\r"
    _C0_CONTROL = "".join(chr(n) for n in range(32))
    _C0_CONTROL_OR_SPACE = _C0_CONTROL + " "
    _ASCII_TAB_OR_NEWLINE_TRANSLATION_TABLE = {
        ord(char): None for char in _ASCII_TAB_OR_NEWLINE
    }
    # strip leading/trailing C0 controls and spaces, then delete
    # ASCII tabs and newlines anywhere in the string
    input = input.strip(_C0_CONTROL_OR_SPACE)
    input = input.translate(_ASCII_TAB_OR_NEWLINE_TRANSLATION_TABLE)

i.e. remove leading and trailing ASCII characters from 0x00 (null) through 0x20 (space), and, in the case of tabs, new lines and carriage returns, also remove them from the middle of the URL if found.

(extract from a Python implementation of the URL parsing and serialization algorithm from the URL living standard that I am hoping to finish by the beginning of next week).

Gallaecio avatar Nov 04 '22 16:11 Gallaecio
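For reference, the spec-described stripping above can be packaged as a small self-contained function (a sketch of the same logic; strip_url_input is a hypothetical name, not the implementation Gallaecio refers to):

```python
# Sketch of the stripping step described in the URL living standard.
_ASCII_TAB_OR_NEWLINE = "\t\n\r"
_C0_CONTROL = "".join(chr(n) for n in range(32))
_C0_CONTROL_OR_SPACE = _C0_CONTROL + " "
_TRANSLATION_TABLE = {ord(char): None for char in _ASCII_TAB_OR_NEWLINE}

def strip_url_input(url: str) -> str:
    # remove leading/trailing C0 controls (0x00-0x1F) and spaces...
    url = url.strip(_C0_CONTROL_OR_SPACE)
    # ...then delete tabs, newlines and carriage returns anywhere
    return url.translate(_TRANSLATION_TABLE)

print(strip_url_input(" \x00https://exam\tple.com/\n "))  # https://example.com/
```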