
Add normalise_url parameter to Crawly.Middlewares.UniqueRequest

Open adonig opened this issue 1 year ago • 4 comments

This parameter allows you to customize the normalization behavior by supplying a unary normalization function.
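
For illustration, the option could be passed in the tuple form Crawly already uses for middleware options. This is only a sketch of the proposal: the normalise_url key does not exist yet, and MySpider.normalize_url/1 stands for whatever unary function the user supplies.

  config :crawly,
    middlewares: [
      Crawly.Middlewares.DomainFilter,
      # proposed option; MySpider.normalize_url/1 is a user-supplied function
      {Crawly.Middlewares.UniqueRequest, normalise_url: &MySpider.normalize_url/1},
      Crawly.Middlewares.UserAgent
    ]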

adonig · Apr 29 '24 16:04

Sorry, I don't understand the idea behind this change. Could you please explain?

oltarasenko · Apr 29 '24 19:04

It allows you to use a different normalization function without having to replace the whole UniqueRequest middleware. For instance, while some search engines consider URLs with and without a trailing slash to be identical, others do not. Some go even further and treat a trailing /index.html the same way. Some transform the host part of the URL to lowercase, and some even sort the query string parameters alphanumerically or resolve relative paths. For example, I use this normalization function:

  # Keep only http(s) URLs with a host; lowercase the host and drop the
  # port, userinfo, and fragment so equivalent URLs compare as the same.
  def normalize_url(url) do
    parsed = URI.parse(url)

    if parsed.scheme in ["http", "https"] and parsed.host do
      %URI{
        scheme: parsed.scheme,
        host: String.downcase(parsed.host),
        path: parsed.path,
        query: parsed.query
      }
      |> URI.to_string()
    else
      nil
    end
  end
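
For illustration, with made-up input URLs it behaves roughly like this (the fragment is dropped and the host is lowercased; non-HTTP(S) schemes yield nil):

  iex> normalize_url("https://Example.COM/path?q=1#frag")
  "https://example.com/path?q=1"

  iex> normalize_url("ftp://example.com/file.txt")
  nil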

adonig · Apr 29 '24 20:04

Section 6 of RFC 3986 goes a bit deeper into the topic of URL normalization.

adonig · Apr 30 '24 11:04

I found out that Erlang comes with an RFC 3986-compliant URL normalization function: :uri_string.normalize/1
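
For example, in iex it lowercases the scheme and host, removes dot-segments, and drops the default port (example input, with the output I would expect from RFC 3986 syntax-based normalization):

  iex> :uri_string.normalize("HTTPS://Example.COM:443/a/../b")
  "https://example.com/b"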

I believe it's still a good idea to allow people to provide their own implementation, because some might want to extend the behavior described in the RFC, as for example Cloudflare or Kaspersky do.
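
A custom function could then build on it, applying the RFC normalization first and adding one extra, non-RFC rule on top. This is only a hypothetical sketch; stripping a trailing slash stands in for whatever site-specific rule someone might need:

  def normalize_url(url) do
    case :uri_string.normalize(url) do
      # :uri_string.normalize/1 returns an error tuple for invalid input
      {:error, _reason, _info} -> nil
      normalized -> String.trim_trailing(normalized, "/")
    end
  end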

adonig · May 02 '24 07:05