crawly Add normalise_url parameter to Crawly.Middlewares.UniqueRequest

This parameter allows customizing the normalization behavior by passing a unary normalization function.

Apr 29 '24 16:04 adonig

Sorry, I don't understand the idea behind this change. Could you please explain?

Apr 29 '24 19:04 oltarasenko

It lets you use a different normalization function without replacing the whole UniqueRequest middleware. For instance, some search engines treat URLs with and without a trailing slash as identical, while others do not. Some go further and treat /index.html the same way. Some lowercase the host part of the URL, and some even sort the query string parameters alphanumerically or resolve relative paths. For example, I use this normalization function:

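  def normalize_url(url) do
    parsed = URI.parse(url)

    # Only absolute http(s) URLs with a host are normalized
    if parsed.scheme in ["http", "https"] and parsed.host do
      # Rebuild the URL from selected parts only; port, userinfo and
      # fragment are not copied over, so they are dropped
      %URI{
        scheme: parsed.scheme,
        host: String.downcase(parsed.host),
        path: parsed.path,
        query: parsed.query
      }
      |> URI.to_string()
    else
      # Anything else has no normalized form
      nil
    end
  end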
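
To illustrate why a pluggable unary function is useful, here is a minimal sketch of deduplication keyed on a caller-supplied normalizer. It is not Crawly's implementation: `DedupSketch` and `unique/2` are hypothetical names made up for this example, and the trailing-slash rule is only one possible choice.

```elixir
defmodule DedupSketch do
  # Hypothetical helper, not part of Crawly: keeps the first URL seen for
  # each normalized form, using whatever unary function the caller passes.
  # Without a function, URLs are compared as-is.
  def unique(urls, normalise_url \\ &Function.identity/1) do
    Enum.uniq_by(urls, normalise_url)
  end
end

urls = ["http://example.com/a", "http://example.com/a/"]

# Default: both URLs are kept, because the strings differ
DedupSketch.unique(urls)

# With a trailing-slash-insensitive normalizer only the first one is kept
DedupSketch.unique(urls, &String.trim_trailing(&1, "/"))
```

Swapping the function changes what counts as a duplicate without touching the deduplication logic itself, which is the point of the parameter.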

Apr 29 '24 20:04 adonig

Section 6 of RFC 3986 goes a bit deeper into the topic of URL normalization.

Apr 30 '24 11:04 adonig

I found out that Erlang comes with an RFC 3986-compliant URL normalization function: :uri_string.normalize/1
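
A quick sketch of calling it from Elixir, using inputs taken from the examples in the Erlang/OTP documentation. The expected results in the comments follow those documented examples; when given binaries the function returns binaries, and on invalid input it returns an error tuple instead of a string.

```elixir
# :uri_string ships with Erlang/OTP, so no extra dependency is needed.

# Dot segments are resolved
:uri_string.normalize("/a/b/c/./../../g")
# => "/a/g"

# The default port is removed and the empty path becomes "/"
:uri_string.normalize("http://localhost:80")
# => "http://localhost/"
```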

I believe it's still a good idea to let people provide their own implementation, because some might want to go beyond what the RFC specifies, as Cloudflare or Kaspersky do, for example.
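
As a rough sketch of such an extension, the function below applies :uri_string.normalize/1 first and then one extra rule that the RFC does not require: sorting the query string parameters. `ExtendedNormalizer` is a hypothetical module written for this example, and returning nil for invalid input is an assumption modelled on the function shown earlier in this thread.

```elixir
defmodule ExtendedNormalizer do
  # Hypothetical example: RFC 3986 normalization plus sorted query parameters.
  def normalise_url(url) when is_binary(url) do
    case :uri_string.normalize(url) do
      normalized when is_binary(normalized) ->
        uri = URI.parse(normalized)
        URI.to_string(%URI{uri | query: sort_query(uri.query)})

      # :uri_string.normalize/1 returns an error tuple for invalid input
      {:error, _reason, _term} ->
        nil
    end
  end

  defp sort_query(nil), do: nil

  defp sort_query(query) do
    # query_decoder/1 yields {key, value} pairs and keeps duplicate keys
    query
    |> URI.query_decoder()
    |> Enum.sort()
    |> URI.encode_query()
  end
end

# Both orderings of the same parameters produce the same normalized URL
ExtendedNormalizer.normalise_url("http://example.com/p?b=2&a=1") ==
  ExtendedNormalizer.normalise_url("http://example.com/p?a=1&b=2")
```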

May 02 '24 07:05 adonig