Add normalise_url parameter to Crawly.Middlewares.UniqueRequest
This parameter allows to customize the normalization behavior using a unary normalization function.
Sorry, I don't understand the idea behind this change. Could you please explain?
It allows you to use a different normalization function without having to replace the whole UniqueRequest middleware. For instance, while some search engines consider URLs with and without a trailing slash as identical, others do not. Some go even further and treat /index.html the same way. Some transform the host part of the URL to lowercase and some even sort the query string parameters alphanumerically or resolve relative paths. For example I use this normalization function:
def normalize_url(url) do
parsed = URI.parse(url)
if parsed.scheme in ["http", "https"] and parsed.host do
%URI{
scheme: parsed.scheme,
host: String.downcase(parsed.host),
path: parsed.path,
query: parsed.query
}
|> URI.to_string()
else
nil
end
end
Section 6 of RFC 3986 goes a bit deeper into the topic of URL normalization.
I found out that Erlang comes with a RFC 3986-compliant URL normalization function: :uri_string.normalize/1
I believe it's still a good idea to allow people to provide their own implementation, because some might want to extend the behavior of the RFC, like for example Cloudflare or Kaspersky do.