Robots.txt filtering?
Hello @mvdbos,
I hope you are doing well!
I was wondering what your approach (if any) is to using the spider with robots.txt patterns for filtering?
The UriFilter seems to support only allow rules, not disallow rules, and hence wouldn't cover the most common use case of robots.txt files.
I just thought I'd ask whether you have any ideas on this before I build something myself.
Thank you in advance,
Peter
@spekulatius If I am not mistaken, any URI matched by UriFilter.match() is removed. That means it can be used to remove things, which means it can be used to disallow?
This is how the filters are used in DiscovererSet:
/**
 * Filter out any URI that matches any of the filters
 *
 * @param UriInterface[] $discoveredUris
 */
private function filter(array &$discoveredUris)
{
    foreach ($discoveredUris as $k => $uri) {
        foreach ($this->filters as $filter) {
            if ($filter->match($uri)) {
                unset($discoveredUris[$k]);
            }
        }
    }
}
It would probably make the most sense to implement a new filter that implements PreFetchFilterInterface, takes the robots.txt file as a constructor argument, and returns true from match() if a URI should be skipped.
I would be very happy with a PR that contributes this filter along with a test for it. It would be a great feature to add for everyone.
Hello @mvdbos,
Yeah, a prefetch filter makes a lot of sense here. I actually hadn't looked at the filters before, as I assumed this was somehow part of the spider already and I was just too blind to see it.
At first glance it is as you describe: the new prefetch filter would parse the robots.txt content for Disallow lines and match each pattern against the current URI. If one matches, true is returned and the URI is skipped. Sounds pretty straightforward at this point :)
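The Disallow-line parsing described above could look roughly like this. A minimal sketch, assuming plain prefix patterns and ignoring User-agent grouping for brevity (a complete filter would need to honor it); the function name is my own, not part of php-spider:

```php
<?php
// Sketch: extract Disallow path prefixes from robots.txt text.
// Comments (#...) are stripped; empty Disallow values (meaning
// "allow everything") are skipped.
function parseDisallowLines(string $robotsTxt): array
{
    $patterns = [];
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line));
        if (preg_match('/^Disallow:\s*(\S+)/i', $line, $m)) {
            $patterns[] = $m[1];
        }
    }
    return $patterns;
}
```

For example, `parseDisallowLines("User-agent: *\nDisallow: /private/\nAllow: /")` would yield `['/private/']`.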
Would one pass the contents of the robots.txt file into the filter, or a file path/URL to load with file_get_contents?
Cheers, Peter
I think passing in the contents as a string makes the filter simpler and keeps it to a single responsibility. That will be fine, since robots.txt files are generally not large.
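Putting the pieces together, the filter could be sketched as below. This is only an illustration under the assumptions discussed here: the constructor takes the robots.txt contents as a string, and match() returns true when a URI should be skipped. To stay self-contained the example matches against a plain path string; the real filter would implement php-spider's PreFetchFilterInterface and receive a URI object instead.

```php
<?php
// Hypothetical sketch, not the merged implementation. Parses Disallow
// lines once in the constructor, then does simple prefix matching.
// User-agent sections and wildcard rules are deliberately ignored here.
class RobotsTxtDisallowFilter
{
    /** @var string[] disallowed path prefixes */
    private $disallowed = [];

    public function __construct(string $robotsTxtContents)
    {
        foreach (preg_split('/\R/', $robotsTxtContents) as $line) {
            $line = trim(preg_replace('/#.*$/', '', $line));
            if (preg_match('/^Disallow:\s*(\S+)/i', $line, $m)) {
                $this->disallowed[] = $m[1];
            }
        }
    }

    // True means "matched a Disallow rule", so the URI gets skipped,
    // mirroring how DiscovererSet unsets URIs whose filters match.
    public function match(string $uriPath): bool
    {
        foreach ($this->disallowed as $prefix) {
            if (strpos($uriPath, $prefix) === 0) {
                return true;
            }
        }
        return false;
    }
}
```

Usage would then be e.g. `new RobotsTxtDisallowFilter(file_get_contents('https://example.com/robots.txt'))`, added to the spider's filters like any other prefetch filter.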
Yes, makes sense. I'll prepare a PR as draft to have something to talk about over the coming days :+1:
Hey @mvdbos
I've gotten a bit busy with contracting work and will need a bit more time on this. I hope that isn't an issue. Just thought I should update you, as it's already been a week.
Cheers, Peter
Thanks for the update. Take your time.
Hey @spekulatius , a bit late but I just merged #93. Does that satisfy your need?
I can imagine you already found another solution in the meantime. :-D
Hey @mvdbos
No worries, the PR looks like a good solution! I'll give it a try when a chance comes up! :)
Peter