
Robots.txt filtering?

spekulatius opened this issue 4 years ago · 6 comments

Hello @mvdbos,

I hope you are doing well!

I was wondering what your approach (if any) is to using the spider with robots.txt patterns for filtering?

The UriFilter seems to support only allow rules and no disallow rules, and hence wouldn't cover the most common use case of robots.txt files.

I just thought I'd ask whether you have any ideas on this before I build something myself.

Thank you in advance,

Peter

spekulatius avatar Mar 17 '21 20:03 spekulatius

@spekulatius If I am not mistaken, any URI matched by UriFilter::match() is removed. That means it can be used to remove things, which means it can be used to disallow?

This is how the filters are used in DiscovererSet:

    /**
     * Filter out any URI that matches any of the filters
     * @param UriInterface[] $discoveredUris
     */
    private function filter(array &$discoveredUris)
    {
        foreach ($discoveredUris as $k => $uri) {
            foreach ($this->filters as $filter) {
                if ($filter->match($uri)) {
                    unset($discoveredUris[$k]);
                }
            }
        }
    }

It would probably make most sense to implement a new filter that implements PreFetchFilterInterface that takes as a constructor argument the robots.txt file, and returns true if a URI should be skipped.
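As a rough illustration of that idea, here is a minimal sketch of such a filter. The class name is made up, the robots.txt parsing is deliberately naive (it ignores User-agent groups and wildcard syntax), the implements PreFetchFilterInterface declaration is left out so the sketch stands alone, and the URI argument is only assumed to expose getPath():

```php
<?php

/**
 * Hypothetical sketch of the filter discussed here. In php-spider it
 * would declare "implements PreFetchFilterInterface"; that is omitted
 * so this sketch is self-contained.
 */
class RobotsTxtDisallowFilter
{
    /** @var string[] path prefixes parsed from Disallow lines */
    private $disallowedPrefixes = [];

    public function __construct(string $robotsTxtContents)
    {
        // Naive parse: collect every "Disallow:" path. User-agent groups
        // and wildcard rules are ignored for the sake of the sketch.
        foreach (preg_split('/\R/', $robotsTxtContents) as $line) {
            if (preg_match('/^\s*Disallow:\s*(\S+)/i', $line, $m)) {
                $this->disallowedPrefixes[] = $m[1];
            }
        }
    }

    /**
     * Returning true means "skip this URI" (it will be unset by the
     * filter() loop shown above). $uri is assumed to expose getPath().
     */
    public function match($uri): bool
    {
        foreach ($this->disallowedPrefixes as $prefix) {
            if (strpos($uri->getPath(), $prefix) === 0) {
                return true;
            }
        }
        return false;
    }
}
```

The caller would obtain the robots.txt contents however it likes (e.g. with file_get_contents) and pass the raw string into the constructor.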

I would be very happy with a PR that contributes this filter and a test for it. It would be a great feature to add for everyone.

mvdbos avatar Apr 04 '21 10:04 mvdbos

Hello @mvdbos,

Yeah, a prefetch filter makes a lot of sense here. I actually hadn't checked the filters before, as I thought this would somehow already be part of the spider and I was just too blind to see how.

At first glance it's as you describe: the new prefetch filter would parse the robots.txt content for Disallow lines and match each pattern against the current URI. If one matches, true would be returned and the URI ignored. Sounds pretty straightforward at this point :)

Would one pass the contents of the robots.txt file into the filter, or a file path/URL to load with file_get_contents?

Cheers, Peter

spekulatius avatar Apr 04 '21 13:04 spekulatius

I think passing in the contents as a string keeps the filter simpler and single-responsibility. It will be fine, since robots.txt files are generally not huge.

mvdbos avatar Apr 04 '21 15:04 mvdbos

Yes, makes sense. I'll prepare a PR as draft to have something to talk about over the coming days :+1:

spekulatius avatar Apr 04 '21 17:04 spekulatius

Hey @mvdbos

I've gotten a bit busy with contracting work and will need a bit more time on this. I hope that isn't an issue. Just thought I should update you, as it's already been a week.

Cheers, Peter

spekulatius avatar Apr 13 '21 12:04 spekulatius

Thanks for the update. Take your time.

mvdbos avatar Apr 14 '21 05:04 mvdbos

Hey @spekulatius , a bit late but I just merged #93. Does that satisfy your need?

I can imagine you already found another solution in the meantime. :-D

mvdbos avatar Aug 14 '23 11:08 mvdbos

Hey @mvdbos

No worries, the PR looks like a good solution! I'll give it a try when a chance comes up! :)

Peter

spekulatius avatar Aug 14 '23 13:08 spekulatius