
Follow only internal redirects

Open spekulatius opened this issue 4 years ago • 2 comments

Hello @mvdbos

I haven't found time to look into the robots.txt filter discussed in the other issue. Sorry! I stumbled on a new question you might be able to shed some light on:

I'm trying to filter out URLs that have been redirected externally. I'm keen to implement a PostFetchFilter to keep it all within the spider. Is it possible to get the final URL (after redirects) in a PostFetchFilter? It seems like only the original URL is part of the Resource.

Appreciate any ideas on how you would approach this.

Cheers, Peter

spekulatius avatar Jul 24 '21 09:07 spekulatius

Hi @spekulatius, my apologies for the very late reply. One way (not tested by me) could be this:

  • Enable redirect tracking on the Guzzle request handler when you construct it: set track_redirects to true inside the allow_redirects option. Guzzle then records the redirect chain in the X-Guzzle-Redirect-History and X-Guzzle-Redirect-Status-History response headers.
  • If I am not mistaken, Resource contains the entire response (ResponseInterface), which you can use to inspect the headers.
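
A minimal, untested sketch of the host check this approach implies. It assumes the client was configured with `'allow_redirects' => ['track_redirects' => true]` and that the redirect chain has already been read from the response, for example with `$response->getHeader('X-Guzzle-Redirect-History')`. The function name `redirectsStayInternal` is hypothetical and not part of php-spider or Guzzle; wiring it into a PostFetchFilter is left out because how the response is reached from the Resource is not confirmed here.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns true when every URL in the redirect chain
 * is on the same host as the originally requested URL.
 *
 * $redirectHistory is assumed to hold the values of the
 * X-Guzzle-Redirect-History header (one URL per redirect hop).
 */
function redirectsStayInternal(string $originalUrl, array $redirectHistory): bool
{
    $originalHost = parse_url($originalUrl, PHP_URL_HOST);
    if (!is_string($originalHost)) {
        // The host cannot be determined, so treat the chain as not internal.
        return false;
    }

    foreach ($redirectHistory as $redirectUrl) {
        $host = parse_url($redirectUrl, PHP_URL_HOST);
        // Host names are case-insensitive, so compare without regard to case.
        if (!is_string($host) || strcasecmp($host, $originalHost) !== 0) {
            return false;
        }
    }

    // An empty history means no redirect happened at all.
    return true;
}
```

Note that this compares hosts exactly, so a redirect from example.com to www.example.com would count as external. Whether that is desirable depends on the site being crawled.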

mvdbos avatar Oct 30 '21 21:10 mvdbos

Hello @mvdbos,

no problem. We've all got plenty of issues to take care of :) My open robots.txt issue is a sign of this...

I'll try to get to a solution using allow_redirects and let you know how it goes :+1:

Cheers, Peter

spekulatius avatar Nov 07 '21 18:11 spekulatius