Follow only internal redirects
Hello @mvdbos
I haven't found time to look into the robots.txt filter discussed in the other issue. Sorry! I stumbled on a new question you might be able to shed some light on:
I'm trying to filter out URLs that have been redirected externally. I'm keen to implement a PostFetchFilter to keep it all within the spider. I was wondering if it is possible to get the final URL (after redirects) in a PostFetchFilter? It seems like only the original URL is part of the Resource.
Appreciate any ideas on how you would approach this.
Cheers, Peter
Hi @spekulatius, my apologies for the very late reply. One way (not tested by me) could be this:
- Set the `allow_redirects` option on the Guzzle request handler when you construct it, and set its `track_redirects` sub-option to `true`. This would store info about redirects in the `X-Guzzle-Redirect-History` and `X-Guzzle-Redirect-Status-History` headers.
- If I am not mistaken, `Resource` contains the entire response (`ResponseInterface`), which you can use to inspect the headers.
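
A minimal, untested sketch of the host check those two steps lead to. It assumes the last value of `X-Guzzle-Redirect-History` is the final URL after redirects, which is how I read the Guzzle docs. `redirectedExternally` is a hypothetical helper name, not part of php-spider or Guzzle, and the `Resource` accessors mentioned in the comments are assumptions as well.

```php
<?php
declare(strict_types=1);

// Guzzle side (real option names): redirect tracking is enabled with
//   new \GuzzleHttp\Client(['allow_redirects' => ['track_redirects' => true]]);
//
// Inside a PostFetchFilter the history could then be read with something like
//   $resource->getResponse()->getHeader('X-Guzzle-Redirect-History')
// (assumption: Resource exposes the PSR-7 response this way).

/**
 * Hypothetical helper: true when the last redirect target is on a
 * different host than the URL that was originally requested.
 *
 * @param string   $originalUrl     URL the spider requested
 * @param string[] $redirectHistory values of X-Guzzle-Redirect-History,
 *                                  oldest first
 */
function redirectedExternally(string $originalUrl, array $redirectHistory): bool
{
    if ($redirectHistory === []) {
        return false; // no redirect happened at all
    }

    $finalUrl = (string) end($redirectHistory);

    $originalHost = parse_url($originalUrl, PHP_URL_HOST);
    $finalHost = parse_url($finalUrl, PHP_URL_HOST);

    // If either host cannot be determined, treat the redirect as external
    // so the resource gets filtered rather than silently kept.
    if (!is_string($originalHost) || !is_string($finalHost)) {
        return true;
    }

    // Host names are case-insensitive.
    return strcasecmp($originalHost, $finalHost) !== 0;
}

// Usage with made-up URLs:
var_dump(redirectedExternally(
    'https://example.com/a',
    ['https://example.com/b', 'https://other.example.org/c']
)); // bool(true)
var_dump(redirectedExternally(
    'https://example.com/a',
    ['https://EXAMPLE.com/b']
)); // bool(false)
```

Note that this compares hosts exactly, so a redirect from `example.com` to `www.example.com` counts as external. Whether that is what you want depends on the site being crawled.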
Hello @mvdbos,
no problem. We've all got plenty of issues to take care of :) My open robots.txt issue is a sign of this...
I'll try to get to a solution using the allow_redirects option and let you know how it goes :+1:
Cheers, Peter