Following redirects
Howdy! Just wondering if I'm implementing this right. I need to follow redirects, and there doesn't seem to be an option to toggle this, so I tried implementing it this way. It seems to work, but I'd like some feedback!
Spidr.site(@url, max_depth: 2, limit: 20) do |spider|
  spider.every_redirect_page do |page|
    spider.visit_hosts << URI.parse(page.location).host
    spider.enqueue page.location
  end
end
Seems to throw an error if the location is "index.html" or similar...
Is the error coming from spidr or your code example? page.location grabs the Location header which may not always be absolute. Maybe try page.to_absolute(page.location)?
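To illustrate why a relative Location could trip things up, here is a minimal sketch using only Ruby's standard URI library, not Spidr itself. The page URL and the Location value are made-up examples, and this is only one plausible cause of the error: a relative reference such as "index.html" has no host, so URI.parse(...).host returns nil until the value is resolved against the URL of the page that sent the redirect.

```ruby
require "uri"

# Hypothetical values for illustration: the URL of the page that issued the
# redirect, and a relative Location header like the one reported above.
page_url = "http://example.com/docs/old.html"
location = "index.html"

# A relative reference has no host component, so this is nil.
relative_host = URI.parse(location).host

# Resolving against the page's own URL yields an absolute URI with a host.
absolute = URI.join(page_url, location)

puts relative_host.inspect
puts absolute
puts absolute.host
```

In the snippet from the question, that nil host would be appended to visit_hosts and the bare "index.html" would be enqueued, which may be where things go wrong.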
Probably should add to README.
Spidr should automatically follow redirects, so the above code is redundant. The Page#each_url method converts everything yielded by Page#each_link to an absolute URL. Page#each_link in turn calls Page#each_redirect, which checks for the Location header. If you use page.location directly, it may not be an absolute URL, so you'll need to call page.to_absolute(page.location).
I might consider adding Page#redirect_urls or Page#location_urls which would return absolute URLs for convenience.
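As a rough idea of what such a convenience method might do, here is a standalone sketch built on Ruby's standard URI library. The redirect_urls helper below is hypothetical and is not part of Spidr; it assumes the caller passes in the page's URL and the raw Location values, both of which are invented for the example.

```ruby
require "uri"

# Hypothetical standalone helper, not part of Spidr: resolves each raw
# Location value against the URL of the page that returned it.
def redirect_urls(page_url, locations)
  base = URI.parse(page_url.to_s)
  # URI#merge leaves absolute URLs unchanged and resolves relative ones.
  locations.map { |location| base.merge(location) }
end

urls = redirect_urls(
  "http://example.com/a/b.html",
  ["index.html", "/login", "http://other.example/x"]
)
urls.each { |url| puts url }
```

Relative, root-relative, and already-absolute values all come back as absolute URIs, so the caller never has to check which kind it received.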