
Add sitemap scraper

Open foxt451 opened this issue 1 month ago • 0 comments

The framework was mostly copied from cheerio-scraper (trimmed in many places), and the request handler was inspired by the sitemap scraper in WCC.

There has been discussion about how to avoid duplicating code between WCC and this repo, and some advised extracting the sitemap scraper into a package. However, when I checked the sitemap crawler code in WCC, it turned out to be tightly coupled to WCC and quite short, so I just copied it over with modifications.

However, the one thing I copied without any changes is the discoverValidSitemaps util from WCC. I'd like to extract it somewhere, e.g. into scraper-tools, because it seems like quite a generic function.

Tested locally. For now, the scraper just pushes a dataset item with a URL and status code for each page.
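
For illustration, a dataset item of that kind could look roughly like the sketch below. The `PageResult` type, the `toPageResult` helper, and the field names `url` and `statusCode` are assumptions made for this example, not the names used in the actual code.

```typescript
// Hypothetical shape of one dataset item: the page URL and the HTTP status
// code of its response. Field names are assumptions for this sketch.
interface PageResult {
    url: string;
    statusCode: number;
}

// Build a dataset item from a page URL and the status code of its response.
function toPageResult(url: string, statusCode: number): PageResult {
    return { url, statusCode };
}

// Example items, one per page listed in a sitemap (placeholder URLs).
const items: PageResult[] = [
    toPageResult('https://example.com/', 200),
    toPageResult('https://example.com/missing', 404),
];
```

In the scraper itself, each such object would be pushed to the dataset from the request handler rather than collected in an array.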

Closes https://github.com/apify/apify-sdk-js/issues/486

foxt451, Dec 18 '25 20:12