Improve HTTP status code handling
There are several issues now which are related to the way we handle HTTP status codes in crawlers.
- `CheerioCrawler` throws an exception when it encounters a 500+ status code and processes 400+ status codes.
- `PuppeteerCrawler` does not throw an exception for any status code.
- `SessionPool` makes both crawlers throw on 401, 403 and 429 status codes.
None of the above is configurable. We need to design an easy-to-understand process and configuration for the handling of status codes. Maybe it could all be left to `SessionPool` by making `useSessionPool` true by default. Or we could have two configurable layers: a `throwOnStatusCodes` option on the crawlers and a `retireSessionOnStatusCodes` option on `SessionPool`.
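For illustration, the two-layer idea might look like this. This is only a sketch of the proposal above; neither `throwOnStatusCodes` nor `retireSessionOnStatusCodes` exists as an actual option:

```js
// Hypothetical API sketch — these option names are not implemented.
const crawler = new CheerioCrawler({
    // Layer 1: the crawler itself fails the request on these codes.
    throwOnStatusCodes: [500, 502, 503],
    sessionPoolOptions: {
        // Layer 2: the session pool retires the session on these codes.
        retireSessionOnStatusCodes: [401, 403, 429],
    },
    // ...
});
```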
Right. We've run into a situation where we need to handle a 403 explicitly, and have so far come up empty on how that could be best achieved.
It doesn't appear any of our handler code is ever invoked when this happens. While I could edit crawler_utils.js to make it so it is, does anyone know of a simpler workaround?
Yeah, this is long overdue and we still haven't found the time to add those features. A better, though similarly awkward, workaround than editing `crawler_utils.js` would be this:
```js
const { STATUS_CODES_BLOCKED } = require('apify/build/constants');

// STATUS_CODES_BLOCKED looks like this: [401, 403, 429], so to stop
// blocking on 403, remove it from the array in place:
STATUS_CODES_BLOCKED.splice(STATUS_CODES_BLOCKED.indexOf(403), 1);
```
It's important to modify the array in place, because the crawlers hold a reference to it. You can also inject your own custom status codes. It's internal, so it can stop working at any point without notice, but for the time being, it should solve your problem.
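To illustrate the in-place mutation without touching the real internal module, here is the same pattern on a plain stand-in array:

```javascript
// Stand-in for the internal STATUS_CODES_BLOCKED array.
const STATUS_CODES_BLOCKED = [401, 403, 429];

// Remove 403 in place so a 403 response would no longer retire the session.
STATUS_CODES_BLOCKED.splice(STATUS_CODES_BLOCKED.indexOf(403), 1);

// Inject a custom status code, again mutating the same array.
STATUS_CODES_BLOCKED.push(503);

console.log(STATUS_CODES_BLOCKED); // [ 401, 429, 503 ]
```

Reassigning the variable (`STATUS_CODES_BLOCKED = [...]`) would not work, since the crawlers keep a reference to the original array.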
#1423 adds configurability for the session pool:

```js
const crawler = new CheerioCrawler({
    sessionPoolOptions: { blockedStatusCodes: [401, 403, 429, 500] },
    // ...
});
```
This will be available in Crawlee 3.0.2.
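Once that lands, the explicit-403 use case above could presumably be handled by clearing the blocked list and checking the status code in user code. A sketch, assuming an empty `blockedStatusCodes` array disables the session pool's automatic blocking entirely:

```js
const crawler = new CheerioCrawler({
    // Nothing is auto-blocked, so a 403 reaches the request handler.
    sessionPoolOptions: { blockedStatusCodes: [] },
    async requestHandler({ response, session }) {
        if (response.statusCode === 403) {
            // Handle the block explicitly, e.g. retire the session.
            session.retire();
            return;
        }
        // ...
    },
});
```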