Improve HTTP status code handling
There are several issues now which are related to the way we handle HTTP status codes in crawlers.
- `CheerioCrawler` throws an exception when it encounters a 500+ status code and processes 400+ status codes.
- `PuppeteerCrawler` does not throw an exception for any status code.
- `SessionPool` makes both crawlers throw on 401, 403 and 429 status codes.
None of the above is configurable. We need to design an easy-to-understand process and configuration for the handling of status codes. Maybe it could all be left to `SessionPool` by making `useSessionPool` true by default. Or we could have two configurable layers: a `throwOnStatusCodes` option on the crawlers and a `retireSessionOnStatusCodes` option on `SessionPool`.
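For illustration, the two-layer idea might look like this. This is only a sketch of the proposal above; neither `throwOnStatusCodes` nor `retireSessionOnStatusCodes` exists as an actual option:

```js
// Hypothetical API sketch — these option names are not implemented.
const crawler = new CheerioCrawler({
    // Layer 1: the crawler itself fails the request on these codes.
    throwOnStatusCodes: [500, 502, 503],
    sessionPoolOptions: {
        // Layer 2: the session pool retires the session on these codes.
        retireSessionOnStatusCodes: [401, 403, 429],
    },
    // ...
});
```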
Right. We've run into a situation where we need to handle a 403 explicitly, and have so far come up empty on how that could be best achieved.
It doesn't appear any of our handler code is ever invoked when this happens. While I could edit crawler_utils.js to make it so it is, does anyone know of a simpler workaround?
Yeah, this is long overdue and we still haven't found the time to add those features. A better, though similarly awkward, workaround than editing `crawler_utils.js` would be this:
```js
const { STATUS_CODES_BLOCKED } = require('apify/build/constants');

// STATUS_CODES_BLOCKED looks like this: [401, 403, 429], so to stop
// blocking on 403, remove it from the array in place:
STATUS_CODES_BLOCKED.splice(STATUS_CODES_BLOCKED.indexOf(403), 1);
```
It's important to modify the array in place, because the crawlers hold a reference to it. You can also inject your own custom status codes. It's internal, so it can stop working at any point without notice, but for the time being, it should solve your problem.
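To illustrate the in-place mutation without touching the real internal module, here is the same pattern on a plain stand-in array:

```javascript
// Stand-in for the internal STATUS_CODES_BLOCKED array.
const STATUS_CODES_BLOCKED = [401, 403, 429];

// Remove 403 in place so a 403 response would no longer retire the session.
STATUS_CODES_BLOCKED.splice(STATUS_CODES_BLOCKED.indexOf(403), 1);

// Inject a custom status code, again mutating the same array.
STATUS_CODES_BLOCKED.push(503);

console.log(STATUS_CODES_BLOCKED); // [ 401, 429, 503 ]
```

Reassigning the variable (`STATUS_CODES_BLOCKED = [...]`) would not work, since the crawlers keep a reference to the original array.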
#1423 adds configurability for the session pool:

```js
const crawler = new CheerioCrawler({
    sessionPoolOptions: { blockedStatusCodes: [401, 403, 429, 500] },
    // ...
});
```
This will be available in Crawlee 3.0.2.
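Once that lands, the explicit-403 use case above could presumably be handled by clearing the blocked list and checking the status code in user code. A sketch, assuming an empty `blockedStatusCodes` array disables the session pool's automatic blocking entirely:

```js
const crawler = new CheerioCrawler({
    // Nothing is auto-blocked, so a 403 reaches the request handler.
    sessionPoolOptions: { blockedStatusCodes: [] },
    async requestHandler({ response, session }) {
        if (response.statusCode === 403) {
            // Handle the block explicitly, e.g. retire the session.
            session.retire();
            return;
        }
        // ...
    },
});
```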