
A library for crawling websites

9 http-crawler issues

@keaneokelley Can you give me a thumbs up/down if you'd like this merged?

E.g. https://travis-ci.org/inglesp/http-crawler/jobs/285774955

#12 means that we now ignore URL schemes that cannot be handled by `requests`, but we should still be able to identify mistyped URL schemes. See the discussion in #6.
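
A rough sketch of how such a check might look, assuming "mistyped" means any scheme that is neither fetchable by `requests` nor a well-known non-HTTP scheme (the scheme lists and the `classify_scheme` helper here are illustrative, not http-crawler's actual logic):

```python
from urllib.parse import urlsplit

FETCHABLE_SCHEMES = {"http", "https"}                       # what requests can fetch
KNOWN_OTHER_SCHEMES = {"", "mailto", "tel", "ftp", "javascript", "data"}

def classify_scheme(url):
    """Illustrative classification; not http-crawler's actual behaviour."""
    scheme = urlsplit(url).scheme.lower()
    if scheme in FETCHABLE_SCHEMES:
        return "fetch"
    if scheme in KNOWN_OTHER_SCHEMES:
        return "ignore"
    return "suspicious"  # e.g. a typo such as "htps://example.com"

print(classify_scheme("htps://example.com"))     # suspicious
print(classify_scheme("mailto:me@example.com"))  # ignore
```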

We currently extract links from HTML by looking for `src` and `href` attributes, and from CSS by looking for `@import` rules and `URI` tokens. A user might want to extract...
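
A minimal sketch of that kind of extraction, assuming the standard library's `html.parser` and a simple regex are representative (http-crawler's real implementation may differ):

```python
import re
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect values of src and href attributes from an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.links.append(value)

def extract_html_links(html_text):
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links

# Match url(...) tokens and @import "..." rules in CSS.
CSS_URL_RE = re.compile(
    r'url\(\s*["\']?([^"\')]+)["\']?\s*\)'
    r'|@import\s+["\']([^"\']+)["\']'
)

def extract_css_links(css_text):
    return [m.group(1) or m.group(2) for m in CSS_URL_RE.finditer(css_text)]
```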

Right now, I can use http-crawler to tell me about links that return non-20x errors. That could be for two reasons: 1. The page should exist, and it’s broken (in...
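
For context, "non-20x" here means any final status code outside the 2xx range, roughly as in this sketch (the `looks_broken` helper is hypothetical, not part of http-crawler):

```python
import requests

def looks_broken(url):
    # True for 404s, 500s, etc.; whether that is a genuine breakage or an
    # intentionally absent page is what this issue is asking to distinguish.
    response = requests.get(url)
    return not (200 <= response.status_code < 300)
```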

We currently use `requests`'s default behaviour of following redirects. A user might not always want this, as they might want to use the library to find unnecessary redirects on a...
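
A sketch of the alternative behaviour with plain `requests` (the option name in http-crawler, if one is added, might differ): with `allow_redirects=False`, the 3xx response itself is returned, so redirects can be reported rather than silently followed.

```python
import requests

response = requests.get("http://example.com/old-page", allow_redirects=False)
if 300 <= response.status_code < 400:
    # The redirect target is exposed in the Location header.
    print("redirects to", response.headers.get("Location"))
```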

Added a feature to choose whether or not to follow redirects.

We currently follow all links, but in some cases this might not be appropriate. We should find a way to allow the user to configure which links to follow.
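
One possible shape for this, sketched with a hypothetical `should_follow` callback (not an existing http-crawler parameter):

```python
from urllib.parse import urlsplit

def should_follow(url):
    """User-supplied predicate: skip logout links and static assets, say."""
    path = urlsplit(url).path
    return "logout" not in path and not path.startswith("/static/")

# crawl(base_url, should_follow=should_follow)  # hypothetical API
```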

We currently extract links from all pages that are on the same domain as the original URL that is passed to `crawl`. This might be too narrow (for instance, a...
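
For reference, the current rule roughly amounts to an exact host comparison, which is why it can be too narrow (subdomains, for example, are excluded). A sketch assuming "same domain" means an identical netloc:

```python
from urllib.parse import urlsplit

def same_domain(url, base_url):
    return urlsplit(url).netloc == urlsplit(base_url).netloc

print(same_domain("http://example.com/about", "http://example.com/"))  # True
print(same_domain("http://blog.example.com/", "http://example.com/"))  # False: subdomain excluded
```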