
A library for crawling websites

9 http-crawler issues

@keaneokelley Can you give me a thumbs up/down if you'd like this merged?

E.g. https://travis-ci.org/inglesp/http-crawler/jobs/285774955

#12 means that we now ignore URL schemes that cannot be handled by `requests`, but we should still be able to identify mistyped URL schemes. See the discussion in #6.
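
A rough sketch of how such a check might look, assuming "mistyped" means any scheme that is neither fetchable by `requests` nor a well-known non-HTTP scheme (the scheme lists and the `classify_scheme` helper here are illustrative, not http-crawler's actual logic):

```python
from urllib.parse import urlsplit

FETCHABLE_SCHEMES = {"http", "https"}                       # what requests can fetch
KNOWN_OTHER_SCHEMES = {"", "mailto", "tel", "ftp", "javascript", "data"}

def classify_scheme(url):
    """Illustrative classification; not http-crawler's actual behaviour."""
    scheme = urlsplit(url).scheme.lower()
    if scheme in FETCHABLE_SCHEMES:
        return "fetch"
    if scheme in KNOWN_OTHER_SCHEMES:
        return "ignore"
    return "suspicious"  # e.g. a typo such as "htps://example.com"

print(classify_scheme("htps://example.com"))     # suspicious
print(classify_scheme("mailto:me@example.com"))  # ignore
```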

We currently extract links from HTML by looking for `src` and `href` attributes, and from CSS by looking for `@import` rules and `URI` tokens. A user might want to extract...
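
A minimal sketch of that kind of extraction, assuming the standard library's `html.parser` and a simple regex are representative (http-crawler's real implementation may differ):

```python
import re
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect values of src and href attributes from an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.links.append(value)

def extract_html_links(html_text):
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links

# Match url(...) tokens and @import "..." rules in CSS.
CSS_URL_RE = re.compile(
    r'url\(\s*["\']?([^"\')]+)["\']?\s*\)'
    r'|@import\s+["\']([^"\']+)["\']'
)

def extract_css_links(css_text):
    return [m.group(1) or m.group(2) for m in CSS_URL_RE.finditer(css_text)]
```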

Right now, I can use http-crawler to tell me about links that return non-20x errors. That could be for two reasons: 1. The page should exist, and it’s broken (in...
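
For context, "non-20x" here means any final status code outside the 2xx range, roughly as in this sketch (the `looks_broken` helper is hypothetical, not part of http-crawler):

```python
import requests

def looks_broken(url):
    # True for 404s, 500s, etc.; whether that is a genuine breakage or an
    # intentionally absent page is what this issue is asking to distinguish.
    response = requests.get(url)
    return not (200 <= response.status_code < 300)
```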

We currently use `requests`'s default behaviour of following redirects. A user might not always want this, as they might want to use the library to find unnecessary redirects on a...
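
A sketch of the alternative behaviour with plain `requests` (the option name in http-crawler, if one is added, might differ): with `allow_redirects=False`, the 3xx response itself is returned, so redirects can be reported rather than silently followed.

```python
import requests

response = requests.get("http://example.com/old-page", allow_redirects=False)
if 300 <= response.status_code < 400:
    # The redirect target is exposed in the Location header.
    print("redirects to", response.headers.get("Location"))
```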

Added a feature to choose whether or not to follow redirects.

We currently follow all links, but in some cases this might not be appropriate. We should find a way to allow the user to configure which links to follow.
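
One possible shape for this, sketched with a hypothetical `should_follow` callback (not an existing http-crawler parameter):

```python
from urllib.parse import urlsplit

def should_follow(url):
    """User-supplied predicate: skip logout links and static assets, say."""
    path = urlsplit(url).path
    return "logout" not in path and not path.startswith("/static/")

# crawl(base_url, should_follow=should_follow)  # hypothetical API
```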

We currently extract links from all pages that are on the same domain as the original URL that is passed to `crawl`. This might be too narrow (for instance, a...
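
For reference, the current rule roughly amounts to an exact host comparison, which is why it can be too narrow (subdomains, for example, are excluded). A sketch assuming "same domain" means an identical netloc:

```python
from urllib.parse import urlsplit

def same_domain(url, base_url):
    return urlsplit(url).netloc == urlsplit(base_url).netloc

print(same_domain("http://example.com/about", "http://example.com/"))  # True
print(same_domain("http://blog.example.com/", "http://example.com/"))  # False: subdomain excluded
```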