Scott Mansfield
The technical excellence org is looking for its first contribution. This is a very important decision because we need to set the bar very high for projects to be included....
from @truthpickle via livecoding.tv: The crawler should keep track of the total bandwidth used per domain and limit it to a specified amount in a specified...
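A minimal sketch of the bookkeeping side of this, assuming a fixed byte budget per window; the class and method names here are hypothetical, and a scheduled task would call resetWindow() at each window boundary:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: tracks bytes fetched per domain and reports
// whether a domain has exceeded its budget for the current window.
public class BandwidthTracker {
    private final ConcurrentHashMap<String, AtomicLong> used = new ConcurrentHashMap<>();
    private final long limitBytes;

    public BandwidthTracker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    // Record bytes transferred for a domain; returns the new running total.
    public long record(String domain, long bytes) {
        return used.computeIfAbsent(domain, d -> new AtomicLong()).addAndGet(bytes);
    }

    // Fetch workers would check this before dequeueing work for the domain.
    public boolean isOverLimit(String domain) {
        AtomicLong total = used.get(domain);
        return total != null && total.get() > limitBytes;
    }

    // Called at the start of each window (e.g. once per minute).
    public void resetWindow() {
        used.clear();
    }
}
```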
In Fetch and Parse there are outbound requests built via Invocation.blah that have no User-Agent header. One in Fetch is fixed, but others are still missing. The same applies to the Terminator and Exo...
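One way to make this hard to miss again is to route every outbound request through a single factory that always sets the header. The project's code uses JAX-RS invocation builders; the sketch below uses java.net.http instead just to show the idea, and the UA string is a made-up placeholder:

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch: centralize the crawler's User-Agent so no outbound request
// can be built without it. The UA string here is hypothetical.
public class Requests {
    public static final String USER_AGENT = "WidowCrawler/1.0 (+http://example.com/bot)";

    // All fetch/parse code would go through this factory instead of
    // building requests ad hoc.
    public static HttpRequest get(String url) {
        return HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .GET()
                .build();
    }
}
```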
The fetch, parse, and (maybe) index pages should heavily use caching to prevent duplication of work. The cache right now is either a frail connection to a single EC2 instance,...
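The core of the dedup idea can be sketched as a cache keyed by URL that computes each result at most once. This is a process-local stand-in only; the class name is hypothetical, and a real deployment would back it with a shared store rather than the single fragile instance mentioned above:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: a process-local cache that deduplicates work per URL.
public class WorkCache<V> {
    private final ConcurrentHashMap<String, V> cache = new ConcurrentHashMap<>();

    // Returns the cached result for the URL, computing it at most once.
    public V getOrCompute(String url, Function<String, V> work) {
        return cache.computeIfAbsent(url, work);
    }

    public int size() {
        return cache.size();
    }
}
```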
If nofollow, don't enqueue the page's links to fetch. If noindex, don't index the current page. Default to "index, follow", meaning index the current page and send links to the...
https://support.google.com/webmasters/answer/156184?hl=en
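The directive handling above can be sketched as a small parser over the robots meta content attribute, defaulting to index + follow when the tag is absent ("none" is shorthand for "noindex, nofollow" per the Google documentation linked above):

```java
import java.util.Locale;

// Sketch of robots-meta handling: default is "index, follow"; noindex
// suppresses indexing of the current page, nofollow suppresses enqueueing
// the page's links.
public class RobotsMeta {
    public final boolean index;
    public final boolean follow;

    private RobotsMeta(boolean index, boolean follow) {
        this.index = index;
        this.follow = follow;
    }

    // content is the value of <meta name="robots" content="...">; may be null.
    public static RobotsMeta parse(String content) {
        boolean index = true, follow = true;
        if (content != null) {
            for (String token : content.toLowerCase(Locale.ROOT).split(",")) {
                switch (token.trim()) {
                    case "noindex":
                        index = false;
                        break;
                    case "nofollow":
                        follow = false;
                        break;
                    case "none": // shorthand for "noindex, nofollow"
                        index = false;
                        follow = false;
                        break;
                    default:
                        // "index", "follow", "all", unknown tokens: keep defaults
                        break;
                }
            }
        }
        return new RobotsMeta(index, follow);
    }
}
```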
The robots.txt rules should survive restarts and be per-domain. See http://www.robotstxt.org/robotstxt.html for some examples. I didn't find any standard Java parsers online in a quick search, so maybe a custom...
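If it does end up custom, a minimal version is not much code. This sketch handles only prefix-match Disallow rules in "User-agent: *" groups; a real parser would also need per-agent groups, Allow lines, and per-domain persistence so the rules survive restarts:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal robots.txt sketch: Disallow rules for the "*" user-agent only.
public class RobotsTxt {
    private final List<String> disallowed = new ArrayList<>();

    public static RobotsTxt parse(String body) {
        RobotsTxt r = new RobotsTxt();
        boolean applies = false;
        for (String raw : body.split("\\R")) {
            String line = raw.replaceFirst("#.*", "").trim(); // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                applies = value.equals("*");
            } else if (applies && field.equals("disallow") && !value.isEmpty()) {
                r.disallowed.add(value);
            }
        }
        return r;
    }

    // Prefix match, as in the original robots.txt convention.
    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```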
See: com.widowcrawler.fetch.FetchWorker Can we get better timing about the request at a lower level? The timing should be as close to on-the-wire time as possible.
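True on-the-wire timing would need hooks inside the HTTP client (connection/socket listeners), but the bookkeeping side can be sketched now: take a nanoTime stamp at each phase boundary (connect, first byte, body) instead of timing the whole call as one blob. The class name and phase labels here are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: per-phase request timing via nanoTime stamps between phases.
// The phase deltas always sum to the total elapsed time.
public class PhaseTimer {
    private final long start = System.nanoTime();
    private long last = start;
    private final Map<String, Long> phases = new LinkedHashMap<>();

    // Record the time spent since the previous mark under this phase name.
    public void mark(String phase) {
        long now = System.nanoTime();
        phases.put(phase, now - last);
        last = now;
    }

    public long totalNanos() {
        return last - start;
    }

    public Map<String, Long> phases() {
        return phases;
    }
}
```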
The @import CSS "directive" can cause yet another HTTP call to yet another endpoint. The size of that response should be included in the total size metric for each page.
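Finding those extra endpoints is a matter of pulling the @import targets out of each fetched stylesheet. A regex-based sketch (class name hypothetical; it covers the common quoted and url() forms but not every CSS edge case):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extract @import targets from a stylesheet so their responses
// can be fetched and counted toward the page's total-size metric.
public class CssImports {
    // Matches @import "x.css"; and @import url(x.css); with ' or " quoting.
    private static final Pattern IMPORT = Pattern.compile(
            "@import\\s+(?:url\\(\\s*)?['\"]?([^'\"\\s)]+)['\"]?\\s*\\)?",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extract(String css) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMPORT.matcher(css);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }
}
```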
Need to recognize "prefix:" links (e.g. "mailto:", "tel:") in a generic fashion and have a plugin parser for each prefix
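One generic shape for this is a registry of handlers keyed by the link's scheme prefix, so new prefixes can be supported without touching the core parser. All names below are hypothetical, and the handler type (String in, String out) is a placeholder for whatever the parse pipeline actually passes around:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical plugin registry: link handlers are keyed by the URI
// prefix ("mailto", "tel", ...) and dispatched generically.
public class LinkHandlers {
    private final Map<String, Function<String, String>> handlers = new ConcurrentHashMap<>();

    public void register(String prefix, Function<String, String> handler) {
        handlers.put(prefix.toLowerCase(), handler);
    }

    // Dispatches a "prefix:rest" link to the registered handler, if any.
    public Optional<String> handle(String link) {
        int colon = link.indexOf(':');
        if (colon <= 0) return Optional.empty();
        Function<String, String> h = handlers.get(link.substring(0, colon).toLowerCase());
        return h == null ? Optional.empty() : Optional.of(h.apply(link.substring(colon + 1)));
    }
}
```

Unregistered prefixes and relative links fall through to empty, so the core parser can treat them with its default behavior.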