Scott Mansfield

Results 16 issues of Scott Mansfield

The technical excellence org is looking for its first contribution. This is a very important decision because we need to set the bar very high for projects to be included....

From @truthpickle via livecoding.tv: The crawler should be able to keep track of the total amount of bandwidth used per domain and limit it to a specified amount in a specified...

enhancement
fetch
parse
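A minimal sketch of per-domain bandwidth accounting, assuming the truncated request means "a specified amount of bytes per time window". The class name `DomainBandwidthBudget`, its parameters, and the fixed-window strategy are all hypothetical and not taken from the crawler's code; the caller passes the clock value in so the logic stays easy to test.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: fixed-window byte budget tracked separately per domain. */
final class DomainBandwidthBudget {
    private final long maxBytesPerWindow;
    private final long windowMillis;
    // domain -> {windowStartMillis, bytesUsedInWindow}
    private final Map<String, long[]> usage = new HashMap<>();

    DomainBandwidthBudget(long maxBytesPerWindow, long windowMillis) {
        this.maxBytesPerWindow = maxBytesPerWindow;
        this.windowMillis = windowMillis;
    }

    /** Records the bytes and returns true if they fit the domain's budget, else returns false. */
    synchronized boolean tryRecord(String domain, long bytes, long nowMillis) {
        long[] state = usage.computeIfAbsent(domain, d -> new long[] {nowMillis, 0L});
        if (nowMillis - state[0] >= windowMillis) {
            // The previous window has expired, so start a fresh one.
            state[0] = nowMillis;
            state[1] = 0L;
        }
        if (state[1] + bytes > maxBytesPerWindow) {
            return false;
        }
        state[1] += bytes;
        return true;
    }
}
```

A fetch worker could call `tryRecord` before (or after) each download and requeue the URL for later when it returns false. Note that this in-memory map would not survive restarts or be shared across workers.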

In Fetch and Parse, there are outbound requests made with Invocation.blah that send no User-Agent header. One in Fetch is fixed, but others are still missing it. As well the Terminator and Exo...

enhancement
Minor
fetch
parse

The fetch, parse, and (maybe) index pages should heavily use caching to prevent duplication of work. The cache right now is either a frail connection to a single EC2 instance,...

enhancement
Major
fetch
parse
index

If nofollow, then don't enqueue pages to fetch. If noindex, then don't index the current page. Default to index, follow, meaning index the current page and send link to the...

enhancement
parse
Medium
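A rough sketch of how the directives described above might be interpreted, assuming the parser already has the `content` attribute of a robots meta tag as a string. `RobotsDirectives` is a hypothetical name; treating `none` as shorthand for `noindex, nofollow` follows the common convention and is an addition to what the issue text states.

```java
import java.util.Locale;

/** Hypothetical helper: interprets the content attribute of a robots meta tag. */
final class RobotsDirectives {
    final boolean index;
    final boolean follow;

    private RobotsDirectives(boolean index, boolean follow) {
        this.index = index;
        this.follow = follow;
    }

    /** Defaults to index, follow when the tag is absent (null) or has no relevant tokens. */
    static RobotsDirectives parse(String content) {
        boolean index = true;
        boolean follow = true;
        if (content != null) {
            for (String token : content.toLowerCase(Locale.ROOT).split(",")) {
                switch (token.trim()) {
                    case "noindex":
                        index = false;
                        break;
                    case "nofollow":
                        follow = false;
                        break;
                    case "none": // assumed shorthand for noindex, nofollow
                        index = false;
                        follow = false;
                        break;
                    default:
                        break; // explicit "index"/"follow" and unknown tokens keep the defaults
                }
            }
        }
        return new RobotsDirectives(index, follow);
    }
}
```

The parse stage could then skip enqueueing links when `follow` is false and skip handing the page to the indexer when `index` is false.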

https://support.google.com/webmasters/answer/156184?hl=en

Major
fetch
parse

The robots.txt rules should survive restarts and be per-domain. See http://www.robotstxt.org/robotstxt.html for some examples. I didn't find any standard Java parsers online in a quick search, so maybe a custom...

bug
enhancement
Major
fetch
parse

See: com.widowcrawler.fetch.FetchWorker. Can we get better timing for the request at a lower level? The timing should be as close to on-the-wire time as possible.

enhancement
Minor
fetch

The @import CSS "directive" can cause yet another HTTP call to yet another endpoint. It should be included in the total size metric for each page.

bug
enhancement
parse
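One possible starting point for finding `@import` targets, sketched with a regular expression rather than a real CSS parser. `CssImports` is a hypothetical name, and the pattern is an assumption that covers the common `url(...)` and quoted-string forms only; it does not skip CSS comments or resolve relative URLs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the targets of @import rules from a stylesheet. */
final class CssImports {
    // Matches: @import url("x"), @import url(x), @import "x", @import 'x'
    private static final Pattern IMPORT = Pattern.compile(
            "@import\\s*(?:url\\(\\s*)?[\"']?([^\"')\\s;]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the raw import targets in the order they appear. */
    static List<String> extract(String css) {
        List<String> targets = new ArrayList<>();
        Matcher m = IMPORT.matcher(css);
        while (m.find()) {
            targets.add(m.group(1));
        }
        return targets;
    }
}
```

Each returned target would still need to be resolved against the stylesheet's URL and fetched so that its size can be added to the page total; imported stylesheets can themselves contain further `@import` rules.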

Need to recognize "prefix:" links in a generic fashion and have a plugin parser for them.

bug
enhancement
Major
parse
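A small sketch of the idea, assuming "prefix:" means a URI scheme such as `mailto:` or `tel:`. The `LinkSchemeRegistry` class and the `Function`-based plugin shape are hypothetical; the scheme pattern follows the generic URI syntax (a letter followed by letters, digits, `+`, `.`, or `-`, then a colon).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical registry: detects a link's scheme and dispatches to a registered handler. */
final class LinkSchemeRegistry {
    private static final Pattern SCHEME = Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*):");
    private final Map<String, Function<String, String>> handlers = new HashMap<>();

    /** Returns the lower-cased scheme, or empty for relative links without one. */
    static Optional<String> schemeOf(String link) {
        Matcher m = SCHEME.matcher(link.trim());
        return m.find() ? Optional.of(m.group(1).toLowerCase(Locale.ROOT)) : Optional.empty();
    }

    /** Registers a plugin for one scheme; a later registration replaces an earlier one. */
    void register(String scheme, Function<String, String> handler) {
        handlers.put(scheme.toLowerCase(Locale.ROOT), handler);
    }

    /** Applies the matching handler, or returns empty when no plugin knows the scheme. */
    Optional<String> handle(String link) {
        return schemeOf(link).map(handlers::get).map(h -> h.apply(link));
    }
}
```

For example, a plugin registered under `mailto` could strip the prefix and record the address, while links with unregistered schemes fall through as empty and can be logged or ignored.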