Scott Mansfield
The technical excellence org is looking for its first contribution. This is a very important decision because we need to set the bar very high for projects to be included....
from @truthpickle via livecoding.tv: The crawler should keep track of the total bandwidth used per domain and limit it to a specified amount in a specified...
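A minimal sketch of the bookkeeping side of this, assuming a fixed byte budget per window; the class and method names here are hypothetical, and a scheduled task would call resetWindow() at each window boundary:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: tracks bytes fetched per domain and reports
// whether a domain has exceeded its budget for the current window.
public class BandwidthTracker {
    private final ConcurrentHashMap<String, AtomicLong> used = new ConcurrentHashMap<>();
    private final long limitBytes;

    public BandwidthTracker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    // Record bytes transferred for a domain; returns the new running total.
    public long record(String domain, long bytes) {
        return used.computeIfAbsent(domain, d -> new AtomicLong()).addAndGet(bytes);
    }

    // Fetch workers would check this before dequeueing work for the domain.
    public boolean isOverLimit(String domain) {
        AtomicLong total = used.get(domain);
        return total != null && total.get() > limitBytes;
    }

    // Called at the start of each window (e.g. once per minute).
    public void resetWindow() {
        used.clear();
    }
}
```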
In Fetch and Parse there are outbound requests built via Invocation.blah that have no User-Agent header. One in Fetch is fixed, but others are still missing. The same applies to the Terminator and Exo...
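One way to make this hard to miss again is to route every outbound request through a single factory that always sets the header. The project's code uses JAX-RS invocation builders; the sketch below uses java.net.http instead just to show the idea, and the UA string is a made-up placeholder:

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch: centralize the crawler's User-Agent so no outbound request
// can be built without it. The UA string here is hypothetical.
public class Requests {
    public static final String USER_AGENT = "WidowCrawler/1.0 (+http://example.com/bot)";

    // All fetch/parse code would go through this factory instead of
    // building requests ad hoc.
    public static HttpRequest get(String url) {
        return HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .GET()
                .build();
    }
}
```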
The fetch, parse, and (maybe) index pages should heavily use caching to prevent duplication of work. The cache right now is either a frail connection to a single EC2 instance,...
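The core of the dedup idea can be sketched as a cache keyed by URL that computes each result at most once. This is a process-local stand-in only; the class name is hypothetical, and a real deployment would back it with a shared store rather than the single fragile instance mentioned above:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: a process-local cache that deduplicates work per URL.
public class WorkCache<V> {
    private final ConcurrentHashMap<String, V> cache = new ConcurrentHashMap<>();

    // Returns the cached result for the URL, computing it at most once.
    public V getOrCompute(String url, Function<String, V> work) {
        return cache.computeIfAbsent(url, work);
    }

    public int size() {
        return cache.size();
    }
}
```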
If nofollow, don't enqueue the page's links to fetch. If noindex, don't index the current page. Default to "index, follow", meaning index the current page and send links to the...
https://support.google.com/webmasters/answer/156184?hl=en
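The directive handling above can be sketched as a small parser over the robots meta content attribute, defaulting to index + follow when the tag is absent ("none" is shorthand for "noindex, nofollow" per the Google documentation linked above):

```java
import java.util.Locale;

// Sketch of robots-meta handling: default is "index, follow"; noindex
// suppresses indexing of the current page, nofollow suppresses enqueueing
// the page's links.
public class RobotsMeta {
    public final boolean index;
    public final boolean follow;

    private RobotsMeta(boolean index, boolean follow) {
        this.index = index;
        this.follow = follow;
    }

    // content is the value of <meta name="robots" content="...">; may be null.
    public static RobotsMeta parse(String content) {
        boolean index = true, follow = true;
        if (content != null) {
            for (String token : content.toLowerCase(Locale.ROOT).split(",")) {
                switch (token.trim()) {
                    case "noindex":
                        index = false;
                        break;
                    case "nofollow":
                        follow = false;
                        break;
                    case "none": // shorthand for "noindex, nofollow"
                        index = false;
                        follow = false;
                        break;
                    default:
                        // "index", "follow", "all", unknown tokens: keep defaults
                        break;
                }
            }
        }
        return new RobotsMeta(index, follow);
    }
}
```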
The robots.txt rules should survive restarts and be per-domain. See http://www.robotstxt.org/robotstxt.html for some examples. I didn't find any standard Java parsers online in a quick search, so maybe a custom...
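If it does end up custom, a minimal version is not much code. This sketch handles only prefix-match Disallow rules in "User-agent: *" groups; a real parser would also need per-agent groups, Allow lines, and per-domain persistence so the rules survive restarts:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal robots.txt sketch: Disallow rules for the "*" user-agent only.
public class RobotsTxt {
    private final List<String> disallowed = new ArrayList<>();

    public static RobotsTxt parse(String body) {
        RobotsTxt r = new RobotsTxt();
        boolean applies = false;
        for (String raw : body.split("\\R")) {
            String line = raw.replaceFirst("#.*", "").trim(); // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                applies = value.equals("*");
            } else if (applies && field.equals("disallow") && !value.isEmpty()) {
                r.disallowed.add(value);
            }
        }
        return r;
    }

    // Prefix match, as in the original robots.txt convention.
    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```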
See: com.widowcrawler.fetch.FetchWorker Can we get better timing about the request at a lower level? The timing should be as close to on-the-wire time as possible.
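True on-the-wire timing would need hooks inside the HTTP client (connection/socket listeners), but the bookkeeping side can be sketched now: take a nanoTime stamp at each phase boundary (connect, first byte, body) instead of timing the whole call as one blob. The class name and phase labels here are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: per-phase request timing via nanoTime stamps between phases.
// The phase deltas always sum to the total elapsed time.
public class PhaseTimer {
    private final long start = System.nanoTime();
    private long last = start;
    private final Map<String, Long> phases = new LinkedHashMap<>();

    // Record the time spent since the previous mark under this phase name.
    public void mark(String phase) {
        long now = System.nanoTime();
        phases.put(phase, now - last);
        last = now;
    }

    public long totalNanos() {
        return last - start;
    }

    public Map<String, Long> phases() {
        return phases;
    }
}
```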
The @import CSS "directive" can cause yet another HTTP call to yet another endpoint. The size of that response should be included in the total size metric for each page.
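Finding those extra endpoints is a matter of pulling the @import targets out of each fetched stylesheet. A regex-based sketch (class name hypothetical; it covers the common quoted and url() forms but not every CSS edge case):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extract @import targets from a stylesheet so their responses
// can be fetched and counted toward the page's total-size metric.
public class CssImports {
    // Matches @import "x.css"; and @import url(x.css); with ' or " quoting.
    private static final Pattern IMPORT = Pattern.compile(
            "@import\\s+(?:url\\(\\s*)?['\"]?([^'\"\\s)]+)['\"]?\\s*\\)?",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extract(String css) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMPORT.matcher(css);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }
}
```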
Need to recognize "prefix:" links (e.g. "mailto:", "tel:") in a generic fashion and have a plugin parser for each prefix
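One generic shape for this is a registry of handlers keyed by the link's scheme prefix, so new prefixes can be supported without touching the core parser. All names below are hypothetical, and the handler type (String in, String out) is a placeholder for whatever the parse pipeline actually passes around:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical plugin registry: link handlers are keyed by the URI
// prefix ("mailto", "tel", ...) and dispatched generically.
public class LinkHandlers {
    private final Map<String, Function<String, String>> handlers = new ConcurrentHashMap<>();

    public void register(String prefix, Function<String, String> handler) {
        handlers.put(prefix.toLowerCase(), handler);
    }

    // Dispatches a "prefix:rest" link to the registered handler, if any.
    public Optional<String> handle(String link) {
        int colon = link.indexOf(':');
        if (colon <= 0) return Optional.empty();
        Function<String, String> h = handlers.get(link.substring(0, colon).toLowerCase());
        return h == null ? Optional.empty() : Optional.of(h.apply(link.substring(colon + 1)));
    }
}
```

Unregistered prefixes and relative links fall through to empty, so the core parser can treat them with its default behavior.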