
Support for a flexible focused language crawling framework

Open thammegowda opened this issue 8 years ago • 3 comments

The first task is defining and expressing the focused crawling specification. The second task is implementing that specification in Sparkler.

Currently, we support URL-based focus/filters. This has to be extended with content-based focus.

Example tasks:

  1. "Crawl top news in Kannada language"
  2. "Crawl sports news in XYZ language"
  3. "Crawl cooking blogs that are in XYZ language"
  4. "Crawl poetry or song lyrics in XYZ language"
  5. "Crawl news about earthquakes in XYZ language"

Sparkler should be able to express and accept this kind of 'focus' requirement, which is a combination of two filters:

  1. language filter, often for rare languages (i.e. languages that are not supported by Google Translate). There are a few thousand of these.
  2. domain filter, such as cooking, news, sports news, etc. Maybe a few tens or hundreds at most.
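
To make the two-filter idea concrete, here is a minimal sketch in Python of how such a focus requirement could be expressed and applied to page content. This is not Sparkler's API: `FocusSpec`, `Page`, `make_focus_filter`, and the two toy detectors are hypothetical names used only for illustration. A real setup would plug in a proper language identifier and a domain classifier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FocusSpec:
    """Hypothetical focus requirement: one language plus one domain."""
    language: str  # e.g. "kn" for Kannada
    domain: str    # e.g. "news", "cooking"


@dataclass(frozen=True)
class Page:
    url: str
    text: str


def make_focus_filter(
    spec: FocusSpec,
    detect_language: Callable[[str], str],
    classify_domain: Callable[[str], str],
) -> Callable[[Page], bool]:
    """Combine a language filter and a domain filter into one content-based check."""
    def accept(page: Page) -> bool:
        return (
            detect_language(page.text) == spec.language
            and classify_domain(page.text) == spec.domain
        )
    return accept


def toy_detect_language(text: str) -> str:
    """Toy stand-in: report "kn" if any character is in the Kannada Unicode block."""
    if any("\u0c80" <= ch <= "\u0cff" for ch in text):
        return "kn"
    return "und"  # undetermined


# Toy stand-in keyword lists; a real classifier would be trained per domain.
TOY_DOMAIN_KEYWORDS = {
    "news": ("ಸುದ್ದಿ", "news"),
    "cooking": ("recipe",),
}


def toy_classify_domain(text: str) -> str:
    """Toy stand-in: return the first domain whose keyword appears in the text."""
    lowered = text.lower()
    for domain, keywords in TOY_DOMAIN_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return domain
    return "unknown"
```

With these pieces, "Crawl top news in Kannada language" becomes `make_focus_filter(FocusSpec("kn", "news"), toy_detect_language, toy_classify_domain)`, and the crawler would keep only the pages for which the returned function is true. Keeping the detectors as parameters means the thousands of languages and the tens or hundreds of domains can vary independently.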

thammegowda avatar Nov 15 '17 00:11 thammegowda

Great job, Thamme! If I may, this is "focused language crawling" as opposed to, e.g., "focused multimedia crawling" or "web page crawling". We should update the issue title to reflect that. Great job filing the issue.

chrismattmann avatar Nov 15 '17 05:11 chrismattmann

Thanks for the suggestion. The title is now updated 👍 Focused crawling is needed by everybody, but no existing crawler seems to do it right. We/Sparkler now have the thinking cap on for this task, and we will propose a good solution for languages, multimedia, etc.

thammegowda avatar Nov 15 '17 06:11 thammegowda

Yeah - this could be really cool!

wmburke avatar Nov 15 '17 15:11 wmburke