
Crawling curated list of sites: Data Sourcing Candidate seeds spreadsheet

yjernite opened this issue on Dec 2, 2021 · 0 comments

We want to be able to obtain all web and media content associated with a specific list of pre-identified domain names.

This issue tracks potential crawling seeds identified by BigScience Data Sourcing participants, primarily in Spanish and Southeast Asian English (plus three in Chinese).

The steps to follow are:

1. filter CommonCrawl (or another archive) for all WARC records matching one of the given domain names
   - filter all dumps from the last two years
2. obtain overall metrics and metrics per domain name (see the index-query sketch after this list)
   - page counts, content languages, content types, etc.
3. upload all of the relevant WARC records for each domain name to a HF dataset in the BigScience Catalogue Data Organization (see the WARC-fetch sketch below)
   - minimal filtering of WARC records to include human-readable pages AND pages that reference links to objects we want to download (e.g. PDFs)
   - extract the HTML tags corresponding to all URLs in the WARC entries
   - optional: post-process the above list to identify outgoing links and extract their domain names and content types
   - optional: run text extraction

In particular, the list of domain names mentioned in outgoing links may be used to obtain a "depth 1 pseudo-crawl" by running the same process again.
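A rough sketch of how that next seed list could be derived: collect the hostnames of outgoing links and keep those not already in the current seed set. `next_seeds` is a hypothetical helper; a production version would likely use something like `tldextract` to normalize hosts to registered domains.

```python
"""Turn outgoing links into a depth-1 seed list (sketch)."""
from urllib.parse import urlparse

def next_seeds(links: list[str], known: set[str]) -> set[str]:
    """Hostnames linked from the current crawl that aren't already seeds."""
    hosts = {
        urlparse(link).netloc.lower().removeprefix("www.")
        for link in links
    }
    return {h for h in hosts if h and h not in known}
```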

cc @sebastian-nagel
