Crawling curated list of sites: BigScience catalog app URLs
We want to be able to obtain all web and media content associated with a specific list of pre-identified domain names. This issue tracks the domain names identified in the BigScience Data Cataloging Event.
The steps to follow are:
- filter Common Crawl (or another archive) for all WARC records matching one of the given domain names (a CDX index sketch follows this list)
- filter all dumps from the last two years
- obtain overall metrics and metrics per domain name
  - page counts, content languages, content types, etc.
- upload all of the relevant WARC records for each domain name to an HF dataset in the BigScience Catalogue Data Organization (see the upload sketch below)
- minimal filtering of WARC records to keep human-readable pages AND pages that link to objects we want to download (e.g. PDFs)
- extract the HTML tags corresponding to all URLs in the WARC entries (see the record-filtering sketch below)
- optional: post-process the above list to identify outgoing links and extract their domain names and content types
- optional: run text extraction (both optional steps are sketched below)
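
As a minimal sketch of the filtering and metrics steps, assuming the public CDX index API at index.commoncrawl.org (for the full domain list, the columnar index on S3 queried via Athena or Spark would scale better). The crawl IDs and the `mime-detected` field name are assumptions to check against the live index, and large domains would also need the API's pagination parameters, omitted here:

```python
import json
from collections import Counter

import requests

# Example crawl IDs covering roughly the last two years; the authoritative
# list is served at https://index.commoncrawl.org/collinfo.json
CRAWLS = ["CC-MAIN-2021-49", "CC-MAIN-2022-05"]

def cdx_records(domain, crawl):
    """Yield one index record (dict) per capture of `domain` in `crawl`."""
    api = f"https://index.commoncrawl.org/{crawl}-index"
    # The "*.domain" pattern asks the pywb CDX server for a domain match
    resp = requests.get(api, params={"url": f"*.{domain}", "output": "json"},
                        timeout=60)
    if resp.status_code == 404:  # domain absent from this crawl
        return
    resp.raise_for_status()
    for line in resp.text.splitlines():
        yield json.loads(line)

# Per-domain metrics: page count and content-type distribution
pages, mimes = 0, Counter()
for crawl in CRAWLS:
    for rec in cdx_records("example.org", crawl):
        pages += 1
        mimes[rec.get("mime-detected", "unknown")] += 1
print("example.org:", pages, "captures;", mimes.most_common(5))
```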
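
For the minimal record filtering and the tag extraction, a sketch that fetches one record by byte range (the `filename`/`offset`/`length` fields come straight from the CDX records above), keeps it only when it is an HTML response, and collects its outgoing hrefs in the same pass; assumes `warcio` and `beautifulsoup4`:

```python
import io

import requests
from bs4 import BeautifulSoup
from warcio.archiveiterator import ArchiveIterator

def fetch_warc_record(filename, offset, length):
    """Fetch one gzipped WARC record from Common Crawl by byte range."""
    end = int(offset) + int(length) - 1
    resp = requests.get(f"https://data.commoncrawl.org/{filename}",
                        headers={"Range": f"bytes={offset}-{end}"}, timeout=60)
    resp.raise_for_status()
    return resp.content  # one standalone gzip member

def html_and_links(raw):
    """Return (html, outgoing hrefs) for an HTML response record, else (None, [])."""
    for record in ArchiveIterator(io.BytesIO(raw)):
        if record.rec_type != "response":
            continue
        ctype = record.http_headers.get_header("Content-Type") or ""
        if "text/html" not in ctype:  # drop non-human-readable payloads
            continue
        html = record.content_stream().read()
        soup = BeautifulSoup(html, "html.parser")
        return html, [a["href"] for a in soup.find_all("a", href=True)]
    return None, []
```

Extending the last step to `<img src>`, `<script src>`, `<link href>`, etc. would cover the remaining URL-bearing tags.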
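
For the upload step, a hedged sketch using `huggingface_hub`; the organization and repository names are hypothetical placeholders to replace with the actual BigScience Catalogue Data org, and a token with write access (`huggingface-cli login`) is assumed:

```python
from huggingface_hub import HfApi

api = HfApi()
# Hypothetical repo name; one dataset repo per domain is one possible layout
repo_id = "bigscience-catalogue-data/pseudo_crawl_example_org"
api.create_repo(repo_id, repo_type="dataset", private=True, exist_ok=True)
api.upload_file(
    path_or_fileobj="example.org.warc.gz",  # locally assembled WARC slice
    path_in_repo="example.org.warc.gz",
    repo_id=repo_id,
    repo_type="dataset",
)
```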
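
For the two optional steps, a sketch that reduces outgoing links to (domain, content type) pairs, guessing the type from the URL suffix (a HEAD request would be authoritative), plus text extraction via `trafilatura`, which stands in here for whichever extractor we settle on:

```python
import mimetypes
from collections import Counter
from urllib.parse import urlparse

import trafilatura  # placeholder choice of extractor

def link_summary(hrefs):
    """Count outgoing links by (domain, guessed content type)."""
    out = Counter()
    for href in hrefs:
        parsed = urlparse(href)
        if not parsed.netloc:  # skip relative links and fragments
            continue
        ctype, _ = mimetypes.guess_type(parsed.path)
        out[(parsed.netloc.lower(), ctype or "text/html")] += 1
    return out

def extract_text(html):
    """Optional text extraction; returns None when no main content is found."""
    return trafilatura.extract(html)
```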
In particular, the list of domain names mentioned in outgoing links may be used to obtain a "depth-1 pseudo-crawl" by running the same process again (see the sketch below).
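
A sketch of that second pass, reusing the helpers above (the example links are purely illustrative):

```python
# Domains seen in outgoing links become the seed list for a depth-1 pass
outgoing = link_summary(["https://example.net/report.pdf", "https://example.com/"])
next_domains = sorted({domain for domain, _ctype in outgoing})
for domain in next_domains:
    for rec in cdx_records(domain, "CC-MAIN-2022-05"):
        ...  # same filtering / metrics / upload pipeline as for the seed list
```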
cc @sebastian-nagel
#self-assign