mwscrape
Download rendered articles from MediaWiki API to CouchDB
/mwscrape/ downloads rendered articles from MediaWiki sites via web API and stores them in CouchDB to enable further offline processing.
** Installation
/mwscrape/ depends on the following:
- [[http://couchdb.apache.org][CouchDB]] (1.3.0 or newer)
- [[http://python.org][Python 3]] (3.6 or newer)
Consult your operating system documentation and these projects' websites for installation instructions.
For example, on Ubuntu 18.04, the following command installs required packages:
#+BEGIN_SRC sh
sudo apt-get install python3 python3-venv
#+END_SRC
To install CouchDB, first enable the Apache CouchDB package repository:
#+BEGIN_SRC sh
echo "deb https://apache.bintray.com/couchdb-deb bionic main" | sudo tee -a /etc/apt/sources.list
#+END_SRC
Then install the repository key:
#+BEGIN_SRC sh
curl -L https://couchdb.apache.org/repo/bintray-pubkey.asc | sudo apt-key add -
#+END_SRC
And finally update the repository cache and install the package:
#+BEGIN_SRC sh
sudo apt-get update && sudo apt-get install couchdb
#+END_SRC
Alternatively, run CouchDB with [[https://www.docker.com/][docker]]:
#+BEGIN_SRC sh
docker run --detach --rm --name couchdb \
       -v $(pwd)/.couchdb:/opt/couchdb/data \
       -p 5984:5984 \
       couchdb:2
#+END_SRC
See [[https://hub.docker.com/_/couchdb/][CouchDB Docker image docs]] for more details.
Note that starting with CouchDB 3.0 an admin user must be set up. See [[https://docs.couchdb.org/en/stable/intro/security.html#creating-a-new-admin-user][CouchDB documentation]].
With docker:
#+BEGIN_SRC sh
docker run --detach --rm --name couchdb \
       -e COUCHDB_USER=admin \
       -e COUCHDB_PASSWORD=secret \
       -v $(pwd)/.couchdb:/opt/couchdb/data \
       -p 5984:5984 \
       couchdb:3
#+END_SRC
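To check that CouchDB is running and that the credentials work, query the ~_up~ endpoint (the ~admin~/~secret~ credentials here match the example above):
#+BEGIN_SRC sh
# Returns {"status":"ok"} once CouchDB is ready to serve requests
curl http://admin:secret@localhost:5984/_up
#+END_SRC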
By default CouchDB uses /snappy/ for file compression. Change the ~file_compression~ configuration parameter in the ~couchdb~ config section to /deflate_6/ (the maximum is /deflate_9/), for example through the configuration HTTP API as shown below. This reduces database disk space usage significantly.
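A minimal sketch of changing this setting over the configuration HTTP API, assuming a single-node CouchDB 2.x or later with the ~admin~/~secret~ credentials from the example above:
#+BEGIN_SRC sh
# Read the current value
curl http://admin:secret@localhost:5984/_node/_local/_config/couchdb/file_compression
# Set it to deflate_6 (the previous value is returned)
curl -X PUT http://admin:secret@localhost:5984/_node/_local/_config/couchdb/file_compression \
     -d '"deflate_6"'
#+END_SRC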
Create a new Python virtual environment:
#+BEGIN_SRC sh
python3 -m venv env-mwscrape
#+END_SRC
Activate it:
#+BEGIN_SRC sh
source env-mwscrape/bin/activate
#+END_SRC
Install /mwscrape/ from source:
#+BEGIN_SRC sh
pip install https://github.com/itkach/mwscrape/tarball/master
#+END_SRC
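After installation the ~mwscrape~ command is available in the activated virtual environment; running it with ~--help~ prints the options shown in the next section:
#+BEGIN_SRC sh
mwscrape --help
#+END_SRC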
** Usage
#+BEGIN_SRC sh
usage: mwscrape [-h] [--site-path SITE_PATH] [--site-ext SITE_EXT]
                [-c COUCH] [--db DB] [--titles TITLES [TITLES ...]]
                [--start START] [--changes-since CHANGES_SINCE]
                [--recent-days RECENT_DAYS] [--recent] [--timeout TIMEOUT]
                [-S] [-r [SESSION ID]]
                [--sessions-db-name SESSIONS_DB_NAME] [--desc]
                [--delete-not-found] [--speed {0,1,2,3,4,5}]
                [site]

positional arguments:
  site                  MediaWiki site to scrape (host name), e.g.
                        en.wikipedia.org

optional arguments:
  -h, --help            show this help message and exit
  --site-path SITE_PATH
                        MediaWiki site API path. Default: /w/
  --site-ext SITE_EXT   MediaWiki site API script extension. Default: .php
  -c COUCH, --couch COUCH
                        CouchDB server URL. Default: http://localhost:5984
  --db DB               CouchDB database name. If not specified, the name
                        will be derived from the MediaWiki host name.
  --titles TITLES [TITLES ...]
                        Download article pages with these names (titles). If
                        a name starts with @ it is interpreted as the name of
                        a file containing titles, one per line, UTF-8 encoded.
  --start START         Download all article pages beginning with this name
  --changes-since CHANGES_SINCE
                        Download all article pages that changed since the
                        specified time. Timestamp format is yyyymmddhhmmss.
                        See https://www.mediawiki.org/wiki/Timestamp. Hours,
                        minutes and seconds can be omitted.
  --recent-days RECENT_DAYS
                        Number of days to look back for recent changes
  --recent              Download recently changed articles only
  --timeout TIMEOUT     Network communications timeout. Default: 30.0s
  -S, --siteinfo-only   Fetch or update siteinfo, then exit
  -r [SESSION ID], --resume [SESSION ID]
                        Resume previous scrape session. This relies on stats
                        saved in the mwscrape database.
  --sessions-db-name SESSIONS_DB_NAME
                        Name of the database where session info is stored.
                        Default: mwscrape
  --desc                Request all pages in descending order
  --delete-not-found    Remove non-existing pages from the database
  --speed {0,1,2,3,4,5}
                        Scrape speed
#+END_SRC
The following examples are for CouchDB < 3.0 running in admin party mode.
To get English Wiktionary:
#+BEGIN_SRC sh
mwscrape en.wiktionary.org
#+END_SRC
To get the same but work through the list of titles in reverse order:
#+BEGIN_SRC sh
mwscrape en.wiktionary.org --desc
#+END_SRC
Some sites expose the MediaWiki API at a path different from Wikipedia's default; specify it with ~--site-path~:
#+BEGIN_SRC sh
mwscrape lurkmore.to --site-path=/
#+END_SRC
For CouchDB with admin user ~admin~ and password ~secret~, specify the credentials as part of the CouchDB URL:
#+BEGIN_SRC sh
mwscrape -c http://admin:secret@localhost:5984 en.wiktionary.org
#+END_SRC
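Specific pages can be fetched with ~--titles~; with the ~@~ prefix the titles are read from a file, one per line (the file name here is only an example):
#+BEGIN_SRC sh
# Fetch two specific pages
mwscrape en.wiktionary.org --titles water fire
# Fetch pages whose titles are listed in titles.txt (one per line, UTF-8 encoded)
mwscrape en.wiktionary.org --titles @titles.txt
#+END_SRC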
/mwscrape/ compares page revisions reported by the MediaWiki API with revisions of previously scraped pages in CouchDB and requests parsed page data only if a new revision is available.
/mwscrape/ also creates a CouchDB design document ~w~ with a show function ~html~ that allows viewing the article HTML returned by the MediaWiki API and navigating to the HTML of other collected articles. For example, to view the rendered HTML for article /A/ in the database /simple-wikipedia-org/, go to the following address in a web browser (assuming CouchDB is running on localhost):
http://127.0.0.1:5984/simple-wikipedia-org/_design/w/_show/html/A
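The raw CouchDB document for the same article (document IDs are the article titles) can also be fetched directly; a minimal example, assuming the database from above:
#+BEGIN_SRC sh
curl http://127.0.0.1:5984/simple-wikipedia-org/A
#+END_SRC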
If databases are combined via replication, articles with the same title will be stored as [[https://wiki.apache.org/couchdb/Replication_and_conflicts][conflicts]] (see the replication sketch at the end of this section). The /mwresolvec/ script is provided to merge the conflicting versions (combine aliases, select the highest MediaWiki article revision, discard other revisions). Usage:
#+BEGIN_SRC sh
mwresolvec [-h] [-s START] [-b BATCH_SIZE] [-w WORKERS] [-v] couch_url

positional arguments:
  couch_url

optional arguments:
  -h, --help            show this help message and exit
  -s START, --start START
  -b BATCH_SIZE, --batch-size BATCH_SIZE
  -w WORKERS, --workers WORKERS
  -v, --verbose
#+END_SRC
Example:
#+BEGIN_SRC sh
mwresolvec http://localhost:5984/en-m-wikipedia-org
#+END_SRC
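For reference, combining databases scraped on different machines is done with standard CouchDB replication before running /mwresolvec/; a minimal sketch using the ~_replicate~ endpoint (the host name ~otherhost~, the credentials and the database name are examples):
#+BEGIN_SRC sh
# Pull en-m-wikipedia-org from another machine into a local database of the same name
curl -X POST http://admin:secret@localhost:5984/_replicate \
     -H 'Content-Type: application/json' \
     -d '{"source": "http://otherhost:5984/en-m-wikipedia-org",
          "target": "http://admin:secret@localhost:5984/en-m-wikipedia-org",
          "create_target": true}'
#+END_SRC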