MediaWiki API Project: Language Equity (Foreign Languages)
Overview
Measure language equity on sensitive topics by comparing how quickly key Wikipedia pages are created and updated across languages. Output a reproducible dataset and dashboard that quantify coverage (existence), “time-to-translation,” and update lag versus a reference language (e.g., English).
Action Items
If this is the beginning (research & design)
- Define scope: start with ~50 English Wikipedia pages in sensitive domains (public health, migration, elections, climate disasters, human rights).
- Choose language set: top 30 Wikipedias by article count + 10 low-resource languages for contrast (via `sitematrix`), or all languages returned by each page's `langlinks`.
- Finalize metrics & windows: coverage status, time-to-first-presence (page exists Y/N and when), and update lag (delta between latest edit timestamps across languages).
- Methods plan: use the MediaWiki Action API (`prop=langlinks`, `prop=revisions`, `action=sitematrix`) and optionally the Pageviews REST API for context; decide on reference language(s) for lags.
- Tooling pairs: `requests` or `httpx`; `pandas` or `polars`; storage in `duckdb` or `sqlite`; viz in `Altair` or `Plotly`.
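The method plan above boils down to building query URLs against each wiki's `/w/api.php`. A stdlib-only sketch of a URL builder with the default `format`/`formatversion` parameters, plus a User-Agent constant (Wikimedia etiquette asks clients to identify themselves; the contact string is a placeholder) — swap in a `requests`/`httpx` session for the real pipeline:

```python
import urllib.parse

# Defaults every Action API call should carry.
BASE_PARAMS = {"format": "json", "formatversion": "2"}

# Wikimedia asks clients to send a descriptive User-Agent with contact info.
# Attach this as a header on every HTTP request (placeholder values).
USER_AGENT = "language-equity-study/0.1 (contact@example.org)"

def build_action_url(host: str, **params: str) -> str:
    """Build an Action API URL for the given wiki host."""
    query = {**BASE_PARAMS, **params}
    return f"https://{host}/w/api.php?" + urllib.parse.urlencode(query)

url = build_action_url("en.wikipedia.org", action="query",
                       prop="langlinks", titles="Migration crisis",
                       lllimit="max")
```

The same builder works for any language edition by swapping the host (e.g. `de.wikipedia.org`).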
If researched and ready (implementation steps)
- Seed topics
  - Create a `topics.csv` of English page titles and (optional) Wikidata QIDs for disambiguation.
- Enumerate languages & interlanguage links
  - For each English title: `prop=langlinks` to get language codes and titles; merge with `sitematrix` to validate wiki domains.
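The `langlinks` step returns one entry per language edition; a small parser keeps the later merge with `sitematrix` simple. A sketch assuming `formatversion=2` responses (where each link is `{"lang": ..., "title": ...}`); the sample response is fabricated for illustration:

```python
def extract_langlinks(resp: dict) -> dict:
    """Map language code -> target title from a prop=langlinks response
    (formatversion=2). Pages without langlinks contribute nothing."""
    links = {}
    for page in resp.get("query", {}).get("pages", []):
        for ll in page.get("langlinks", []):
            links[ll["lang"]] = ll["title"]
    return links

# Minimal fabricated response for illustration:
sample = {"query": {"pages": [
    {"pageid": 1, "title": "Cholera",
     "langlinks": [{"lang": "de", "title": "Cholera"},
                   {"lang": "fr", "title": "Choléra"}]}]}}

extract_langlinks(sample)  # {"de": "Cholera", "fr": "Choléra"}
```

For pages with many links, remember to follow `continue`/`llcontinue` tokens until the response carries no `continue` block.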
- Fetch revision metadata
  - For English and each linked language title: `prop=revisions&rvprop=timestamp|size|userid|comment&rvlimit=1` with `rvdir=older`/`newer` as needed to get first and latest revision timestamps.
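The same query shape serves both directions: `rvdir=newer` with `rvlimit=1` yields the first (creation) revision, while the default `rvdir=older` yields the latest. A parser sketch, again assuming `formatversion=2` (missing pages carry a `missing` flag and no `revisions` key); the responses below are fabricated:

```python
def revision_timestamp(resp: dict):
    """Return the single revision timestamp from a prop=revisions response,
    or None when the page is missing (no revisions key)."""
    pages = resp.get("query", {}).get("pages", [])
    if not pages:
        return None
    revs = pages[0].get("revisions", [])
    return revs[0]["timestamp"] if revs else None

# Fabricated examples: an existing page and a missing one.
existing = {"query": {"pages": [
    {"title": "Cholera",
     "revisions": [{"timestamp": "2001-05-17T09:00:00Z"}]}]}}
missing = {"query": {"pages": [{"title": "Nonexistent", "missing": True}]}}
```

Returning `None` for missing pages feeds directly into the coverage metric (page present 1/0).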
- Compute metrics
  - Coverage: page present (1/0).
  - Time-to-presence: first non-English creation time minus English creation time.
  - Update lag: latest English edit time minus latest non-English edit time (days).
  - Optional robustness: compare against the median of the top-N languages instead of only English.
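Both lag metrics reduce to the same timestamp arithmetic once everything is in UTC. A sketch with illustrative values:

```python
from datetime import datetime, timezone

TS_FMT = "%Y-%m-%dT%H:%M:%SZ"  # MediaWiki API timestamp format

def to_utc(ts: str) -> datetime:
    """Parse an API timestamp string into a timezone-aware UTC datetime."""
    return datetime.strptime(ts, TS_FMT).replace(tzinfo=timezone.utc)

def days_between(later: str, earlier: str) -> float:
    """Signed difference in days (float): later minus earlier."""
    return (to_utc(later) - to_utc(earlier)).total_seconds() / 86400.0

# Time-to-presence: non-English creation minus English creation.
ttp = days_between("2015-09-05T12:00:00Z", "2015-09-01T12:00:00Z")  # 4.0
# Update lag: latest English edit minus latest non-English edit.
lag = days_between("2024-03-11T00:00:00Z", "2024-03-01T00:00:00Z")  # 10.0
```

Keeping the result signed is deliberate: a negative update lag flags topics where a non-English edition is fresher than English, which the reference-language framing would otherwise hide.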
- Deliver
  - Artifacts: `pages_raw.parquet`, `lang_presence.parquet`, `lags.parquet`, `metrics.csv`.
  - Dashboard: heatmap (languages × topics) of update lags; coverage bar charts; language ranking tables.
  - Methods README with exact queries, rate-limit/backoff notes, and caveats.
- Quality & Ops
  - Caching of responses; exponential backoff on `maxlag`.
  - Unit tests for merge logic and timestamp math; schema checks.
  - (Optional) Monthly GitHub Action to refresh a subset.
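On the `maxlag` point: pass a `maxlag` parameter with each request, and when the servers are lagged the API responds with a `maxlag` error payload; the polite client waits and retries with exponential backoff. A retry-wrapper sketch — the exception type and delay schedule are this project's choices, not API requirements, and the sleep function is injectable so the schedule is testable:

```python
import time

class MaxLagError(Exception):
    """Raised by the fetch callable when the API returns a maxlag error."""

def call_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on MaxLagError wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except MaxLagError:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"giving up after {max_retries} maxlag retries")
```

Wrap the actual HTTP call so it raises `MaxLagError` whenever the JSON body contains an error with code `maxlag`.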
Resources/Instructions
API docs to pin in the repo
- MediaWiki Action API overview: API:Action_API
- Interlanguage links: API:Langlinks
- Revisions metadata: API:Revisions
- Site matrix (language list): API:Sitematrix
- Pageviews (REST; optional context): Wikimedia REST API: Pageviews
Suggested libraries (pick pairs)
- HTTP: `requests` | `httpx`
- Frames: `pandas` | `polars`
- Storage: `duckdb` | `sqlite`
- Viz: `Altair` | `Plotly`
- Parsing (optional): `mwparserfromhell` | `wikitextparser`
Sample queries to copy into notes
```
# Interlanguage links for an English page
action=query&prop=langlinks&titles=Migration_crisis&lllimit=max

# Latest revision timestamp for a given title on any wiki
action=query&prop=revisions&rvprop=timestamp|size&rvlimit=1&titles=<TITLE>

# First revision timestamp (creation): enumerate oldest-first
action=query&prop=revisions&rvprop=timestamp&rvlimit=1&rvdir=newer&titles=<TITLE>

# Languages list (code ↔ wiki mapping)
action=sitematrix&formatversion=2
```
Ethics & reporting
- Avoid punitive “league tables.” Emphasize resource constraints and volunteer capacity.
- Aggregate results; no profiling of individual editors.
- Document missingness (pages that never existed, renamed pages, disambiguation).
- Be explicit that latest edit time ≠ content parity (it’s a proxy).
If this issue requires access to 311 data: Not applicable.
Project Outline (detailed plan for this idea)
Research question: Do sensitive-topic pages appear and stay up to date across languages at similar speeds, or do we observe systematic coverage gaps and update lags?
Data sources & modules
- `prop=langlinks` to discover cross-language equivalents for each English seed page.
- `action=sitematrix` to enumerate languages and validate wiki domains.
- `prop=revisions` to get first and latest timestamps per page/language.
- (Optional) Pageviews REST API to contextualize demand vs. freshness.
Method
- Build a seed list of English pages in public health, migration, elections, climate disasters, and human rights (store in `topics.csv`).
- For each page, pull `langlinks` to get target-language titles; validate with `sitematrix`.
- For English + each language title, fetch first and latest revision timestamps.
- Compute metrics per (topic, language): coverage, time-to-presence, update lag; summarize by language family/region.
- Visualize a lag heatmap (languages × topics), coverage distributions, and top-lagging topics.
Key metrics
- Coverage rate (% of topics that exist per language).
- Median time-to-presence (days).
- Median update lag (days) and % of topics with lag > thresholds (e.g., >30, >90 days).
- (Optional) Correlate lag with pageviews to see “high-demand but stale” cases.
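The key metrics above reduce to one small aggregation per language. A pure-Python sketch, where a `None` lag stands for a topic with no page in that language (it counts against coverage but is excluded from the lag statistics):

```python
import statistics

def language_summary(lags, thresholds=(30, 90)):
    """Summarize one language's per-topic lags (days); None = not covered."""
    present = [lag for lag in lags if lag is not None]
    summary = {
        "coverage": len(present) / len(lags),
        "median_lag": statistics.median(present) if present else None,
    }
    for t in thresholds:
        summary[f"pct_gt_{t}d"] = (
            sum(lag > t for lag in present) / len(present) if present else None
        )
    return summary

language_summary([5.0, 40.0, 100.0, None])
# coverage 0.75, median_lag 40.0, pct_gt_30d 2/3, pct_gt_90d 1/3
```

The same function applied per topic (across languages) gives the topic-level view for the heatmap.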
Deliverables
- Clean tables (`lang_presence.parquet`, `first_latest_revisions.parquet`, `lags.parquet`).
- Reproducible notebook + `reports/language_equity.md`.
- Streamlit/Altair dashboard with filters (topic set, language subset, thresholds).
Caveats & limitations
- Timestamp proxies don’t guarantee semantic parity; translations may be partial.
- Some languages may title pages differently or merge topics; handle redirects carefully.
- API throttling and `maxlag` require polite batching and retries.
Implementation notes
- Normalize timestamps to UTC; compute diffs in days (float).
- Use stable keys: `(wiki_db, pageid)` when available; fall back to `(lang_code, normalized_title)`.
- Cache raw JSON responses and write a manifest of query params for reproducibility.
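The fallback key needs a deterministic title normalization. True MediaWiki normalization is per-wiki (capitalization rules, namespace aliases), so the sketch below is only a rough approximation for joining; `(wiki_db, pageid)` remains the preferred key:

```python
def normalize_title(title: str) -> str:
    """Rough title normalization: underscores -> spaces, trim whitespace,
    uppercase the first letter. Real per-wiki rules differ."""
    t = title.replace("_", " ").strip()
    return t[:1].upper() + t[1:]

def stable_key(row: dict):
    """Prefer (wiki_db, pageid); fall back to (lang_code, normalized_title)."""
    if row.get("wiki_db") and row.get("pageid"):
        return (row["wiki_db"], row["pageid"])
    return (row["lang_code"], normalize_title(row["title"]))
```

The normalization is intentionally conservative: it fixes the common URL-vs-API variants (underscores, leading case) without guessing at wiki-specific rules, and any remaining mismatches surface in the missingness documentation rather than silently merging distinct pages.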