
MediaWiki API Project: Language Equity (Foreign Languages)

Open chinaexpert1 opened this issue 4 months ago • 0 comments

Overview

Measure language equity on sensitive topics by comparing how quickly key Wikipedia pages are created and updated across languages. Output a reproducible dataset and dashboard that quantify coverage (existence), “time-to-translation,” and update lag versus a reference language (e.g., English).

Action Items

If this is the beginning (research & design)

  • Define scope: start with ~50 English Wikipedia pages in sensitive domains (public health, migration, elections, climate disasters, human rights).
  • Choose language set: top 30 Wikipedias by article count + 10 low-resource languages for contrast (via sitematrix), or all languages returned by each page’s langlinks.
  • Finalize metrics & windows: coverage status, time-to-first-presence (page exists Y/N and when), and update lag (delta between latest edit timestamps across languages).
  • Methods plan: use MediaWiki Action API (prop=langlinks, prop=revisions, list=sitematrix) and optionally Pageviews REST API for context; decide reference language(s) for lags.
  • Tooling pairs: requests or httpx; pandas or polars; storage in duckdb or sqlite; viz in Altair or Plotly.
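The tooling and the maxlag/backoff requirement from Quality & Ops can be sketched as one small client helper. This is a sketch, not the project's implementation: the endpoint URL pattern, the retry count, and the backoff schedule are my assumptions; it uses requests as suggested above.

```python
import time
import requests

# Assumed endpoint pattern for per-language Wikipedias
API_TEMPLATE = "https://{lang}.wikipedia.org/w/api.php"

# Shared defaults: JSON output, modern format, polite maxlag
DEFAULTS = {"format": "json", "formatversion": 2, "maxlag": 5}

def build_params(**extra):
    """Merge the shared defaults into one query's parameters."""
    params = dict(DEFAULTS)
    params.update(extra)
    return params

def backoff_delays(retries, base=1.0):
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(retries)]

def fetch(lang, retries=4, **extra):
    """GET one Action API query, sleeping and retrying when the server reports maxlag."""
    url = API_TEMPLATE.format(lang=lang)
    for delay in backoff_delays(retries):
        resp = requests.get(url, params=build_params(**extra), timeout=30)
        data = resp.json()
        if data.get("error", {}).get("code") == "maxlag":
            time.sleep(delay)  # server is lagged; wait and retry
            continue
        return data
    raise RuntimeError(f"gave up after {retries} maxlag retries on {lang}")
```

The same helper works for httpx with a one-line swap, so the requests/httpx choice stays isolated here.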

If researched and ready (implementation steps)

  1. Seed topics

    • Create a topics.csv of English page titles and (optional) Wikidata QIDs for disambiguation.
  2. Enumerate languages & interlanguage links

    • For each English title: prop=langlinks to get language codes and titles; merge with sitematrix to validate wiki domains.
  3. Fetch revision metadata

    • For English and each linked language title: prop=revisions&rvprop=timestamp|size|userid|comment&rvlimit=1, with rvdir=newer to get the first (creation) revision and the default order (newest first) to get the latest revision timestamp.
  4. Compute metrics

    • Coverage: page present (1/0).
    • Time-to-presence: each language version’s creation time minus the English creation time.
    • Update lag: latest English edit time minus latest non-English edit time (days).
    • Optional robustness: compare against median of top-N languages instead of only English.
  5. Deliver

    • Artifacts: pages_raw.parquet, lang_presence.parquet, lags.parquet, metrics.csv.
    • Dashboard: heatmap (languages × topics) of update lags; coverage bar charts; language ranking tables.
    • Methods README with exact queries, rate-limit/backoff notes, and caveats.
  6. Quality & Ops

    • Caching of responses; exponential backoff on maxlag.
    • Unit tests for merge logic and timestamp math; schema checks.
    • (Optional) Monthly GitHub Action to refresh a subset.
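Step 2's langlinks handling reduces to a small parser once responses are cached as JSON. A sketch, assuming the formatversion=2 response layout (query.pages is a list; each langlinks entry carries lang and title); the sample response below is synthetic, not real API output.

```python
def parse_langlinks(response):
    """Map each page title to its {lang_code: target_title} interlanguage links.

    Expects an Action API response in formatversion=2 layout.
    Pages with no langlinks get an empty dict (useful for coverage metrics).
    """
    out = {}
    for page in response.get("query", {}).get("pages", []):
        links = {ll["lang"]: ll["title"] for ll in page.get("langlinks", [])}
        out[page["title"]] = links
    return out

# Synthetic example response (illustration only)
langlinks_sample = {
    "query": {
        "pages": [
            {
                "pageid": 1,
                "title": "Migration crisis",
                "langlinks": [
                    {"lang": "de", "title": "Migrationskrise"},
                    {"lang": "es", "title": "Crisis migratoria"},
                ],
            }
        ]
    }
}
```

Keeping the parser separate from the HTTP layer makes the unit tests in step 6 trivial: feed it fixture dicts, no network needed.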

Resources/Instructions

API docs to pin in the repo

  • MediaWiki Action API overview: API:Action_API
  • Interlanguage links: API:Langlinks
  • Revisions metadata: API:Revisions
  • Site matrix (language list): API:Sitematrix
  • Pageviews (REST; optional context): Wikimedia REST API: Pageviews

Suggested libraries (pick pairs)

  • HTTP: requests | httpx
  • Frames: pandas | polars
  • Storage: duckdb | sqlite
  • Viz: Altair | Plotly
  • Parsing (optional): mwparserfromhell | wikitextparser

Sample queries to copy into notes

# Interlanguage links for an English page
action=query&prop=langlinks&titles=Migration_crisis&lllimit=max

# Latest revision timestamp for a given title on any wiki
action=query&prop=revisions&rvprop=timestamp|size&rvlimit=1&titles=<TITLE>

# First revision timestamp (creation): request oldest
action=query&prop=revisions&rvprop=timestamp&rvlimit=1&rvdir=newer&titles=<TITLE>

# Languages list (code ↔ wiki mapping)
action=sitematrix&formatversion=2
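The sample queries above can also live in code as parameter builders, so the notes and the client never drift apart. The function names are mine; the parameter values are transcribed from the query strings above.

```python
def langlinks_params(title):
    """Interlanguage links for one page (mirrors the first sample query)."""
    return {"action": "query", "prop": "langlinks",
            "titles": title, "lllimit": "max"}

def latest_revision_params(title):
    """Latest revision: the API returns newest-first by default, so no rvdir."""
    return {"action": "query", "prop": "revisions",
            "rvprop": "timestamp|size", "rvlimit": 1, "titles": title}

def first_revision_params(title):
    """Creation timestamp: rvdir=newer enumerates revisions oldest-first."""
    return {"action": "query", "prop": "revisions",
            "rvprop": "timestamp", "rvlimit": 1, "rvdir": "newer",
            "titles": title}
```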

Ethics & reporting

  • Avoid punitive “league tables.” Emphasize resource constraints and volunteer capacity.

  • Aggregate results; no profiling of individual editors.

  • Document missingness (pages that never existed, renamed pages, disambiguation).

  • Be explicit that latest edit time ≠ content parity (it’s a proxy).

  • If this issue requires access to 311 data, please answer the following questions: Not applicable.

Project Outline (detailed plan for this idea):

Research question: Do sensitive-topic pages appear and stay up-to-date across languages at similar speeds, or do we observe systematic coverage gaps and update lags?

Data sources & modules

  • prop=langlinks to discover cross-language equivalents for each English seed page.
  • list=sitematrix to enumerate languages and validate wiki domains.
  • prop=revisions to get first and latest timestamps per page/language.
  • (Optional) Pageviews REST API to contextualize demand vs. freshness.
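The sitematrix validation step can be sketched as a filter from the raw response down to a {lang_code: domain} map. The response shape assumed here (numeric string keys per language entry, plus "count" and "specials") matches the documented sitematrix layout, but the fragment below is synthetic; the dbname-ends-with-"wiki" filter for Wikipedias is my assumption and should be checked against real output.

```python
def wikipedia_domains(sitematrix_response):
    """Extract {lang_code: wikipedia_domain} from a sitematrix response.

    Skips the 'count' and 'specials' keys and keeps only sites whose
    dbname ends in 'wiki' (i.e., the Wikipedias, not Wiktionary etc.).
    """
    out = {}
    matrix = sitematrix_response.get("sitematrix", {})
    for key, entry in matrix.items():
        if key in ("count", "specials"):
            continue
        for site in entry.get("site", []):
            if site.get("dbname", "").endswith("wiki"):
                out[entry["code"]] = site["url"]
    return out

# Synthetic sitematrix fragment (illustration only)
sitematrix_sample = {
    "sitematrix": {
        "count": 2,
        "0": {"code": "en", "name": "English",
              "site": [{"url": "https://en.wikipedia.org", "dbname": "enwiki"}]},
        "1": {"code": "aa", "name": "Afar",
              "site": [{"url": "https://aa.wiktionary.org", "dbname": "aawiktionary"}]},
        "specials": [],
    }
}
```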

Method

  1. Build a seed list of English pages in public health, migration, elections, climate disasters, and human rights (store in topics.csv).
  2. For each page, pull langlinks to get target language titles; validate with sitematrix.
  3. For English + each language title, fetch first and latest revision timestamps.
  4. Compute metrics per (topic, language): coverage, time-to-presence, update lag; summarize by language family/region.
  5. Visualize a lag heatmap (languages × topics), coverage distributions, and top-lagging topics.
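Step 4's per-language summary can be computed without a frames library at all; a minimal sketch in plain Python (the row shape and field choices are mine, the real pipeline would use pandas or polars as listed above):

```python
from statistics import median

def summarize_language(rows):
    """Summarize one language's per-topic rows.

    Each row is (exists: bool, lag_days: float or None).
    Returns the coverage rate over all topics and the median update lag
    over the topics that exist and have a computed lag.
    """
    coverage = sum(1 for exists, _ in rows if exists) / len(rows)
    lags = [lag for exists, lag in rows if exists and lag is not None]
    return {"coverage": coverage,
            "median_lag_days": median(lags) if lags else None}

# Hypothetical per-topic rows for one language (illustration only)
example_rows = [(True, 3.0), (True, 45.5), (False, None), (True, 12.0)]
```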

Key metrics

  • Coverage rate (% of topics that exist per language).
  • Median time-to-presence (days).
  • Median update lag (days) and % of topics with lag > thresholds (e.g., >30, >90 days).
  • (Optional) Correlate lag with pageviews to see “high-demand but stale” cases.
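The threshold metric above is a one-liner worth pinning down, since "% with lag > 30/90 days" is ambiguous about strict vs. inclusive comparison; this sketch uses strictly-greater, which is an assumption to document in the README.

```python
def lag_threshold_shares(lags, thresholds=(30, 90)):
    """Share of topics whose update lag strictly exceeds each threshold (days)."""
    return {t: sum(1 for lag in lags if lag > t) / len(lags)
            for t in thresholds}
```

For example, lags of [5, 40, 100, 200] give a 0.75 share over 30 days and a 0.5 share over 90 days.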

Deliverables

  • Clean tables (lang_presence.parquet, first_latest_revisions.parquet, lags.parquet).
  • Reproducible notebook + reports/language_equity.md.
  • Streamlit/Altair dashboard with filters (topic set, language subset, thresholds).

Caveats & limitations

  • Timestamp proxies don’t guarantee semantic parity; translations may be partial.
  • Some languages may title pages differently or merge topics; handle redirects carefully.
  • API throttling and maxlag require polite batching and retries.
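For the redirect caveat: sending queries with redirects=1 makes the API report its title rewrites, which the client then has to follow to recover the canonical title. A sketch, assuming the response carries "normalized" and "redirects" lists of {"from", "to"} mappings in chain order; the sample response is synthetic.

```python
def resolve_title(response, title):
    """Follow the 'normalized' and 'redirects' mappings reported when a
    query is sent with redirects=1, returning the canonical title."""
    query = response.get("query", {})
    for mapping in query.get("normalized", []) + query.get("redirects", []):
        if mapping["from"] == title:
            title = mapping["to"]
    return title

# Synthetic response fragment (illustration only)
redirect_sample = {
    "query": {
        "normalized": [{"from": "migration_crisis", "to": "Migration crisis"}],
        "redirects": [{"from": "Migration crisis",
                       "to": "European migrant crisis"}],
    }
}
```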

Implementation notes

  • Normalize timestamps to UTC; compute diffs in days (float).
  • Use stable keys: (wiki_db, pageid) when available; fall back to (lang_code, normalized_title).
  • Cache raw JSON responses and write a manifest of query params for reproducibility.
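The timestamp note above pins down the core arithmetic; a stdlib sketch (the MediaWiki timestamp format "YYYY-MM-DDTHH:MM:SSZ" is as returned by rvprop=timestamp):

```python
from datetime import datetime, timezone

def parse_mw_timestamp(ts):
    """Parse a MediaWiki timestamp ('2024-01-15T06:00:00Z') as an aware UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def lag_days(reference_ts, other_ts):
    """Update lag in days (float): reference language's latest edit minus the other's."""
    delta = parse_mw_timestamp(reference_ts) - parse_mw_timestamp(other_ts)
    return delta.total_seconds() / 86400.0
```

Doing the subtraction on aware UTC datetimes (rather than raw strings or naive datetimes) is what makes the "diffs in days (float)" rule safe across DST and locale settings.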

chinaexpert1 commented Sep 13 '25 21:09