
MediaWiki API Project: Controversy and Protection Moderation

Open chinaexpert1 opened this issue 4 months ago • 0 comments

Overview

Quantify controversy & protection dynamics on policy-sensitive Wikipedia pages by tracking protection events (protect/unprotect, semi/full) and correlating them with edit/revert bursts over time. Deliver a reproducible dataset and dashboard that surface where, when, and how protection is used as a moderation tool.

Action Items

If this is the beginning (research & design)

  • Define scope: 50–200 English Wikipedia pages in domains like elections, policing, immigration, reproductive rights, public health, climate, and disinformation (store in seed_pages.csv with title,pageid,qid if available).
  • Metrics & windows: protection type/level, duration, frequency (events/page/year), time-to-protection after burst, revert ratio before/after protection, unique editors, and edit volume ±{7,14,30} days around events.
  • Methods: MediaWiki Action API modules — list=logevents (protection history), prop=info&inprop=protection (current state), prop=revisions (edit stream with timestamps, users, comments, sha1). Optional: Pageviews REST for demand context; ORES (if enabled) for damaging/goodfaith signals.
  • Tooling (choose pairs and keep consistent): requests or httpx; pandas or polars; duckdb or sqlite; altair or plotly.
  • Ethics: aggregate results; do not profile individual editors; document missing/hidden logs; avoid normative judgments.
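
A minimal seed_pages.csv might look like this (illustrative titles; pageid and qid can be left blank and resolved in step 1 below):

title,pageid,qid
Climate change denial,,
Voter identification laws in the United States,,
COVID-19 misinformation,,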

If researched and ready (implementation steps)

  1. Seed & resolve pages

    • Ingest seed_pages.csv; resolve pageid via action=query&titles=<title> if missing; record redirects.
  2. Protection history

    • Pull list=logevents with letype=protect for each page (the protect log type covers the protect, modify, and unprotect actions); capture timestamp, action, level (e.g., semi/full), expiry, and reason; paginate via continue (see the logevents sketch after this list).
    • Snapshot current protection via prop=info&inprop=protection.
  3. Revision stream (context windows)

    • For each protection event, fetch revisions in windows (e.g., −30 to +30 days) via prop=revisions&rvprop=ids|timestamp|user|comment|sha1|size.
    • Compute per-day edits, unique editors, revert flags (sha1 repeat, or comment contains “revert/undid/rv”), and size deltas (see the revision-window sketch after this list).
  4. Feature engineering

    • Event-level table: pageid, event_ts, level, duration, edits_before/after, revert_ratio_before/after, Δunique_editors, peak_day_edits, time_to_peak, current_protection_state.
  5. Analysis

    • KPIs: events/page/year; median protection duration; % semi vs full; typical time-to-protection after spike; change in revert ratio post-protection; pages with recurring protections (re-protect within 90 days).
    • Optional: structural-break tests on edit counts; correlate with pageviews.
  6. Deliver

    • Artifacts: protection_events.parquet, revision_windows.parquet, metrics.csv.
    • Dashboard: timelines with event overlays; small-multiples per topic; tables of top pages by protection frequency/duration.
    • Methods README: exact queries, rate-limit/backoff strategy, assumptions, and limitations.
  7. Quality & Ops

    • Caching, retries with exponential backoff; log maxlag responses; store raw JSON.
    • Unit tests: pagination correctness, windowing logic, revert-detection heuristics.
    • Optional: monthly GitHub Action to refresh new events.
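
A minimal sketch of step 2 (protection history), assuming Python with requests against the English Wikipedia Action API; the function name, User-Agent string, and example page title are illustrative, and production runs should add the maxlag/backoff handling from step 7:

import requests

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "controversy-protection-research/0.1 (contact: example@example.org)"}  # placeholder contact

def fetch_protection_events(title):
    """Yield raw protect-log entries (protect/modify/unprotect actions) for one page title."""
    params = {
        "action": "query",
        "format": "json",
        "list": "logevents",
        "letype": "protect",            # the protect log type covers protect, modify, unprotect
        "letitle": title,
        "leprop": "title|type|timestamp|details|comment",
        "lelimit": "max",
    }
    while True:
        resp = requests.get(API, params=params, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        yield from data.get("query", {}).get("logevents", [])
        if "continue" not in data:       # no further pages of results
            return
        params.update(data["continue"])  # persist these tokens if runs need to be resumable

events = list(fetch_protection_events("Climate change denial"))
# each entry carries 'action' (protect/modify/unprotect), 'timestamp', 'comment', and,
# for protect/modify, the protection details (level, expiry) under 'params'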
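
Likewise for steps 3–4, a hedged sketch of the revision-window pull and per-day features, assuming requests and pandas; the window bounds, revert regex, and column names are illustrative choices rather than anything fixed by the API:

import re
import requests
import pandas as pd

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "controversy-protection-research/0.1 (contact: example@example.org)"}  # placeholder contact
REVERT_RE = re.compile(r"\b(revert|rv|undid|undo)\b", re.IGNORECASE)

def fetch_window_revisions(title, newer_iso, older_iso):
    """Fetch revisions between older_iso and newer_iso (default rvdir walks newest -> oldest)."""
    params = {
        "action": "query", "format": "json",
        "prop": "revisions", "titles": title,
        "rvprop": "ids|timestamp|user|comment|sha1|size",
        "rvstart": newer_iso,    # newer bound
        "rvend": older_iso,      # older bound
        "rvlimit": "max",
    }
    rows = []
    while True:
        data = requests.get(API, params=params, headers=HEADERS, timeout=30).json()
        for page in data.get("query", {}).get("pages", {}).values():
            rows.extend(page.get("revisions", []))
        if "continue" not in data:
            break
        params.update(data["continue"])
    return pd.DataFrame(rows).sort_values("timestamp").reset_index(drop=True)

def daily_features(revs):
    """Per-day edits, unique editors, heuristic revert count, and net size change."""
    revs = revs.copy()
    revs["day"] = pd.to_datetime(revs["timestamp"], utc=True).dt.date
    revs["is_revert"] = (
        revs["sha1"].duplicated()                              # content identical to an earlier revision
        | revs["comment"].fillna("").str.contains(REVERT_RE)   # or an explicit revert-style comment
    )
    return revs.groupby("day").agg(
        edits=("revid", "count"),
        unique_editors=("user", "nunique"),
        reverts=("is_revert", "sum"),
        net_size_change=("size", lambda s: s.iloc[-1] - s.iloc[0]),
    )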

Resources/Instructions

Docs & endpoints to pin in repo

MediaWiki Action API overview: https://www.mediawiki.org/wiki/API:Action_API
Logevents (protect/unprotect): https://www.mediawiki.org/wiki/API:Logevents
Page info & protection: https://www.mediawiki.org/wiki/API:Info
Revisions (timestamps, sha1, comments): https://www.mediawiki.org/wiki/API:Revisions
Page titles → pageids: https://www.mediawiki.org/wiki/API:Query
(Optional) Pageviews REST: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews

Suggested libraries (pick pairs)

  • HTTP: requests | httpx
  • Frames: pandas | polars
  • Storage: duckdb | sqlite
  • Viz: altair | plotly
  • Parsing (optional): mwparserfromhell | wikitextparser

Sample queries to copy into notes

# Protection history for a page (letype=protect covers the protect/modify/unprotect actions;
# with the default ledir=older, lestart is the newer bound and leend the older one)
action=query&list=logevents&letype=protect&lelimit=max&leprop=title|type|timestamp|details|comment&lestart=<ISO_END>&leend=<ISO_START>&letitle=<PAGE_TITLE>

# Current protection snapshot
action=query&prop=info&inprop=protection&titles=<PAGE_TITLE>

# Latest N revisions in a window
action=query&prop=revisions&rvprop=ids|timestamp|user|comment|sha1|size&rvstart=<ISO_END>&rvend=<ISO_START>&rvlimit=max&titles=<PAGE_TITLE>
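
To exercise the snapshot query from a script, one option (an assumption, not something specified in this issue) is a small requests helper that batches titles; the protection fields follow the standard API:Info output:

import requests

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "controversy-protection-research/0.1 (contact: example@example.org)"}  # placeholder contact

def current_protection(titles):
    """Map each title to its current protection entries (empty list = unprotected)."""
    params = {
        "action": "query", "format": "json",
        "prop": "info", "inprop": "protection",
        "titles": "|".join(titles),            # titles can be batched in one request
    }
    data = requests.get(API, params=params, headers=HEADERS, timeout=30).json()
    return {p["title"]: p.get("protection", []) for p in data["query"]["pages"].values()}

# Each entry looks roughly like {"type": "edit", "level": "autoconfirmed", "expiry": "..."}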

Data handling notes

  • Use polite batching and honor maxlag; persist continue tokens; store raw responses.

  • Prefer pageid as key; handle renames/redirects. Normalize timestamps to UTC.

  • Revert heuristic: sha1 repeat or comment regex (revert|rv|undid|undo) (case-insensitive). Tune and document.

  • If this issue requires access to 311 data, please answer the following questions: Not applicable; this project only uses public MediaWiki/Wikimedia APIs.
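
A sketch of the polite-request wrapper implied by the first note above, assuming requests; the maxlag value, backoff schedule, and raw-response filename are illustrative:

import json
import time
import requests

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "controversy-protection-research/0.1 (contact: example@example.org)"}  # placeholder contact

def polite_get(params, raw_log="raw_responses.jsonl", max_retries=5):
    """GET with maxlag set, exponential backoff on lag errors, and raw-response archiving."""
    params = {**params, "format": "json", "maxlag": 5}    # ask the servers to refuse work when replicas lag
    for attempt in range(max_retries):
        resp = requests.get(API, params=params, headers=HEADERS, timeout=30)
        data = resp.json()
        if data.get("error", {}).get("code") == "maxlag":
            time.sleep(2 ** attempt)                      # back off, then retry
            continue
        with open(raw_log, "a", encoding="utf-8") as fh:  # keep raw JSON for provenance
            fh.write(json.dumps(data) + "\n")
        return data
    raise RuntimeError("maxlag retries exhausted")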

Project Outline (detailed plan for this idea) in details below:

Research question: When do sensitive-topic pages get protected, for how long, and what editing patterns (edits, reverts, editor counts) precede and follow those protections?

Data sources & modules

  • list=logevents with letype=protect (its actions include protect, modify, and unprotect) for event history.
  • prop=info&inprop=protection for current state.
  • prop=revisions for edit streams (timestamps, sha1, comments, size).
  • (Optional) Pageviews REST for demand context.

Method

  1. Build a topic seed list and resolve pageid.
  2. Extract all protection events (timestamps, level, expiry) and derive durations (until unprotect or expiry, whichever comes first; see the interval sketch after this list).
  3. For each event, pull revision windows (e.g., ±30 days), compute daily edits, unique editors, revert ratio, and size change.
  4. Aggregate to page-level and topic-level metrics; identify recurrent protection patterns and median “time-to-protection” after spikes.
  5. Visualize timelines with overlays and produce ranked tables.
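
Step 2's duration rule can be sketched in plain Python (illustrative function names; event fields follow the logevents output, and a protection is assumed to end at the earlier of its expiry and the next unprotect):

from datetime import datetime, timezone

def parse_ts(ts):
    """MediaWiki timestamps end in 'Z'; 'infinity'/'infinite' or a missing value means no expiry."""
    if ts in (None, "infinity", "infinite"):
        return None
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def protection_intervals(events, now=None):
    """Turn one page's protect-log stream into (start, end) protection intervals."""
    now = now or datetime.now(timezone.utc)
    intervals, start, expiry = [], None, None
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        ts = parse_ts(ev["timestamp"])
        if ev["action"] in ("protect", "modify"):
            if start is not None and expiry is not None and ts > expiry:
                intervals.append((start, expiry))           # previous protection expired silently
                start = None
            if start is None:
                start = ts
            details = ev.get("params", {}).get("details", [])
            expiries = [parse_ts(d.get("expiry")) for d in details]
            expiry = None if (not expiries or any(e is None for e in expiries)) else max(expiries)
        elif ev["action"] == "unprotect" and start is not None:
            intervals.append((start, ts))
            start, expiry = None, None
    if start is not None:                                   # still open: close at expiry or "now"
        intervals.append((start, min(expiry, now) if expiry else now))
    return intervals

# duration_days = [(end - begin).total_seconds() / 86400 for begin, end in intervals]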

Key metrics

  • Events/page/year; median/mean protection duration; % semi/full; re-protect within 90 days.
  • Δrevert ratio pre→post; Δunique editors; time from spike to protection.
  • Pages in top decile by protection frequency or duration.
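
Two of these KPIs as a pandas sketch over the event-level table from step 4 of the implementation steps (pageid and event_ts come from that table; the output column names are illustrative):

import pandas as pd

def kpi_summary(events):
    """events: one row per protection event, with columns pageid and event_ts (UTC datetimes)."""
    events = events.sort_values(["pageid", "event_ts"])
    per_page_ts = events.groupby("pageid")["event_ts"]
    span_years = (per_page_ts.max() - per_page_ts.min()).dt.days.div(365.25).clip(lower=1.0)
    events_per_year = events.groupby("pageid").size() / span_years   # use at least a one-year denominator
    gaps = per_page_ts.diff()                                        # time since a page's previous event
    reprotect_90d = gaps.le(pd.Timedelta(days=90)).groupby(events["pageid"]).any()
    return pd.DataFrame({"events_per_year": events_per_year,
                         "reprotect_within_90d": reprotect_90d})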

Deliverables

  • protection_events.parquet, revision_windows.parquet, page_metrics.parquet.
  • Reproducible notebook + reports/controversy_protection.md.
  • Streamlit/Altair dashboard with filters (topic, date range, protection level).

Caveats & limitations

  • Some logs/details may be hidden or suppressed; treat as missing.
  • Protection status ≠ resolution of controversy; it’s a moderation proxy.
  • Revert heuristics are imperfect; validate on a sampled set.

Implementation notes

  • Robust pagination (continue), retry/backoff with prints on failures.
  • Keep a query manifest (params + timestamps) for provenance.
  • Test window slicing and duration calculations; ensure correct handling of overlapping protection intervals.
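
The query manifest mentioned above can be as simple as an append-only JSONL file; a minimal sketch with illustrative field and file names:

import json
from datetime import datetime, timezone

def log_query(params, manifest="query_manifest.jsonl"):
    """Append each API call's parameters and fetch time to an append-only provenance file."""
    entry = {"fetched_at": datetime.now(timezone.utc).isoformat(), "params": params}
    with open(manifest, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")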
