MediaWiki API Project: Controversy and Protection Moderation
Overview
Quantify controversy & protection dynamics on policy-sensitive Wikipedia pages by tracking protection events (protect/unprotect, semi/full) and correlating them with edit/revert bursts over time. Deliver a reproducible dataset and dashboard that surface where, when, and how protection is used as a moderation tool.
Action Items
If this is the beginning (research & design)
- Define scope: 50–200 English Wikipedia pages in domains like elections, policing, immigration, reproductive rights, public health, climate, and disinformation (store in seed_pages.csv with title, pageid, and qid if available).
- Metrics & windows: protection type/level, duration, frequency (events/page/year), time-to-protection after burst, revert ratio before/after protection, unique editors, and edit volume ±{7,14,30} days around events.
- Methods: MediaWiki Action API modules: list=logevents (protection history), prop=info&inprop=protection (current state), prop=revisions (edit stream with timestamps, users, comments, sha1). Optional: Pageviews REST for demand context; ORES (if enabled) for damaging/goodfaith signals.
- Tooling (choose one from each pair and keep consistent): requests or httpx; pandas or polars; duckdb or sqlite; altair or plotly.
- Ethics: aggregate results; do not profile individual editors; document missing/hidden logs; avoid normative judgments.
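The ±{7,14,30}-day windows named in the metrics item can be sketched as below. This is a minimal illustration using only the standard library; the function names `parse_ts` and `event_windows` are hypothetical, and the timestamp format assumes the ISO 8601 UTC strings the Action API returns by default.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts: str) -> datetime:
    """Parse an API-style timestamp such as '2024-01-15T12:00:00Z' into aware UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def event_windows(event_ts: datetime, spans_days=(7, 14, 30)):
    """Return {span: (window_start, window_end)} centered on a protection event."""
    return {
        d: (event_ts - timedelta(days=d), event_ts + timedelta(days=d))
        for d in spans_days
    }


# Example: windows around a (made-up) protection event timestamp.
windows = event_windows(parse_ts("2024-01-15T12:00:00Z"))
for span, (start, end) in windows.items():
    print(span, start.isoformat(), end.isoformat())
```

Fetching the widest window once and slicing the narrower ones locally keeps the number of API requests down.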
If researched and ready (implementation steps)
- Seed & resolve pages
  - Ingest seed_pages.csv; resolve pageid via action=query&titles=<title> if missing; record redirects.
- Protection history
  - Pull list=logevents with letype=protect for each page (letype takes a single log type; protect, modify, and unprotect are reported in each entry's action field); capture timestamp, level (e.g., semi/full), expiry, and reason; paginate via continue.
  - Snapshot current protection via prop=info&inprop=protection.
- Revision stream (context windows)
  - For each protection event, fetch revisions in windows (e.g., −30 to +30 days) via prop=revisions&rvprop=ids|timestamp|user|comment|sha1|size.
  - Compute per-day edits, unique editors, revert flags (sha1 repeat, or comment contains “revert/undid/rv”), and size deltas.
- Feature engineering
  - Event-level table: pageid, event_ts, level, duration, edits_before/after, revert_ratio_before/after, Δunique_editors, peak_day_edits, time_to_peak, current_protection_state.
- Analysis
  - KPIs: events/page/year; median protection duration; % semi vs full; typical time-to-protection after spike; change in revert ratio post-protection; pages with recurring protections (re-protect within 90 days).
  - Optional: structural-break tests on edit counts; correlate with pageviews.
- Deliver
  - Artifacts: protection_events.parquet, revision_windows.parquet, metrics.csv.
  - Dashboard: timelines with event overlays; small-multiples per topic; tables of top pages by protection frequency/duration.
  - Methods README: exact queries, rate-limit/backoff strategy, assumptions, and limitations.
- Quality & Ops
  - Caching, retries with exponential backoff; log maxlag responses; store raw JSON.
  - Unit tests: pagination correctness, windowing logic, revert-detection heuristics.
  - Optional: monthly GitHub Action to refresh new events.
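Pagination via continue, which the steps above rely on and the unit tests should cover, could look roughly like this. It is a sketch, not a definitive client: `iter_logevents` and the injected `fetch` callable are hypothetical names, and `fetch` is assumed to take a parameter dict and return the decoded JSON response (for example a thin wrapper around requests or httpx). Injecting it keeps the loop testable without network access.

```python
from typing import Callable, Dict, Iterator


def iter_logevents(fetch: Callable[[Dict], Dict], base_params: Dict) -> Iterator[Dict]:
    """Yield log entries across all result pages of a list=logevents query.

    The Action API signals more results with a top-level "continue" object;
    its keys are merged into the next request until it no longer appears.
    """
    params = dict(base_params)
    while True:
        data = fetch(params)
        yield from data.get("query", {}).get("logevents", [])
        cont = data.get("continue")
        if not cont:
            break
        # Always restart from the base parameters so stale tokens never leak.
        params = {**base_params, **cont}


# Offline demo with two canned responses standing in for the API.
_canned = iter([
    {"query": {"logevents": [{"logid": 1}, {"logid": 2}]},
     "continue": {"lecontinue": "20200101000000|3", "continue": "-||"}},
    {"query": {"logevents": [{"logid": 3}]}},
])
demo = list(iter_logevents(lambda p: next(_canned), {"action": "query", "list": "logevents"}))
print([e["logid"] for e in demo])
```

A real `fetch` would also be the natural place for caching, raw-JSON storage, and retry/backoff on maxlag errors.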
Resources/Instructions
Docs & endpoints to pin in repo
MediaWiki Action API overview: https://www.mediawiki.org/wiki/API:Action_API
Logevents (protect/unprotect): https://www.mediawiki.org/wiki/API:Logevents
Page info & protection: https://www.mediawiki.org/wiki/API:Info
Revisions (timestamps, sha1, comments): https://www.mediawiki.org/wiki/API:Revisions
Page titles → pageids: https://www.mediawiki.org/wiki/API:Query
(Optional) Pageviews REST: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
Suggested libraries (pick pairs)
- HTTP: requests | httpx
- Frames: pandas | polars
- Storage: duckdb | sqlite
- Viz: altair | plotly
- Parsing (optional): mwparserfromhell | wikitextparser
Sample queries to copy into notes
# Protection history for a page
action=query&list=logevents&letype=protect&lelimit=max&leprop=title|type|timestamp|details|comment&ledir=newer&lestart=<ISO_START>&leend=<ISO_END>&letitle=<PAGE_TITLE>
# Current protection snapshot
action=query&prop=info&inprop=protection&titles=<PAGE_TITLE>
# Revisions in a window (newest first)
action=query&prop=revisions&rvprop=ids|timestamp|user|comment|sha1|size&rvstart=<ISO_END>&rvend=<ISO_START>&rvlimit=max&titles=<PAGE_TITLE>
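For use in code, the three queries can be expressed as parameter dictionaries. The sketch below rests on a few assumptions: the English Wikipedia endpoint, `format=json` with `formatversion=2`, and hypothetical helper names. It uses letype=protect (a single log type whose entries carry the protect/modify/unprotect action) with ledir=newer so that lestart precedes leend, while the revisions query keeps the default newest-first direction.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed target wiki
COMMON = {"action": "query", "format": "json", "formatversion": "2", "maxlag": "5"}


def protection_log_params(title: str, start_iso: str, end_iso: str) -> dict:
    """Protection history for one page, oldest entry first."""
    return {**COMMON, "list": "logevents", "letype": "protect",
            "leprop": "ids|title|type|timestamp|details|comment",
            "ledir": "newer", "lestart": start_iso, "leend": end_iso,
            "letitle": title, "lelimit": "max"}


def current_protection_params(title: str) -> dict:
    """Current protection snapshot for one page."""
    return {**COMMON, "prop": "info", "inprop": "protection", "titles": title}


def revision_window_params(title: str, start_iso: str, end_iso: str) -> dict:
    """Revisions in a window; default direction is newest first, so rvstart is the later bound."""
    return {**COMMON, "prop": "revisions",
            "rvprop": "ids|timestamp|user|comment|sha1|size",
            "rvstart": end_iso, "rvend": start_iso,
            "rvlimit": "max", "titles": title}


def build_url(params: dict) -> str:
    """Render a GET URL, handy for the query manifest or README."""
    return f"{API_URL}?{urlencode(params)}"


print(build_url(current_protection_params("Climate change")))
```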
Data handling notes
- Use polite batching and honor maxlag; persist continue tokens; store raw responses.
- Prefer pageid as key; handle renames/redirects. Normalize timestamps to UTC.
- Revert heuristic: sha1 repeat or comment regex (revert|rv|undid|undo) (case-insensitive). Tune and document.
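One possible shape for the revert heuristic is sketched below. It assumes revisions arrive oldest first as dictionaries with `sha1` and `comment` keys, and `flag_reverts` is a hypothetical name. As one example of tuning, the pattern adds word boundaries so that "rv" does not match inside words such as "survey"; that is a choice made here, not part of the stated regex.

```python
import re

# Tuned variant of (revert|rv|undid|undo): word boundaries cut false positives.
REVERT_RE = re.compile(r"\b(?:revert\w*|rv|undid|undo)\b", re.IGNORECASE)


def flag_reverts(revisions):
    """Return one bool per revision: True if it looks like a revert.

    A revision is flagged when its sha1 equals that of an earlier revision
    (content restored to a previous state) or its comment matches REVERT_RE.
    """
    seen_sha1 = set()
    flags = []
    for rev in revisions:  # assumed chronological order, oldest first
        sha1 = rev.get("sha1")  # may be missing when content is hidden
        comment = rev.get("comment") or ""
        is_revert = (sha1 is not None and sha1 in seen_sha1) or bool(REVERT_RE.search(comment))
        flags.append(is_revert)
        if sha1 is not None:
            seen_sha1.add(sha1)
    return flags


sample = [
    {"sha1": "a", "comment": "initial text"},
    {"sha1": "b", "comment": "add survey data"},
    {"sha1": "a", "comment": ""},
    {"sha1": "c", "comment": "Undid revision 123 by Example"},
]
print(flag_reverts(sample))
```

Revisions whose sha1 is hidden fall back to the comment check alone, which is worth recording as a limitation.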
Project Outline (detailed plan for this idea) in details below:
Research question
When do sensitive-topic pages get protected, for how long, and what editing patterns (edits, reverts, editor counts) precede and follow those protections?
Data sources & modules
- list=logevents with letype=protect for event history (protect, modify, and unprotect appear as each entry's action).
- prop=info&inprop=protection for current state.
- prop=revisions for edit streams (timestamps, sha1, comments, size).
- (Optional) Pageviews REST for demand context.
Method
- Build a topic seed list and resolve pageid.
- Extract all protection events (timestamps, level, expiry) and derive durations (until unprotect/expiry).
- For each event, pull revision windows (e.g., ±30 days), compute daily edits, unique editors, revert ratio, and size change.
- Aggregate to page-level and topic-level metrics; identify recurrent protection patterns and median “time-to-protection” after spikes.
- Visualize timelines with overlays and produce ranked tables.
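The per-event window computation in the method could be sketched as follows. It assumes each revision is a dictionary with an ISO `timestamp`, a `user`, and a precomputed boolean `is_revert`; the function names are hypothetical, and the windows are half-open so that an edit made exactly at the event time counts as "after".

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def parse_ts(ts: str) -> datetime:
    """Parse an API-style UTC timestamp such as '2024-03-10T00:00:00Z'."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def daily_edit_counts(revisions) -> Counter:
    """Count edits per UTC calendar day, keyed by 'YYYY-MM-DD'."""
    return Counter(parse_ts(r["timestamp"]).date().isoformat() for r in revisions)


def summarize_window(revisions, event_ts: datetime, days: int = 30) -> dict:
    """Edits, unique editors, and revert ratio before and after a protection event."""
    lo = event_ts - timedelta(days=days)
    hi = event_ts + timedelta(days=days)
    summary = {}
    for label, start, end in (("before", lo, event_ts), ("after", event_ts, hi)):
        revs = [r for r in revisions if start <= parse_ts(r["timestamp"]) < end]
        n = len(revs)
        summary[label] = {
            "edits": n,
            "unique_editors": len({r["user"] for r in revs}),
            # None rather than 0.0 when there are no edits to take a ratio of.
            "revert_ratio": sum(r["is_revert"] for r in revs) / n if n else None,
        }
    return summary
```

The day with the highest count from `daily_edit_counts` gives peak_day_edits, and its distance from the event gives a time-to-protection estimate.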
Key metrics
- Events/page/year; median/mean protection duration; % semi/full; re-protect within 90 days.
- Δrevert ratio pre→post; Δunique editors; time from spike to protection.
- Pages in top decile by protection frequency or duration.
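Two of these metrics are sketched below under stated assumptions. Events are dictionaries with `pageid`, `action`, and an aware datetime `ts`. "Re-protect within 90 days" is interpreted here as two protect actions on the same page at most 90 days apart; measuring from the end of the previous protection instead would be an equally defensible definition. Function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def events_per_page_year(events, start: datetime, end: datetime) -> dict:
    """Protection events per page per year over the observation period [start, end)."""
    years = (end - start).total_seconds() / (365.25 * 86400)
    counts = defaultdict(int)
    for e in events:
        if start <= e["ts"] < end:
            counts[e["pageid"]] += 1
    return {pid: n / years for pid, n in counts.items()}


def pages_reprotected(events, within_days: int = 90) -> set:
    """Pages with two consecutive 'protect' actions no more than within_days apart."""
    by_page = defaultdict(list)
    for e in events:
        if e["action"] == "protect":
            by_page[e["pageid"]].append(e["ts"])
    limit = timedelta(days=within_days)
    hits = set()
    for pid, stamps in by_page.items():
        stamps.sort()
        if any(later - earlier <= limit for earlier, later in zip(stamps, stamps[1:])):
            hits.add(pid)
    return hits
```

Pages with no events in the period are absent from the `events_per_page_year` result, so they need to be filled in as zero before computing medians or deciles across the seed list.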
Deliverables
- protection_events.parquet, revision_windows.parquet, page_metrics.parquet.
- Reproducible notebook + reports/controversy_protection.md.
- Streamlit/Altair dashboard with filters (topic, date range, protection level).
Caveats & limitations
- Some logs/details may be hidden or suppressed; treat as missing.
- Protection status ≠ resolution of controversy; it’s a moderation proxy.
- Revert heuristics are imperfect; validate on a sampled set.
Implementation notes
- Robust pagination (continue), retry/backoff with prints on failures.
- Keep a query manifest (params + timestamps) for provenance.
- Test window slicing and duration calculations; ensure correct handling of overlapping protection intervals.
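Duration calculation with overlapping or adjacent intervals might be handled along these lines. The sketch assumes each event is a dictionary with an aware datetime `ts`, an `action`, and an `expiry` that is a datetime or None for indefinite protection. It also assumes that any later protection log entry on the same page supersedes the earlier settings, and that open-ended intervals are cut off at a chosen `observed_until` date. All names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def protection_intervals(events, observed_until: datetime):
    """Turn one page's protection log entries into (start, end) intervals."""
    events = sorted(events, key=lambda e: e["ts"])
    intervals = []
    for i, e in enumerate(events):
        if e["action"] == "unprotect":
            continue
        end = e.get("expiry") or observed_until  # None means indefinite
        if i + 1 < len(events):
            end = min(end, events[i + 1]["ts"])  # superseded by the next entry
        end = min(end, observed_until)
        if end > e["ts"]:
            intervals.append((e["ts"], end))
    return intervals


def merge_intervals(intervals):
    """Merge overlapping or touching intervals into continuous protected spans."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_protected(intervals) -> timedelta:
    """Total time under protection, counting overlapping time only once."""
    return sum((end - start for start, end in merge_intervals(intervals)), timedelta())
```

Merging before summing matters when a modify entry extends an existing protection: without it, the two entries would be reported as separate short protections rather than one continuous span.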