MediaWiki API Project: Edit Wars and Reverts
Overview
Measure edit-war and revert intensity on culturally charged Wikipedia pages by detecting reverts, mutual-revert “episodes,” and burstiness in editing activity. Deliver a reproducible dataset and dashboard that surface where contention concentrates and how it evolves over time.
Action Items
If this is the beginning (research & design)
- Define scope: 50–200 English Wikipedia articles across domains (elections, immigration, policing, reproductive rights, conflict, disinformation). Save as `seed_pages.csv`.
- Time window: 2019 → present (adjustable).
- Metrics & definitions: revert (via SHA1 match and/or tags/comments), revert ratio, mutual-revert pairs, episode detection (≥2 reverts between the same two editors within 48h, at least one in each direction), editor mix (anon vs registered vs bot), burstiness (Gini of edits/day).
- Revert detection plan:
  - SHA1-based: a revision whose `sha1` equals a previous revision’s `sha1` implies a revert-to.
  - Comment/tag-based: look for tags (rollback/undo) and comment regex `(revert|rv|undid|rollback)` (case-insensitive).
Tooling (pick one per pair, keep consistent):
requestsorhttpx;pandasorpolars; storageduckdborsqlite; vizaltairorplotly. -
Ethics: aggregate reporting; no profiling of individual editors; clearly document limitations of heuristics.
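
A minimal sketch of the comment/tag heuristic from the revert detection plan. Two things here are assumptions rather than part of the plan: the word boundaries around the regex (added because a bare `rv` also matches inside words such as "survey"), and the tag names in `REVERT_TAGS`, which should be checked against the target wiki's tag list before use. The function name `revert_signal` is illustrative.

```python
import re

# The plan's pattern, with word boundaries added (an assumption) so that
# "rv" does not fire inside ordinary words such as "survey".
REVERT_COMMENT_RE = re.compile(
    r"\b(revert(ed|ing|s)?|rv|undid|rollback)\b", re.IGNORECASE
)

# Assumed tag names; verify them on the target wiki before relying on them.
REVERT_TAGS = frozenset({"mw-rollback", "mw-undo", "mw-manual-revert"})


def revert_signal(comment, tags):
    """Return 'tag', 'comment', or None for a single revision."""
    if REVERT_TAGS.intersection(tags or ()):
        return "tag"
    if comment and REVERT_COMMENT_RE.search(comment):
        return "comment"
    return None
```

Tags are checked first because they are set by the software, while comments are free text and noisier. Recording which signal fired makes it easier to report false-positive rates per heuristic later.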
If researched and ready (implementation steps)
- Seed & resolve pages
  - Load `seed_pages.csv`, resolve `pageid` via `action=query&titles=<title>`; record redirects.
- Fetch revision streams
  - For each page and window: `prop=revisions&rvprop=ids|timestamp|user|userid|sha1|size|comment|tags&rvlimit=max` with continuation. Normalize timestamps (UTC).
- Identify bots & anon
  - Build editor list from the stream; call `list=users&ususers=<batch>&usprop=groups` to mark bot accounts; mark anonymous users by missing `userid` (or user is IP).
- Detect reverts
  - SHA1 map per page to detect revert-to targets.
  - Comment/tag heuristics for partial reverts/undos.
  - Construct revert edges: `(reverter → reverted, timestamp, type)`.
- Episode detection & features
  - Mutual-revert episode: at least one revert each way between two users within 48h (configurable).
  - Per page/day: edits, reverts, unique editors, revert ratio, burstiness (Gini across day bins).
  - Per pair: count of mutual revert episodes; median time between reverts.
- Analyze
  - KPIs: revert ratio per page/month; episodes/page/year; top pages by mutual-revert count; anon/reg/bot shares; pre/post trends around known events (optional).
- Deliver
  - Artifacts: `revisions_raw.parquet`, `reverts.parquet`, `episodes.parquet`, `page_monthly.parquet`, `metrics.csv`.
  - Dashboard: timelines with revert overlays, network view (top K pages) of mutual-revert pairs, ranked tables.
- Quality & Ops
  - Cache raw JSON; retries with exponential backoff; honor `maxlag`.
  - Unit tests: pagination continuity; SHA1 revert detection; comment regex; bot-label join integrity.
  - Optional: scheduled monthly refresh via GitHub Actions.
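
The "Detect reverts" step could be sketched roughly as follows. This is one possible reading of the SHA1 map, not a settled design: the function name `detect_sha1_reverts` and the edge field names are illustrative, revisions are assumed to be dicts sorted oldest first, and self-reverts are skipped by choice.

```python
def detect_sha1_reverts(revisions):
    """Return revert edges for one page's revisions, sorted oldest first.

    Each revision is assumed to be a dict with 'revid', 'user',
    'timestamp' and (unless hidden) 'sha1'.
    """
    last_seen = {}  # sha1 -> index of the most recent revision with that content
    edges = []
    for i, rev in enumerate(revisions):
        sha1 = rev.get("sha1")  # may be absent when the content hash is hidden
        j = last_seen.get(sha1) if sha1 else None
        # j == i - 1 means the content did not change (null edit), so it is
        # not treated as a revert.
        if j is not None and j < i - 1:
            for undone in revisions[j + 1 : i]:
                if undone["user"] == rev["user"]:
                    continue  # design choice: ignore self-reverts
                edges.append(
                    {
                        "reverter": rev["user"],
                        "reverted": undone["user"],
                        "timestamp": rev["timestamp"],
                        "type": "sha1",
                        "revid": rev["revid"],
                        "restored_revid": revisions[j]["revid"],
                    }
                )
        if sha1:
            last_seen[sha1] = i
    return edges
```

Every revision between the restored one and the reverting one is counted as undone, so a single revert can produce several edges. Whether that is the right granularity for the revert ratio is worth deciding explicitly and documenting.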
Resources/Instructions
API docs to pin in repo
MediaWiki Action API overview: API:Action_API
Revisions (timestamps, users, sha1, tags): API:Revisions
Users (fetch groups to flag bots): API:Users
Query continuation & etiquette (maxlag): API:Query
(Optional) ORES scores for “damaging/goodfaith” (separate service) if enabled on target wiki
Suggested libraries (choose pairs)
- HTTP: `requests` | `httpx`
- DataFrames: `pandas` | `polars`
- Storage: `duckdb` | `sqlite`
- Viz: `altair` | `plotly`
Sample queries
# Resolve pageids from titles
action=query&titles=<TITLE>
# Full revision stream with metadata (use continuation)
action=query&prop=revisions&titles=<TITLE>&rvprop=ids|timestamp|user|userid|sha1|size|comment|tags&rvlimit=max
# Batch user groups (to detect bots)
action=query&list=users&ususers=<USER1>|<USER2>|...&usprop=groups
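
A sketch of how the revision query above might be paged with continuation. To keep it independent of the HTTP library choice, the request itself is delegated to `fetch_json`, a hypothetical callable that takes a params dict and returns the decoded JSON response. The `formatversion=2` response shape, `rvdir=newer`, `rvstart` and `maxlag=5` are choices made for this sketch rather than requirements of the plan.

```python
def iter_revisions(fetch_json, title, start_iso):
    """Yield (pageid, revision) pairs for one title, following continuation.

    fetch_json: hypothetical callable, params dict -> decoded JSON dict.
    start_iso:  start of the time window, e.g. "2019-01-01T00:00:00Z".
    """
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",  # pages come back as a list
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|userid|sha1|size|comment|tags",
        "rvlimit": "max",
        "rvdir": "newer",  # oldest first, starting at rvstart
        "rvstart": start_iso,
        "maxlag": "5",
    }
    cont = {}
    while True:
        # Merge the previous response's "continue" values into the request.
        data = fetch_json({**params, **cont})
        for page in data.get("query", {}).get("pages", []):
            for rev in page.get("revisions", []):
                yield page.get("pageid"), rev
        cont = data.get("continue")
        if not cont:
            break
```

With `requests`, `fetch_json` could be a small wrapper around a session's `get(...)` call against the wiki's `api.php` endpoint that returns `.json()`. Injecting it this way also makes the pagination-continuity unit test straightforward, since a fake can return canned pages.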
Data handling & ethics
- Avoid editor-level callouts; aggregate at page/topic/time.
- Document false positives/negatives for revert heuristics (partial reverts won’t always SHA1-match).
- Large histories: consider limiting to target windows or top-N pages by recent edits to control volume.
If this issue requires access to 311 data, please answer the following questions:
- Not applicable.
Project Outline (detailed plan for this idea):
Research question
Which sensitive-topic pages show the highest revert intensity and edit-war episodes, and how do editor types (anon/registered/bot) and burstiness relate to those patterns?
Data sources & modules
- `prop=revisions` for edit streams (ids, timestamps, users, sha1, comment, tags).
- `list=users` for user groups (bot).
- (Optional) ORES for damaging/goodfaith signals.
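
A sketch of how the `list=users` responses might be turned into editor-type labels. All helper names are illustrative. The batch size of 50 is assumed as a conservative per-request limit for `ususers`, the response shape assumes `formatversion=2`, and treating a missing or zero `userid` as anonymous is an assumption to verify against real payloads.

```python
import ipaddress


def chunked(items, size=50):
    """Split a list into batches (50 is an assumed per-request limit)."""
    return [items[i : i + size] for i in range(0, len(items), size)]


def is_ip(name):
    """True if the user name parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(name)
        return True
    except ValueError:
        return False


def bot_names(users_payload):
    """Names whose groups include 'bot' in one list=users response."""
    users = users_payload.get("query", {}).get("users", [])
    return {u["name"] for u in users if "bot" in u.get("groups", [])}


def editor_type(user, userid, bots):
    """Classify one editor as 'anon', 'bot', or 'registered'."""
    if not userid or is_ip(user or ""):
        return "anon"
    return "bot" if user in bots else "registered"
```

Group membership reflects the account's current flags, so an account that held the bot flag only in the past would be labelled as registered. That limitation belongs in the caveats if it turns out to matter for the cohort.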
Method
- Build page cohort and resolve `pageid`.
- Pull full revision streams within the window; normalize and deduplicate.
- Detect reverts via SHA1 map + tag/comment heuristics; construct revert edges and mutual-revert episodes (48h rule).
- Compute page-level monthly metrics: edits, reverts, revert ratio, unique editors, burstiness; pair-level mutual-revert counts.
- Rank pages by revert ratio and episode density; visualize timelines with event overlays.
Key metrics
- Revert ratio = reverts / total edits (per page, per month).
- Mutual-revert episodes per page/year; median episode length.
- Editor mix during episodes: % anon, % bot, % registered.
- Burstiness (Gini of edits/day); correlation with protection events (optional join to protection dataset from the other issue).
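
The two core metrics can be sketched in a few lines. The Gini coefficient below uses the standard sorted-rank formula; the function names are illustrative, and returning 0.0 for empty or all-zero input is a convention chosen for this sketch.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0.0 means perfectly even)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0  # convention for this sketch: no edits, no concentration
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with ranks i = 1..n
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


def revert_ratio(n_reverts, n_edits):
    """Reverts divided by total edits; 0.0 when there are no edits."""
    return n_reverts / n_edits if n_edits else 0.0
```

For burstiness, the day bins need to include days with zero edits. Dropping them would make a page edited on only three days of a month look evenly spread rather than bursty.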
Deliverables
- Clean tables (`revisions_raw.parquet`, `reverts.parquet`, `episodes.parquet`, `page_monthly.parquet`).
- Reproducible notebook + `reports/edit_war_intensity.md`.
- Streamlit/Altair dashboard: page timelines, top pages, pair network (top K).
Caveats & limitations
- SHA1 equality detects exact reverts only; partial reverts rely on tags/comments.
- Some tags aren’t present on older edits; comments can be noisy.
- High-volume pages produce large payloads; use paging and windowing, and document any sampling.
Implementation notes
- Honor `continue` tokens and `maxlag`; implement retries with `try/except` and `print()` diagnostics for I/O failures; terminate with a trace on dtype mismatches.
- Keys: `(pageid, revid)` for revisions; `(pageid, reverter_userid, reverted_userid, ts)` for revert events.
- Persist a query manifest (params, timestamps, continuation cursors) for provenance.