
MediaWiki API Project: Edit Wars and Reverts

Open chinaexpert1 opened this issue 4 months ago • 0 comments

Overview

Measure edit-war and revert intensity on culturally charged Wikipedia pages by detecting reverts, mutual-revert “episodes,” and burstiness in editing activity. Deliver a reproducible dataset and dashboard that surface where contention concentrates and how it evolves over time.

Action Items

If this is the beginning (research & design)

  • Define scope: 50–200 English Wikipedia articles across domains (elections, immigration, policing, reproductive rights, conflict, disinformation). Save as seed_pages.csv.

  • Time window: 2019 → present (adjustable).

  • Metrics & definitions: revert (via SHA1 match and/or tags/comments), revert ratio, mutual-revert pairs, episode detection (≥2 mutual reverts between the same two editors within 48h), editor mix (anon vs registered vs bot), burstiness (Gini of edits/day).

  • Revert detection plan:

    • SHA1-based: a revision whose sha1 equals a previous revision’s sha1 implies a revert-to.
    • Comment/tag-based: look for tags (rollback/undo) and comment regex (revert|rv|undid|rollback) (case-insensitive).
  • Tooling (pick one per pair, keep consistent): requests or httpx; pandas or polars; storage duckdb or sqlite; viz altair or plotly.

  • Ethics: aggregate reporting; no profiling of individual editors; clearly document limitations of heuristics.
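The two revert heuristics above can be sketched as follows. This is a minimal sketch, not a final implementation: the tag names (`mw-undo`, `mw-rollback`, `mw-manual-revert`) are assumptions to verify against the target wiki's Special:Tags, and the regex adds word boundaries so that e.g. "survey" does not match "rv".

```python
import re

# Word-boundary, case-insensitive match on common revert markers in edit
# comments; boundaries avoid false hits such as "survey" matching "rv".
REVERT_RE = re.compile(r"\b(revert|rv|undid|rollback)\b", re.IGNORECASE)

# Tag names used for software-assisted reverts (assumption -- verify
# against the target wiki's Special:Tags page).
REVERT_TAGS = {"mw-undo", "mw-rollback", "mw-manual-revert"}

def detect_reverts(revisions):
    """Label each revision as a revert via SHA1 match, tags, or comment.

    `revisions` must be ordered oldest-first and carry the rvprop fields
    requested in this issue (sha1, comment, tags).
    """
    seen_sha1 = set()
    labeled = []
    for rev in revisions:
        sha1_revert = rev.get("sha1") in seen_sha1   # exact revert-to
        tag_revert = bool(REVERT_TAGS & set(rev.get("tags", [])))
        comment_revert = bool(REVERT_RE.search(rev.get("comment", "")))
        rtype = ("sha1" if sha1_revert else
                 "tag" if tag_revert else
                 "comment" if comment_revert else None)
        labeled.append({**rev, "is_revert": rtype is not None,
                        "revert_type": rtype})
        if rev.get("sha1"):
            seen_sha1.add(rev["sha1"])
    return labeled
```

Checking SHA1 membership before inserting the current revision's hash is what makes the first occurrence of each state a non-revert.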

If researched and ready (implementation steps)

  1. Seed & resolve pages

    • Load seed_pages.csv, resolve pageid via action=query&titles=<title>; record redirects.
  2. Fetch revision streams

    • For each page and window: prop=revisions&rvprop=ids|timestamp|user|userid|sha1|size|comment|tags&rvlimit=max with continuation. Normalize timestamps (UTC).
  3. Identify bots & anon

    • Build editor list from the stream; call list=users&ususers=<batch>&usprop=groups to mark bot accounts; mark anonymous edits via the anon flag (userid 0 / username is an IP address).
  4. Detect reverts

    • SHA1 map per page to detect revert-to targets.
    • Comment/tag heuristics for partial reverts/undos.
    • Construct revert edges: (reverter → reverted, timestamp, type).
  5. Episode detection & features

    • Mutual-revert episode: at least one revert each way between two users within 48h (configurable).
    • Per page/day: edits, reverts, unique editors, revert ratio, burstiness (Gini across day bins).
    • Per pair: count of mutual revert episodes; median time between reverts.
  6. Analyze

    • KPIs: revert ratio per page/month; episodes/page/year; top pages by mutual-revert count; anon/reg/bot shares; pre/post trends around known events (optional).
  7. Deliver

    • Artifacts: revisions_raw.parquet, reverts.parquet, episodes.parquet, page_monthly.parquet, metrics.csv.
    • Dashboard: timelines with revert overlays, network view (top K pages) of mutual-revert pairs, ranked tables.
  8. Quality & Ops

    • Cache raw JSON; retries with exponential backoff; honor maxlag.
    • Unit tests: pagination continuity; SHA1 revert detection; comment regex; bot-label join integrity.
    • Optional: scheduled monthly refresh via GitHub Actions.
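Step 2's paginated fetch can be sketched with requests. This is a sketch under stated assumptions, not the project's final client: the endpoint constant assumes English Wikipedia, and the parameters mirror the sample query later in this issue.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"  # assumption: English Wikipedia

def apply_continuation(params, data):
    """Merge the response's continuation tokens into params.

    Returns False when the result set is exhausted.
    """
    if "continue" not in data:
        return False
    params.update(data["continue"])  # rvcontinue + continue tokens
    return True

def fetch_revisions(pageid, rvstart=None, rvend=None, session=None):
    """Yield every revision for one page, oldest first, following continuation."""
    s = session or requests.Session()
    params = {
        "action": "query", "format": "json", "formatversion": "2",
        "pageids": pageid, "prop": "revisions",
        "rvprop": "ids|timestamp|user|userid|sha1|size|comment|tags",
        "rvlimit": "max", "rvdir": "newer", "maxlag": "5",
    }
    if rvstart:
        # With rvdir=newer, rvstart is the *older* bound,
        # e.g. "2019-01-01T00:00:00Z"; API timestamps are already UTC.
        params["rvstart"] = rvstart
    if rvend:
        params["rvend"] = rvend
    while True:
        resp = s.get(API, params=params, timeout=60)
        resp.raise_for_status()
        data = resp.json()
        for page in data.get("query", {}).get("pages", []):
            yield from page.get("revisions", [])
        if not apply_continuation(params, data):
            break
```

Keeping the continuation merge in its own small function makes the pagination-continuity unit test trivial.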

Resources/Instructions

API docs to pin in repo

MediaWiki Action API overview: API:Action_API
Revisions (timestamps, users, sha1, tags): API:Revisions
Users (fetch groups to flag bots): API:Users
Query continuation & etiquette (maxlag): API:Query
(Optional) ORES scores for “damaging/goodfaith” (separate service) if enabled on target wiki

Suggested libraries (choose one from each pair)

  • HTTP: requests | httpx
  • DataFrames: pandas | polars
  • Storage: duckdb | sqlite
  • Viz: altair | plotly

Sample queries

# Resolve pageids from titles
action=query&titles=<TITLE>

# Full revision stream with metadata (use continuation)
action=query&prop=revisions&titles=<TITLE>&rvprop=ids|timestamp|user|userid|sha1|size|comment|tags&rvlimit=max

# Batch user groups (to detect bots)
action=query&list=users&ususers=<USER1>|<USER2>|...&usprop=groups
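The user-groups query above can be issued with requests as sketched below; the batch size of 50 assumes the default per-request limit for ususers, and the anon check is a heuristic (anon flag under formatversion=2, or a username that parses as an IP address).

```python
import requests
from ipaddress import ip_address

API = "https://en.wikipedia.org/w/api.php"  # assumption: English Wikipedia

def is_anon(rev):
    """Treat a revision as anonymous if the API set the `anon` flag
    (formatversion=2) or the username parses as an IP address."""
    if rev.get("anon"):
        return True
    try:
        ip_address(rev.get("user") or "")
        return True
    except ValueError:
        return False

def fetch_bot_flags(usernames, session=None):
    """Map each username to True if it holds the `bot` group.

    Batches of 50 usernames per request (assumed default ususers limit).
    """
    s = session or requests.Session()
    flags = {}
    for i in range(0, len(usernames), 50):
        resp = s.get(API, params={
            "action": "query", "format": "json", "formatversion": "2",
            "list": "users", "ususers": "|".join(usernames[i:i + 50]),
            "usprop": "groups",
        }, timeout=30)
        resp.raise_for_status()
        for u in resp.json()["query"]["users"]:
            flags[u.get("name")] = "bot" in u.get("groups", [])
    return flags
```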

Data handling & ethics

  • Avoid editor-level callouts; aggregate at page/topic/time.

  • Document false positives/negatives for revert heuristics (partial reverts won’t always SHA1-match).

  • Large histories: consider limiting to target windows or top-N pages by recent edits to control volume.

  • If this issue requires access to 311 data: not applicable.

Project Outline (detailed plan for this idea):

Research question: Which sensitive-topic pages show the highest revert intensity and edit-war episodes, and how do editor types (anon/registered/bot) and burstiness relate to those patterns?

Data sources & modules

  • prop=revisions for edit streams (ids, timestamps, users, sha1, comment, tags).
  • list=users for user groups (bot).
  • (Optional) ORES for damaging/goodfaith signals.

Method

  1. Build page cohort and resolve pageid.
  2. Pull full revision streams within the window; normalize and deduplicate.
  3. Detect reverts via SHA1 map + tag/comment heuristics; construct revert edges and mutual-revert episodes (48h rule).
  4. Compute page-level monthly metrics: edits, reverts, revert ratio, unique editors, burstiness; pair-level mutual-revert counts.
  5. Rank pages by revert ratio and episode density; visualize timelines with event overlays.
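Step 3's mutual-revert rule can be sketched as follows. The 48h window matches the definition above; counting at most one episode per opening revert is a simplifying assumption, not part of the spec.

```python
from datetime import datetime, timedelta

def find_episodes(revert_edges, window=timedelta(hours=48)):
    """Detect mutual-revert episodes from (reverter, reverted, ts) edges.

    An episode: at least one revert in each direction between the same two
    editors, with the opposing reverts no more than `window` apart.
    """
    by_pair = {}
    for reverter, reverted, ts in sorted(revert_edges, key=lambda e: e[2]):
        if reverter == reverted:
            continue  # ignore self-reverts
        by_pair.setdefault(frozenset((reverter, reverted)), []).append((reverter, ts))
    episodes = []
    for pair, events in by_pair.items():
        for i, (actor_i, ts_i) in enumerate(events):
            for actor_j, ts_j in events[i + 1:]:
                if ts_j - ts_i > window:
                    break  # events are time-ordered; nothing later fits
                if actor_j != actor_i:
                    # Found a revert in the opposite direction within the
                    # window: record one episode for this opening revert.
                    episodes.append((tuple(sorted(pair)), ts_i, ts_j))
                    break
    return episodes
```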

Key metrics

  • Revert ratio = reverts / total edits (per page, per month).
  • Mutual-revert episodes per page/year; median episode length.
  • Editor mix during episodes: % anon, % bot, % registered.
  • Burstiness (Gini of edits/day); correlation with protection events (optional join to protection dataset from the other issue).
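The burstiness metric over daily edit counts can be computed with the standard Gini formula; day binning is assumed to happen upstream (one count per day bin in the window).

```python
def gini(counts):
    """Gini coefficient of daily edit counts: 0 = perfectly even, ->1 = bursty."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard closed form: G = 2*sum(i*x_i)/(n*total) - (n+1)/n,
    # with i the 1-based rank of x_i in ascending order.
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n
```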

Deliverables

  • Clean tables (revisions_raw.parquet, reverts.parquet, episodes.parquet, page_monthly.parquet).
  • Reproducible notebook + reports/edit_war_intensity.md.
  • Streamlit/Altair dashboard: page timelines, top pages, pair network (top K).
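The page_monthly table can be derived with pandas roughly as below; the column names (pageid, timestamp, user, is_revert) are assumptions matching the tables above, not a fixed schema.

```python
import pandas as pd

def monthly_metrics(df):
    """Per page/month KPIs from a revisions DataFrame.

    Assumes columns: pageid, timestamp (tz-aware datetime64), user (str),
    is_revert (bool, from the revert-detection step).
    """
    g = df.groupby(["pageid", pd.Grouper(key="timestamp", freq="MS")])
    out = g.agg(edits=("timestamp", "size"),
                reverts=("is_revert", "sum"),
                unique_editors=("user", "nunique"))
    out["revert_ratio"] = out["reverts"] / out["edits"]
    return out.reset_index()
```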

Caveats & limitations

  • SHA1 equality detects exact reverts only; partial reverts rely on tags/comments.
  • Some tags aren’t present on older edits; comments can be noisy.
  • High-volume pages produce large payloads—use paging & windowing; document any sampling.

Implementation notes

  • Honor continue tokens and maxlag; wrap API I/O in retries with exponential backoff and log failures; fail fast with a full traceback on dtype mismatches.
  • Keys: (pageid, revid) for revisions; (pageid, reverter_userid, reverted_userid, ts) for revert events.
  • Persist a query manifest (params, timestamps, continuation cursors) for provenance.
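The retry/backoff note above can be sketched as follows; treating a maxlag error response and the Retry-After header this way follows MediaWiki API etiquette, but the exact retry budget and delays here are assumptions.

```python
import time
import requests

API = "https://en.wikipedia.org/w/api.php"  # assumption: English Wikipedia

def backoff_delay(attempt, base=1.0):
    """Exponential backoff: 1s, 2s, 4s, ... per retry attempt."""
    return base * (2 ** attempt)

def api_get(params, session=None, max_retries=5):
    """GET with maxlag set; back off on lag errors and transient failures."""
    s = session or requests.Session()
    params = {**params, "format": "json", "maxlag": "5"}
    for attempt in range(max_retries):
        try:
            resp = s.get(API, params=params, timeout=60)
            resp.raise_for_status()
            data = resp.json()
            if data.get("error", {}).get("code") == "maxlag":
                # Server replication lag: wait (Retry-After if present),
                # then retry the identical request.
                time.sleep(float(resp.headers.get("Retry-After",
                                                  backoff_delay(attempt))))
                continue
            return data
        except (requests.RequestException, ValueError):
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff_delay(attempt))
    raise RuntimeError("exhausted retries (persistent maxlag)")
```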

chinaexpert1 · Sep 13 '25 21:09