
MediaWiki API Project: Edit Wars and Reverts

Open chinaexpert1 opened this issue 4 months ago • 0 comments

Overview

Measure edit-war and revert intensity on culturally charged Wikipedia pages by detecting reverts, mutual-revert “episodes,” and burstiness in editing activity. Deliver a reproducible dataset and dashboard that surface where contention concentrates and how it evolves over time.

Action Items

If this is the beginning (research & design)

  • Define scope: 50–200 English Wikipedia articles across domains (elections, immigration, policing, reproductive rights, conflict, disinformation). Save as seed_pages.csv.

  • Time window: 2019 → present (adjustable).

  • Metrics & definitions: revert (via SHA1 match and/or tags/comments), revert ratio, mutual-revert pairs, episode detection (≥2 mutual reverts between the same two editors within 48h), editor mix (anon vs registered vs bot), burstiness (Gini of edits/day).

  • Revert detection plan:

    • SHA1-based: a revision whose sha1 equals a previous revision’s sha1 implies a revert-to.
    • Comment/tag-based: look for tags (rollback/undo) and comment regex (revert|rv|undid|rollback) (case-insensitive).
  • Tooling (pick one per pair, keep consistent): requests or httpx; pandas or polars; storage duckdb or sqlite; viz altair or plotly.

  • Ethics: aggregate reporting; no profiling of individual editors; clearly document limitations of heuristics.
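The two revert heuristics above can be sketched as follows. This is a minimal sketch, not a final implementation: the tag names (`mw-undo`, `mw-rollback`, `mw-manual-revert`) are assumptions to verify against the target wiki's Special:Tags, and the regex adds word boundaries so that e.g. "survey" does not match "rv".

```python
import re

# Word-boundary, case-insensitive match on common revert markers in edit
# comments; boundaries avoid false hits such as "survey" matching "rv".
REVERT_RE = re.compile(r"\b(revert|rv|undid|rollback)\b", re.IGNORECASE)

# Tag names used for software-assisted reverts (assumption -- verify
# against the target wiki's Special:Tags page).
REVERT_TAGS = {"mw-undo", "mw-rollback", "mw-manual-revert"}

def detect_reverts(revisions):
    """Label each revision as a revert via SHA1 match, tags, or comment.

    `revisions` must be ordered oldest-first and carry the rvprop fields
    requested in this issue (sha1, comment, tags).
    """
    seen_sha1 = set()
    labeled = []
    for rev in revisions:
        sha1_revert = rev.get("sha1") in seen_sha1   # exact revert-to
        tag_revert = bool(REVERT_TAGS & set(rev.get("tags", [])))
        comment_revert = bool(REVERT_RE.search(rev.get("comment", "")))
        rtype = ("sha1" if sha1_revert else
                 "tag" if tag_revert else
                 "comment" if comment_revert else None)
        labeled.append({**rev, "is_revert": rtype is not None,
                        "revert_type": rtype})
        if rev.get("sha1"):
            seen_sha1.add(rev["sha1"])
    return labeled
```

Checking SHA1 membership before inserting the current revision's hash is what makes the first occurrence of each state a non-revert.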

If researched and ready (implementation steps)

  1. Seed & resolve pages

    • Load seed_pages.csv, resolve pageid via action=query&titles=<title>; record redirects.
  2. Fetch revision streams

    • For each page and window: prop=revisions&rvprop=ids|timestamp|user|userid|sha1|size|comment|tags&rvlimit=max with continuation. Normalize timestamps (UTC).
  3. Identify bots & anon

    • Build editor list from the stream; call list=users&ususers=<batch>&usprop=groups to mark bot accounts; mark anonymous edits via the anon flag (userid 0 / username is an IP address).
  4. Detect reverts

    • SHA1 map per page to detect revert-to targets.
    • Comment/tag heuristics for partial reverts/undos.
    • Construct revert edges: (reverter → reverted, timestamp, type).
  5. Episode detection & features

    • Mutual-revert episode: at least one revert each way between two users within 48h (configurable).
    • Per page/day: edits, reverts, unique editors, revert ratio, burstiness (Gini across day bins).
    • Per pair: count of mutual revert episodes; median time between reverts.
  6. Analyze

    • KPIs: revert ratio per page/month; episodes/page/year; top pages by mutual-revert count; anon/reg/bot shares; pre/post trends around known events (optional).
  7. Deliver

    • Artifacts: revisions_raw.parquet, reverts.parquet, episodes.parquet, page_monthly.parquet, metrics.csv.
    • Dashboard: timelines with revert overlays, network view (top K pages) of mutual-revert pairs, ranked tables.
  8. Quality & Ops

    • Cache raw JSON; retries with exponential backoff; honor maxlag.
    • Unit tests: pagination continuity; SHA1 revert detection; comment regex; bot-label join integrity.
    • Optional: scheduled monthly refresh via GitHub Actions.
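Step 2's paginated fetch can be sketched with requests. This is a sketch under stated assumptions, not the project's final client: the endpoint constant assumes English Wikipedia, and the parameters mirror the sample query later in this issue.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"  # assumption: English Wikipedia

def apply_continuation(params, data):
    """Merge the response's continuation tokens into params.

    Returns False when the result set is exhausted.
    """
    if "continue" not in data:
        return False
    params.update(data["continue"])  # rvcontinue + continue tokens
    return True

def fetch_revisions(pageid, rvstart=None, rvend=None, session=None):
    """Yield every revision for one page, oldest first, following continuation."""
    s = session or requests.Session()
    params = {
        "action": "query", "format": "json", "formatversion": "2",
        "pageids": pageid, "prop": "revisions",
        "rvprop": "ids|timestamp|user|userid|sha1|size|comment|tags",
        "rvlimit": "max", "rvdir": "newer", "maxlag": "5",
    }
    if rvstart:
        # With rvdir=newer, rvstart is the *older* bound,
        # e.g. "2019-01-01T00:00:00Z"; API timestamps are already UTC.
        params["rvstart"] = rvstart
    if rvend:
        params["rvend"] = rvend
    while True:
        resp = s.get(API, params=params, timeout=60)
        resp.raise_for_status()
        data = resp.json()
        for page in data.get("query", {}).get("pages", []):
            yield from page.get("revisions", [])
        if not apply_continuation(params, data):
            break
```

Keeping the continuation merge in its own small function makes the pagination-continuity unit test trivial.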

Resources/Instructions

API docs to pin in repo

MediaWiki Action API overview: API:Action_API
Revisions (timestamps, users, sha1, tags): API:Revisions
Users (fetch groups to flag bots): API:Users
Query continuation & etiquette (maxlag): API:Query
(Optional) ORES scores for “damaging/goodfaith” (separate service) if enabled on target wiki

Suggested libraries (choose one from each pair)

  • HTTP: requests | httpx
  • DataFrames: pandas | polars
  • Storage: duckdb | sqlite
  • Viz: altair | plotly

Sample queries

# Resolve pageids from titles
action=query&titles=<TITLE>

# Full revision stream with metadata (use continuation)
action=query&prop=revisions&titles=<TITLE>&rvprop=ids|timestamp|user|userid|sha1|size|comment|tags&rvlimit=max

# Batch user groups (to detect bots)
action=query&list=users&ususers=<USER1>|<USER2>|...&usprop=groups
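The user-groups query above can be issued with requests as sketched below; the batch size of 50 assumes the default per-request limit for ususers, and the anon check is a heuristic (anon flag under formatversion=2, or a username that parses as an IP address).

```python
import requests
from ipaddress import ip_address

API = "https://en.wikipedia.org/w/api.php"  # assumption: English Wikipedia

def is_anon(rev):
    """Treat a revision as anonymous if the API set the `anon` flag
    (formatversion=2) or the username parses as an IP address."""
    if rev.get("anon"):
        return True
    try:
        ip_address(rev.get("user") or "")
        return True
    except ValueError:
        return False

def fetch_bot_flags(usernames, session=None):
    """Map each username to True if it holds the `bot` group.

    Batches of 50 usernames per request (assumed default ususers limit).
    """
    s = session or requests.Session()
    flags = {}
    for i in range(0, len(usernames), 50):
        resp = s.get(API, params={
            "action": "query", "format": "json", "formatversion": "2",
            "list": "users", "ususers": "|".join(usernames[i:i + 50]),
            "usprop": "groups",
        }, timeout=30)
        resp.raise_for_status()
        for u in resp.json()["query"]["users"]:
            flags[u.get("name")] = "bot" in u.get("groups", [])
    return flags
```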

Data handling & ethics

  • Avoid editor-level callouts; aggregate at page/topic/time.

  • Document false positives/negatives for revert heuristics (partial reverts won’t always SHA1-match).

  • Large histories: consider limiting to target windows or top-N pages by recent edits to control volume.

  • If this issue requires access to 311 data: not applicable.

Project Outline (detailed plan for this idea):

Research question: Which sensitive-topic pages show the highest revert intensity and edit-war episodes, and how do editor types (anon/registered/bot) and burstiness relate to those patterns?

Data sources & modules

  • prop=revisions for edit streams (ids, timestamps, users, sha1, comment, tags).
  • list=users for user groups (bot).
  • (Optional) ORES for damaging/goodfaith signals.

Method

  1. Build page cohort and resolve pageid.
  2. Pull full revision streams within the window; normalize and deduplicate.
  3. Detect reverts via SHA1 map + tag/comment heuristics; construct revert edges and mutual-revert episodes (48h rule).
  4. Compute page-level monthly metrics: edits, reverts, revert ratio, unique editors, burstiness; pair-level mutual-revert counts.
  5. Rank pages by revert ratio and episode density; visualize timelines with event overlays.
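Step 3's mutual-revert rule can be sketched as follows. The 48h window matches the definition above; counting at most one episode per opening revert is a simplifying assumption, not part of the spec.

```python
from datetime import datetime, timedelta

def find_episodes(revert_edges, window=timedelta(hours=48)):
    """Detect mutual-revert episodes from (reverter, reverted, ts) edges.

    An episode: at least one revert in each direction between the same two
    editors, with the opposing reverts no more than `window` apart.
    """
    by_pair = {}
    for reverter, reverted, ts in sorted(revert_edges, key=lambda e: e[2]):
        if reverter == reverted:
            continue  # ignore self-reverts
        by_pair.setdefault(frozenset((reverter, reverted)), []).append((reverter, ts))
    episodes = []
    for pair, events in by_pair.items():
        for i, (actor_i, ts_i) in enumerate(events):
            for actor_j, ts_j in events[i + 1:]:
                if ts_j - ts_i > window:
                    break  # events are time-ordered; nothing later fits
                if actor_j != actor_i:
                    # Found a revert in the opposite direction within the
                    # window: record one episode for this opening revert.
                    episodes.append((tuple(sorted(pair)), ts_i, ts_j))
                    break
    return episodes
```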

Key metrics

  • Revert ratio = reverts / total edits (per page, per month).
  • Mutual-revert episodes per page/year; median episode length.
  • Editor mix during episodes: % anon, % bot, % registered.
  • Burstiness (Gini of edits/day); correlation with protection events (optional join to protection dataset from the other issue).
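The burstiness metric over daily edit counts can be computed with the standard Gini formula; day binning is assumed to happen upstream (one count per day bin in the window).

```python
def gini(counts):
    """Gini coefficient of daily edit counts: 0 = perfectly even, ->1 = bursty."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard closed form: G = 2*sum(i*x_i)/(n*total) - (n+1)/n,
    # with i the 1-based rank of x_i in ascending order.
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n
```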

Deliverables

  • Clean tables (revisions_raw.parquet, reverts.parquet, episodes.parquet, page_monthly.parquet).
  • Reproducible notebook + reports/edit_war_intensity.md.
  • Streamlit/Altair dashboard: page timelines, top pages, pair network (top K).
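The page_monthly table can be derived with pandas roughly as below; the column names (pageid, timestamp, user, is_revert) are assumptions matching the tables above, not a fixed schema.

```python
import pandas as pd

def monthly_metrics(df):
    """Per page/month KPIs from a revisions DataFrame.

    Assumes columns: pageid, timestamp (tz-aware datetime64), user (str),
    is_revert (bool, from the revert-detection step).
    """
    g = df.groupby(["pageid", pd.Grouper(key="timestamp", freq="MS")])
    out = g.agg(edits=("timestamp", "size"),
                reverts=("is_revert", "sum"),
                unique_editors=("user", "nunique"))
    out["revert_ratio"] = out["reverts"] / out["edits"]
    return out.reset_index()
```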

Caveats & limitations

  • SHA1 equality detects exact reverts only; partial reverts rely on tags/comments.
  • Some tags aren’t present on older edits; comments can be noisy.
  • High-volume pages produce large payloads—use paging & windowing; document any sampling.

Implementation notes

  • Honor continue tokens and maxlag; wrap API I/O in retries with exponential backoff and log failures; fail fast with a full traceback on dtype mismatches.
  • Keys: (pageid, revid) for revisions; (pageid, reverter_userid, reverted_userid, ts) for revert events.
  • Persist a query manifest (params, timestamps, continuation cursors) for provenance.
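The retry/backoff note above can be sketched as follows; treating a maxlag error response and the Retry-After header this way follows MediaWiki API etiquette, but the exact retry budget and delays here are assumptions.

```python
import time
import requests

API = "https://en.wikipedia.org/w/api.php"  # assumption: English Wikipedia

def backoff_delay(attempt, base=1.0):
    """Exponential backoff: 1s, 2s, 4s, ... per retry attempt."""
    return base * (2 ** attempt)

def api_get(params, session=None, max_retries=5):
    """GET with maxlag set; back off on lag errors and transient failures."""
    s = session or requests.Session()
    params = {**params, "format": "json", "maxlag": "5"}
    for attempt in range(max_retries):
        try:
            resp = s.get(API, params=params, timeout=60)
            resp.raise_for_status()
            data = resp.json()
            if data.get("error", {}).get("code") == "maxlag":
                # Server replication lag: wait (Retry-After if present),
                # then retry the identical request.
                time.sleep(float(resp.headers.get("Retry-After",
                                                  backoff_delay(attempt))))
                continue
            return data
        except (requests.RequestException, ValueError):
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff_delay(attempt))
    raise RuntimeError("exhausted retries (persistent maxlag)")
```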

chinaexpert1 · Sep 13 '25 21:09