
MediaWiki API Project: Deletions and Notability Bias

Open chinaexpert1 opened this issue 4 months ago • 0 comments

Overview

Analyze deletion patterns and notability bias on Wikipedia by mining page deletion logs (delete, restore, move) and summarizing outcomes, reasons, and time trends across sensitive topic areas. Deliver a reproducible dataset, metrics, and a dashboard that surfaces where deletion actions are concentrated and which policy reasons dominate.

Action Items

If this is the beginning (research & design)

  • Define scope: focus on the Main namespace (articles) and a shortlist of sensitive topical buckets (e.g., human rights, migration, public health, policing, local politicians, small nonprofits). Use keyword lists and seed categories to build topic → title matching rules, and document the heuristics (see the YAML sketch after this list).
  • Metrics & windows: monthly deletion counts, deletion rate per 10k pages (normalize by topic size), top policy reasons (speedy codes like G11/A7/A9 where present in comments), median time-to-deletion for newly created pages (when derivable), restore rate, and page move-to-draft frequency.
  • Methods: rely on the MediaWiki Action API: list=logevents (letype takes a single value, so query letype=delete and letype=move separately; restore events appear under letype=delete with action delete/restore), prop=info (namespace), and optionally prop=revisions (for creation timestamps on survivors). Avoid editor-level analysis; aggregate by topic and time.
  • Tooling (pick 1 per pair): requests or httpx; pandas or polars; duckdb or sqlite; altair or plotly.
  • Ethics & governance: do not infer sensitive personal attributes; aggregate reporting only; document that some logs are redacted and some policies are paraphrased in comments.

If researched and ready (implementation steps)

  1. Ingest logs

    • Pull deletion-related logs via list=logevents. letype takes a single value per request, so run one pass with letype=delete (which covers delete, restore, and revision-deletion actions, distinguishable via the returned action field) and one with letype=move. Use leprop=ids|title|timestamp|comment|details|type|user and paginate with continue (see the fetch sketch under Sample queries below).
    • Restrict to namespace 0 (articles) and, optionally, 118 (Draft) to study move-to-draft patterns.
  2. Topic assignment

    • Assign each log row to a topic bucket via title regex, seed categories (when the page still exists), and curated keyword lists. Persist the rules to YAML (sketched above) and keep a false-positive review set.
  3. Reason extraction

    • Parse the comment field for policy shorthand (e.g., G11, A7, A9, G3, COPYVIO) and free-text phrases; map matches to a standardized reason taxonomy (a parser sketch follows this list).
  4. Creation time & time-to-deletion (when possible)

    • For titles that still exist or were restored, get first revision time via prop=revisions&rvlimit=1&rvdir=newer to estimate survival and time-to-deletion/restore on matched pairs.
  5. Normalize & aggregate

    • Compute monthly: total deletions, deletions per 10k pages (normalize by topic size using current category counts as an approximate denominator), reason distribution, restore rate, and move-to-draft rate.
  6. Deliver

    • Artifacts: deletion_logs.parquet, reasons.parquet, topic_monthly.parquet, metrics.csv.
    • Dashboard: trends by topic, stacked bars for reasons, restore vs delete ratios, and top titles by repeated actions.
    • Methods README with exact queries, assumptions, reason-mapping table, and limitations.
  7. Quality & Ops

    • Caching, retries with exponential backoff; store raw JSON; keep a query manifest (params + timestamps).
    • Unit tests: pagination/continuation, reason parser regex, topic assignment precision on a labeled sample.
    • Optional: scheduled monthly refresh via GitHub Actions.
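
As a starting point for step 3, a minimal reason parser might look like the following; the regexes and canonical labels are illustrative and incomplete, to be extended against the hand-labeled sample:

```python
import re

# Illustrative regex -> canonical reason mapping (incomplete; extend from the labeled sample).
REASON_PATTERNS = [
    (re.compile(r"\bG11\b|unambiguous (advertising|promotion)", re.I), "G11_spam"),
    (re.compile(r"\bA7\b|no (indication|claim) of (importance|significance)", re.I), "A7_notability"),
    (re.compile(r"\bA9\b", re.I), "A9_music_notability"),
    (re.compile(r"\bG3\b|vandalism|blatant hoax", re.I), "G3_vandalism"),
    (re.compile(r"copyvio|copyright violation|\bG12\b", re.I), "copyvio"),
]

def parse_reason(comment: str) -> str:
    """Map a free-text deletion-log comment to a canonical reason label."""
    for pattern, label in REASON_PATTERNS:
        if pattern.search(comment or ""):
            return label
    return "other"

# The same fixtures can seed the unit tests called for in step 7.
assert parse_reason("Speedy deleted per (CSD G11): unambiguous advertising") == "G11_spam"
assert parse_reason("housekeeping") == "other"
```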

Resources/Instructions

Docs (pin in repo)

MediaWiki Action API overview: https://www.mediawiki.org/wiki/API:Action_API
Logevents: https://www.mediawiki.org/wiki/API:Logevents
Revisions: https://www.mediawiki.org/wiki/API:Revisions
Info (namespace/protection): https://www.mediawiki.org/wiki/API:Info
Query & continuation: https://www.mediawiki.org/wiki/API:Query

Suggested libraries (choose one per pair)

  • HTTP: requests | httpx
  • DataFrames: pandas | polars
  • Storage: duckdb | sqlite
  • Viz: altair | plotly

Sample queries

# Deletion and restore logs for a window (ISO 8601). letype takes one value; restore events
# arrive as letype=delete with action delete/restore. ledir=newer lets lestart precede leend.
action=query&list=logevents&letype=delete&leprop=ids|title|timestamp|comment|details|type|user&ledir=newer&lestart=2025-09-01T00:00:00Z&leend=2025-09-30T23:59:59Z&lelimit=max&format=json

# Moves (detect move-to-draft or out of draft)
action=query&list=logevents&letype=move&leprop=title|timestamp|comment|details|type|action&lelimit=max

# First revision timestamp (creation) for existing/restored pages
action=query&prop=revisions&rvprop=timestamp&rvlimit=1&rvdir=newer&titles=<TITLE>

Data handling notes

  • Respect maxlag and continue tokens; throttle politely.

  • Prefer pageid when resolvable; for deleted pages use (title, timestamp) as stable surrogate keys.

  • Keep a small hand-labeled set to score topic assignment and reason parsing.

  • This issue does not require access to 311 data; the template questions are not applicable.

Project Outline (detailed plan for this idea):

Research question: Are certain sensitive topics disproportionately affected by page deletions, and which policy reasons are most frequently cited? Do restore rates differ by topic, and how have patterns shifted over time?

Data sources & modules

  • list=logevents for delete, restore, and move.
  • prop=revisions for first (creation) and, if needed, last timestamps of surviving titles.
  • prop=info for namespace checks and basic metadata.

Method

  1. Collect all deletion-related log events for the analysis window (e.g., 2018–present).
  2. Standardize fields, parse reasons, and assign topic buckets via heuristics + seed categories (documented).
  3. Join where possible to surviving/restored pages to estimate time-to-deletion / time-to-restore (see the matching sketch after this list).
  4. Produce monthly aggregates with normalization (per 10k pages in topic).
  5. Visualize trends and reason distributions; flag months with structural breaks.
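
For step 3, one way to estimate time-to-deletion on matched pairs is pandas merge_asof; the frames and column names here are illustrative, with deletions coming from the log ingest and creations from the first-revision queries:

```python
import pandas as pd

# Illustrative inputs; column names are placeholders.
deletions = pd.DataFrame({
    "title": ["Example article"],
    "deleted_at": pd.to_datetime(["2025-09-10T12:00:00Z"]),
})
creations = pd.DataFrame({
    "title": ["Example article"],
    "created_at": pd.to_datetime(["2025-09-01T08:00:00Z"]),
})

# Match each deletion to the most recent creation of the same title that precedes it.
matched = pd.merge_asof(
    deletions.sort_values("deleted_at"),
    creations.sort_values("created_at"),
    left_on="deleted_at", right_on="created_at",
    by="title", direction="backward",
)
matched["time_to_deletion"] = matched["deleted_at"] - matched["created_at"]
print(matched[["title", "time_to_deletion"]])  # 9 days 04:00:00 for the toy row
```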

Key metrics

  • Deletions per 10k pages by topic and month (see the normalization sketch after this list).
  • Reason distribution (% G11 spam, % A7 notability, etc.).
  • Median time-to-deletion for newly created pages (where derivable).
  • Restore rate and time-to-restore; frequency of move-to-draft before deletion.
  • Titles with repeated actions (move→delete, delete→restore cycles).
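
A minimal sketch of the per-10k normalization; numbers and column names are placeholders, and topic sizes come from the approximate current-category counts noted in the caveats:

```python
import pandas as pd

# Illustrative monthly counts per (topic, month) and approximate topic sizes.
monthly = pd.DataFrame({
    "topic": ["public_health", "public_health"],
    "month": ["2025-08", "2025-09"],
    "deletions": [42, 57],
})
topic_sizes = pd.DataFrame({"topic": ["public_health"], "n_pages": [18_000]})

rates = monthly.merge(topic_sizes, on="topic")
rates["deletions_per_10k"] = rates["deletions"] / rates["n_pages"] * 10_000
print(rates[["topic", "month", "deletions_per_10k"]])
```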

Deliverables

  • Clean tables (deletion_logs.parquet, reasons.parquet, topic_monthly.parquet).
  • Reproducible notebook + reports/deletion_notability_bias.md.
  • Streamlit/Altair dashboard (trend lines, stacked reasons, topic comparison).

Caveats & limitations

  • Logs can be redacted; comments may not include standardized policy codes; deleted content is not accessible without special rights.
  • Topic assignment via titles/keywords is approximate—validate with a labeled sample and report precision/recall.
  • Normalization by current category counts is an approximation of topic size; note this in methods.

Implementation notes

  • Maintain a reason-mapping JSON (regex → canonical policy reason); an illustrative example follows this list.
  • Retry transient HTTP errors with backoff; fail fast with a traceback on dtype mismatches; log warnings for recoverable issues.
  • Save run config, git commit hash, and seed in each artifact’s metadata for reproducibility.
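
An illustrative shape for that mapping file, mirroring the parser sketch earlier; the patterns and labels are placeholders:

```json
{
  "\\bG11\\b|unambiguous (advertising|promotion)": "G11_spam",
  "\\bA7\\b|no (indication|claim) of (importance|significance)": "A7_notability",
  "copyvio|copyright violation|\\bG12\\b": "copyvio"
}
```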
