
MediaWiki API Project: Watcher/Attention Symmetry

Open chinaexpert1 opened this issue 4 months ago • 0 comments

Overview

Quantify watcher/attention asymmetry on sensitive Wikipedia pages by comparing demand signals (pageviews) to supply/oversight signals (watchers*, edits, unique editors, talk-page activity). Deliver a reproducible dataset and dashboard that spotlight pages with high demand but low oversight, and vice versa.

*Watcher counts are sometimes restricted; we'll fall back to proxy metrics when unavailable.

Action Items

If this is the beginning (research & design)

  • Scope: 100–300 English Wikipedia pages across elections, policing, migration, public health, human rights, climate. Save as seed_pages.csv (title,pageid,qid,topic).
  • Metrics & windows (suggested): monthly from 2019 → present.
  • Attention metrics: pageviews (REST API), external referrals (optional), trending spikes.
  • Oversight metrics: watchers* (if available), edits/month, unique editors/month, revert ratio, talk-page edits/month.
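  • Define an Attention Gap Index (AGI): normalized demand ÷ normalized oversight. Percentile ranks suit the ratio; z-scores can be zero or negative, so combine them as a difference (demand_z − oversight_z) instead.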
  • Libraries (pick pairs): requests or httpx; pandas or polars; duckdb or sqlite; altair or plotly.
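
To make the AGI definition above concrete, here is a minimal sketch using percentile ranks and the standard library only. The function names, the tie handling (average rank), and the value of eps are assumptions for illustration, not a settled design.

```python
from collections import defaultdict


def percentile_ranks(values):
    """Scale each value's average rank into (0, 1]; tied values share a rank."""
    n = len(values)
    positions = defaultdict(list)
    for pos, v in enumerate(sorted(values), start=1):
        positions[v].append(pos)
    return [sum(positions[v]) / len(positions[v]) / n for v in values]


def attention_gap_index(demand, oversight, eps=1e-9):
    """AGI per page: demand rank divided by oversight rank (eps guards the ratio)."""
    d = percentile_ranks(demand)
    o = percentile_ranks(oversight)
    return [di / (oi + eps) for di, oi in zip(d, o)]


# Hypothetical month: four pages, views rising while edits fall.
views = [100, 200, 300, 400]
edits = [40, 30, 20, 10]
agi = attention_gap_index(views, edits)
```

Because ranks are strictly positive, the ratio never divides by zero or flips sign, which is the main reason to prefer ranks over z-scores for a ratio-style index.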

If researched and ready (implementation steps)

  1. Seed & resolve

    • Resolve pageid via action=query&titles=<title>&redirects=1 and record any redirect target; fetch the Wikidata QID with prop=pageprops&ppprop=wikibase_item; store (title,pageid,qid,topic).
  2. Demand (attention) pulls

    • Pageviews: Wikimedia REST pageviews/per-article monthly series for each page; compute level, volatility, and spikes (e.g., months above the page's 95th percentile).
  3. Oversight pulls

    • Try prop=info&inprop=watchers|visitingwatchers (note: the fields may be restricted and simply absent from the response; store null rather than 0).
    • Always compute proxies: monthly edits (prop=revisions), unique editors, revert ratio (comment/tag heuristics), and talk-page edits (Talk:<title>).
  4. Features & index

    • Normalize each metric per-month (z-scores or percentile ranks within topic).
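    • AGI = demand_norm / (oversight_norm + ε). The ratio assumes non-negative inputs such as percentile ranks; with z-scores, which can be zero or negative, use the difference demand_z − oversight_z instead.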
    • Label pages with persistent high AGI (e.g., top decile for ≥3 months).
  5. Deliver

    • Artifacts: pageviews.parquet, rev_monthly.parquet, talk_monthly.parquet, watchers_snapshot.parquet (when available), agi_monthly.parquet, metrics.csv.
    • Dashboard: per-page timelines (views vs edits), AGI heatmap (pages × months), top-pages table with filters (topic, month).
    • Methods README: data gaps, watcher restrictions, formulas, and caveats.
  6. Quality & Ops

    • Caching and retries with exponential backoff; honor maxlag.
    • Tests: continuation handling, talk-page mapping, AGI stability (no NaNs), revert heuristics.
    • Error handling: try/except + helpful print() on I/O; terminate with a trace on dtype mismatches; log warnings for recoverables.
    • Optional: monthly refresh via GitHub Actions.
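
For the retry and maxlag item above, a minimal sketch of one way to wrap API calls. It assumes the caller supplies a fetch(params) function that returns decoded JSON and raises RetryableError on transient failures; both of those names, the maxlag value of 5, and the delay schedule are hypothetical choices. The Action API reports lag with an error whose code is "maxlag".

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical: raised by the caller's fetch() for timeouts, 5xx, etc."""


def call_with_backoff(fetch, params, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(params) with maxlag set, retrying with exponential backoff."""
    params = {**params, "maxlag": 5}
    for attempt in range(max_tries):
        try:
            data = fetch(params)
        except RetryableError:
            data = None
        # Anything other than a transient failure or a maxlag error is returned
        # to the caller, including other API errors.
        if data is not None and data.get("error", {}).get("code") != "maxlag":
            return data
        if attempt == max_tries - 1:
            break
        delay = base_delay * 2 ** attempt
        sleep(delay + random.uniform(0, delay / 10))  # small jitter
    raise RuntimeError(f"gave up after {max_tries} attempts")
```

Injecting sleep keeps the wrapper testable without real waiting; in production the default time.sleep applies.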

Resources/Instructions

API docs to pin in repo

MediaWiki Action API (overview): https://www.mediawiki.org/wiki/API:Action_API
Query & continuation: https://www.mediawiki.org/wiki/API:Query
Revisions (timestamps, users, tags, comments): https://www.mediawiki.org/wiki/API:Revisions
Page info (watchers*, protection): https://www.mediawiki.org/wiki/API:Info

Wikimedia REST (pageviews)

Pageviews per-article: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews

Suggested libraries (choose pairs)

  • HTTP: requests | httpx
  • DataFrames: pandas | polars
  • Storage: duckdb | sqlite
  • Viz: altair | plotly

Sample queries (copy to notes)

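# Resolve pageids (redirects=1 follows redirects; wikibase_item is the Wikidata QID)
action=query&titles=<TITLE>&redirects=1&prop=pageprops&ppprop=wikibase_item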

# Revisions in a window, newest first (rvstart is the later timestamp; follow rvcontinue for more; aggregate monthly from timestamps)
action=query&prop=revisions&titles=<TITLE>&rvprop=timestamp|user|comment|tags&rvlimit=max&rvstart=<ISO_END>&rvend=<ISO_START>

# Talk page revisions (oversight proxy)
action=query&prop=revisions&titles=Talk:<TITLE>&rvprop=timestamp|user|comment|tags&rvlimit=max&rvstart=<ISO_END>&rvend=<ISO_START>

# Page info (watchers*, may be restricted)
action=query&prop=info&inprop=watchers|visitingwatchers&titles=<TITLE>

# Pageviews (REST): monthly; base URL https://wikimedia.org/api/rest_v1, <START>/<END> as YYYYMMDD
/metrics/pageviews/per-article/en.wikipedia/all-access/user/<URL_ENCODED_TITLE>/monthly/<START>/<END>
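
The sample queries above can be turned into full request URLs along these lines. This is a sketch: the helper names are hypothetical, and the base URLs assume English Wikipedia's Action API endpoint and the public Wikimedia REST host.

```python
from urllib.parse import quote, urlencode

API = "https://en.wikipedia.org/w/api.php"
REST = "https://wikimedia.org/api/rest_v1"


def pageviews_url(title, start, end, project="en.wikipedia"):
    """Monthly per-article pageviews; start/end are YYYYMMDD strings."""
    # Titles use underscores; safe="" also encodes "/" inside a title.
    article = quote(title.replace(" ", "_"), safe="")
    return (f"{REST}/metrics/pageviews/per-article/{project}"
            f"/all-access/user/{article}/monthly/{start}/{end}")


def revisions_url(title, iso_start, iso_end):
    """Revisions in [iso_start, iso_end], listed newest first."""
    params = {
        "action": "query", "format": "json", "formatversion": "2",
        "prop": "revisions", "titles": title,
        "rvprop": "timestamp|user|comment|tags",
        "rvlimit": "max",
        "rvstart": iso_end,   # later bound comes first with the default direction
        "rvend": iso_start,
    }
    return f"{API}?{urlencode(params)}"
```

Passing "Talk:" + title to revisions_url covers the talk-page query for main-namespace articles.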

Notes & ethics

  • Watcher counts may be unavailable or redacted; treat them as optional and prefer proxies (edits, unique editors, talk activity).

  • Aggregate reporting only; no editor-level profiling.

  • Spikes in attention don’t imply low quality or controversy—interpret in context.

  • If this issue requires access to 311 data, please answer the following questions:

    • Not applicable.

Project Outline (detailed plan for this idea):

Research question: Which sensitive-topic pages have high demand (views) but low oversight (watchers/edits/talk activity), and how persistent are these gaps?

Data sources & modules

  • Action API: prop=info (watchers*), prop=revisions (edits, tags), action=query for page/talk mapping.
  • REST: pageviews per-article (monthly).

Method

  1. Build seed_pages.csv; resolve pageid and associated Talk: titles.
  2. Pull monthly pageviews series.
  3. Pull monthly revision aggregates for article and talk page: total edits, unique editors, revert ratio.
  4. Where available, snapshot watchers/visitingwatchers; record nulls when restricted.
  5. Normalize metrics; compute AGI and flag top-decile gaps; compute persistence (months above threshold).
  6. Visualize AGI heatmap and per-page timelines; produce topic-level summaries.
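
Step 3 of the method could be sketched as below, taking revision dicts shaped like the rvprop=timestamp|user|comment|tags output. The tag names are MediaWiki's built-in revert tags; the comment regex and the function names are assumptions and will produce both false positives and false negatives.

```python
import re
from collections import defaultdict
from datetime import datetime, timezone

REVERT_TAGS = {"mw-undo", "mw-rollback", "mw-manual-revert"}
# Assumed comment heuristic; tune against a hand-labeled sample.
REVERT_COMMENT = re.compile(r"\b(?:rv|revert(?:ed|ing|s)?|undid|undo)\b", re.I)


def is_revert(rev):
    """Heuristic: a revert tag is present or the edit summary looks like a revert."""
    if REVERT_TAGS & set(rev.get("tags", [])):
        return True
    return bool(REVERT_COMMENT.search(rev.get("comment") or ""))


def monthly_aggregates(revisions):
    """Return {"YYYY-MM": {edits, unique_editors, revert_ratio}} keyed by UTC month."""
    buckets = defaultdict(lambda: {"edits": 0, "editors": set(), "reverts": 0})
    for rev in revisions:
        ts = datetime.strptime(rev["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        month = ts.replace(tzinfo=timezone.utc).strftime("%Y-%m")
        b = buckets[month]
        b["edits"] += 1
        if "user" in rev:  # the user field can be hidden on some revisions
            b["editors"].add(rev["user"])
        b["reverts"] += is_revert(rev)
    return {
        m: {"edits": b["edits"],
            "unique_editors": len(b["editors"]),
            "revert_ratio": b["reverts"] / b["edits"]}
        for m, b in sorted(buckets.items())
    }
```

Only counts leave this function, which keeps the output consistent with the aggregate-only reporting rule.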

Key metrics

  • Demand: views/month, spike index (views ÷ rolling median).
  • Oversight: edits/month, unique editors/month, revert ratio, talk edits/month, watchers* (optional).
  • AGI and persistence: % months in top decile; max consecutive months.
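
A rough sketch of the spike index and the two persistence measures listed above. The window length, and the choice to take the median over the preceding months only (excluding the current one), are assumptions; the issue does not fix either.

```python
from statistics import median


def spike_index(views, window=6):
    """views[i] divided by the median of the previous `window` months.

    Returns None where there is not enough history or the median is zero.
    """
    out = []
    for i, v in enumerate(views):
        history = views[max(0, i - window):i]
        if len(history) < window:
            out.append(None)
            continue
        base = median(history)
        out.append(v / base if base > 0 else None)
    return out


def persistence(flags):
    """Return (share of months flagged, longest consecutive flagged run)."""
    longest = run = 0
    for flagged in flags:
        run = run + 1 if flagged else 0
        longest = max(longest, run)
    share = sum(flags) / len(flags) if flags else 0.0
    return share, longest
```

Here flags would be one page's monthly booleans for "AGI in the top decile", so the second value maps directly to "max consecutive months".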

Deliverables

  • Clean tables (pageviews.parquet, rev_monthly.parquet, talk_monthly.parquet, agi_monthly.parquet).
  • Notebook + reports/attention_asymmetry.md.
  • Streamlit/Altair dashboard with filters (topic, month, persistence threshold).

Caveats & limitations

  • Watchers often unavailable; proxies become primary.
  • Views can be news-driven and short-lived; report with spike-aware context.
  • Revert detection via tags/comments is heuristic; document false-negative/false-positive risk.

Implementation notes

  • Keys: (pageid, month); ensure UTC timestamps.
  • Persist query manifests and raw JSON; use polite throttling and continue tokens.
  • Use print() diagnostics on network errors; raise on dtype mismatches to avoid silent corruption.
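
For the continue tokens mentioned above, a minimal sketch of the usual Action API loop: merge the "continue" object from each response into the next request until it no longer appears. The fetch(params) callable is a hypothetical stand-in for whatever HTTP client is chosen.

```python
def query_all(fetch, params):
    """Yield the "query" part of each batch, following continuation tokens."""
    last_continue = {}
    while True:
        # Continuation values (e.g. rvcontinue) are merged into the original request.
        result = fetch({**params, **last_continue})
        if "error" in result:
            raise RuntimeError(result["error"])
        if "query" in result:
            yield result["query"]
        if "continue" not in result:
            break
        last_continue = result["continue"]
```

Each yielded batch can be written to disk as raw JSON before any parsing, which matches the note about persisting query manifests and raw responses.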

chinaexpert1 · Sep 13 '25 22:09