MediaWiki API Project: Watcher/Attention Symmetry
Overview
Quantify watcher/attention asymmetry on sensitive Wikipedia pages by comparing demand signals (pageviews) to supply/oversight signals (watchers*, edits, unique editors, talk-page activity). Deliver a reproducible dataset and dashboard that spotlight pages with high demand but low oversight, and vice versa.
*Watcher counts are sometimes restricted; we’ll fall back to proxy metrics when unavailable.
Action Items
If this is the beginning (research & design)
- Scope: 100–300 English Wikipedia pages across elections, policing, migration, public health, human rights, climate. Save as seed_pages.csv (title, pageid, qid, topic).
- Metrics & windows (suggested): monthly from 2019 → present.
- Attention metrics: pageviews (REST API), external referrals (optional), trending spikes.
- Oversight metrics: watchers* (if available), edits/month, unique editors/month, revert ratio, talk-page edits/month.
- Define an Attention Gap Index (AGI): demand ÷ oversight (normalized ranks or z-scores).
- Libraries (pick pairs): requests or httpx; pandas or polars; duckdb or sqlite; altair or plotly.
If researched and ready (implementation steps)
- Seed & resolve
  - Resolve pageid via action=query&titles=<title>, track redirects; store (title, pageid, qid, topic).
- Demand (attention) pulls
  - Pageviews: Wikimedia REST pageviews/per-article monthly series for each page; compute level, volatility, and spikes (e.g., 95th pct).
- Oversight pulls
  - Try prop=info&inprop=watchers|visitingwatchers (note: may be restricted/not returned).
  - Always compute proxies: monthly edits (prop=revisions), unique editors, revert ratio (comment/tag heuristics), and talk-page edits (Talk:<title>).
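- Features & index
  - Normalize each metric per month (z-scores or percentile ranks within topic).
  - AGI = demand_norm / (oversight_norm + ε), where ε guards against division by zero. The ratio is only well-behaved for positive inputs such as percentile ranks; with z-scores, which can be zero or negative, use the difference demand_norm − oversight_norm instead.
  - Label pages with persistent high AGI (e.g., top decile for ≥3 months).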
- Deliver
  - Artifacts: pageviews.parquet, rev_monthly.parquet, talk_monthly.parquet, watchers_snapshot.parquet (when available), agi_monthly.parquet, metrics.csv.
  - Dashboard: per-page timelines (views vs edits), AGI heatmap (pages × months), top-pages table with filters (topic, month).
  - Methods README: data gaps, watcher restrictions, formulas, and caveats.
- Quality & Ops
  - Caching and retries with exponential backoff; honor maxlag.
  - Tests: continuation handling, talk-page mapping, AGI stability (no NaNs), revert heuristics.
  - Error handling: try/except + helpful print() on I/O; terminate with a trace on dtype mismatches; log warnings for recoverables.
  - Optional: monthly refresh via GitHub Actions.
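The retry behaviour under Quality & Ops could look roughly like the sketch below. It is illustrative only: `fetch`, `RetryableError`, and `backoff_delays` are hypothetical names, and the `maxlag=5` value and jitter range are assumptions rather than project decisions. The HTTP call is injected so the logic can be tested without network access.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical error type: raised by the caller's fetch function
    for failures worth retrying (timeouts, HTTP 5xx, a maxlag error)."""


def backoff_delays(max_attempts=5, base=1.0, cap=60.0):
    """Un-jittered wait in seconds before each retry: base * 2**n, capped."""
    return [min(cap, base * 2 ** n) for n in range(max_attempts - 1)]


def fetch_with_retries(fetch, params, max_attempts=5, base=1.0, sleep=time.sleep):
    """Call fetch(params) until it succeeds or attempts run out.

    `fetch` is a hypothetical callable returning parsed JSON.
    """
    # Assumed value: ask the server to refuse the request when replicas lag.
    params = {**params, "maxlag": 5}
    delays = backoff_delays(max_attempts, base)
    for attempt in range(max_attempts):
        try:
            return fetch(params)
        except RetryableError as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error with its trace
            wait = delays[attempt] + random.uniform(0, 0.5)  # small jitter
            print(f"retry {attempt + 1}/{max_attempts - 1} in {wait:.1f}s: {exc}")
            sleep(wait)
```

In a real client, `fetch` would wrap a requests or httpx call and translate a maxlag error response into `RetryableError`.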
Resources/Instructions
API docs to pin in repo
MediaWiki Action API (overview): https://www.mediawiki.org/wiki/API:Action_API
Query & continuation: https://www.mediawiki.org/wiki/API:Query
Revisions (timestamps, users, tags, comments): https://www.mediawiki.org/wiki/API:Revisions
Page info (watchers*, protection): https://www.mediawiki.org/wiki/API:Info
Wikimedia REST (pageviews)
Pageviews per-article: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
Suggested libraries (choose pairs)
- HTTP: requests | httpx
- DataFrames: pandas | polars
- Storage: duckdb | sqlite
- Viz: altair | plotly
Sample queries (copy to notes)
# Resolve pageids
action=query&titles=<TITLE>
# Latest/oldest revisions in a window (monthly aggregation from timestamps)
action=query&prop=revisions&titles=<TITLE>&rvprop=timestamp|user|comment|tags&rvlimit=max&rvstart=<ISO_END>&rvend=<ISO_START>
# Talk page revisions (oversight proxy)
action=query&prop=revisions&titles=Talk:<TITLE>&rvprop=timestamp|user|comment|tags&rvlimit=max&rvstart=<ISO_END>&rvend=<ISO_START>
# Page info (watchers*, may be restricted)
action=query&prop=info&inprop=watchers|visitingwatchers&titles=<TITLE>
# Pageviews (REST): monthly
/metrics/pageviews/per-article/en.wikipedia/all-access/user/<URL_ENCODED_TITLE>/monthly/<START>/<END>
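As a rough illustration, the sample queries above can be assembled with the standard library. The base URLs and the helper names (`pageviews_url`, `revisions_params`, `action_url`) are assumptions made for this sketch, and `formatversion=2` is an optional choice, not something the queries above require.

```python
from urllib.parse import quote, urlencode

# Assumed endpoints for English Wikipedia and the Wikimedia REST API.
ACTION_API = "https://en.wikipedia.org/w/api.php"
REST_BASE = "https://wikimedia.org/api/rest_v1"


def pageviews_url(title, start, end, project="en.wikipedia"):
    """Monthly per-article pageviews URL; start/end are YYYYMMDD strings."""
    # Spaces become underscores, then the whole title is percent-encoded
    # (including "/") so it stays a single path segment.
    encoded = quote(title.replace(" ", "_"), safe="")
    return (f"{REST_BASE}/metrics/pageviews/per-article/{project}/"
            f"all-access/user/{encoded}/monthly/{start}/{end}")


def revisions_params(title, iso_start, iso_end, talk=False):
    """Action API parameters for revisions of a page (or its talk page)."""
    return {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "titles": f"Talk:{title}" if talk else title,
        "rvprop": "timestamp|user|comment|tags",
        "rvlimit": "max",
        "rvstart": iso_end,   # results run newest to oldest by default
        "rvend": iso_start,
    }


def action_url(params):
    """Full Action API URL with a percent-encoded query string."""
    return f"{ACTION_API}?{urlencode(params)}"
```

For example, `action_url(revisions_params("Climate change", "2019-01-01T00:00:00Z", "2019-02-01T00:00:00Z", talk=True))` yields the talk-page query for January 2019.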
Notes & ethics
- Watcher counts may be unavailable or redacted; treat them as optional and prefer proxies (edits, unique editors, talk activity).
- Aggregate reporting only; no editor-level profiling.
- Spikes in attention don’t imply low quality or controversy; interpret in context.
Project Outline (detailed plan for this idea):
Research question
Which sensitive-topic pages have high demand (views) but low oversight (watchers/edits/talk activity), and how persistent are these gaps?
Data sources & modules
- Action API: prop=info (watchers*), prop=revisions (edits, tags), action=query for page/talk mapping.
- REST: pageviews per-article (monthly).
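Because watcher counts may simply be absent from a prop=info response, the parsing step needs to record nulls rather than fail. A minimal sketch follows, assuming the request used formatversion=2 (so that pages arrive as a list); `parse_watchers` and its output columns are hypothetical names.

```python
def parse_watchers(response, snapshot_date):
    """Flatten a prop=info response into rows for a watcher snapshot.

    Assumes formatversion=2. When the API omits `watchers` or
    `visitingwatchers`, the row stores None so the gap stays visible.
    """
    rows = []
    for page in response.get("query", {}).get("pages", []):
        rows.append({
            "pageid": page.get("pageid"),
            "title": page.get("title"),
            "snapshot_date": snapshot_date,
            "watchers": page.get("watchers"),
            "visitingwatchers": page.get("visitingwatchers"),
        })
    return rows


# Illustrative response shape with made-up numbers.
example = {"query": {"pages": [
    {"pageid": 1, "title": "Page A", "watchers": 120, "visitingwatchers": 15},
    {"pageid": 2, "title": "Page B"},  # counts withheld
]}}
snapshot = parse_watchers(example, "2024-01-01")
```

Keeping the null rows lets the Methods README report how many pages lacked watcher data instead of silently dropping them.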
Method
- Build seed_pages.csv; resolve pageid and associated Talk: titles.
- Pull monthly pageviews series.
- Pull monthly revision aggregates for article and talk page: total edits, unique editors, revert ratio.
- Where available, snapshot watchers/visitingwatchers; record nulls when restricted.
- Normalize metrics; compute AGI and flag top-decile gaps; compute persistence (months above threshold).
- Visualize AGI heatmap and per-page timelines; produce topic-level summaries.
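The normalization and AGI step in the method above might be sketched as follows. This version uses percentile ranks within each (topic, month) group so the ratio stays positive, and it takes edits as the only oversight signal for brevity; both are simplifying assumptions. The function names and row fields are hypothetical, and plain Python is used here only to keep the example self-contained (pandas or polars would be the natural fit in the project).

```python
from collections import defaultdict


def percentile_ranks(values):
    """Map each value to a rank in (0, 1]; ties share the average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank across the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg / n
        i = j + 1
    return ranks


def compute_agi(rows, eps=1e-6):
    """rows: dicts with pageid, topic, month, views, edits.

    Returns copies of the rows with an added 'agi' field:
    demand rank divided by (oversight rank + eps).
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["topic"], row["month"])].append(row)
    out = []
    for group in groups.values():
        demand = percentile_ranks([r["views"] for r in group])
        oversight = percentile_ranks([r["edits"] for r in group])
        for r, d, o in zip(group, demand, oversight):
            out.append({**r, "agi": d / (o + eps)})
    return out
```

A page with the most views and the fewest edits in its group gets the highest AGI; a composite oversight score would replace the single `edits` rank in a fuller version.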
Key metrics
- Demand: views/month, spike index (views / rolling median).
- Oversight: edits/month, unique editors/month, revert ratio, talk edits/month, watchers* (optional).
- AGI and persistence: % months in top decile; max consecutive months.
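The spike index and persistence measures listed above could be computed along these lines. The trailing window (the median of the previous `window` months, excluding the current one) and its default length are assumptions; the source only specifies "views / rolling median". Function names are hypothetical.

```python
from statistics import median


def spike_index(views, window=6):
    """views[i] divided by the median of the previous `window` months.

    Returns None where there is not enough history or the median is zero.
    """
    out = []
    for i, v in enumerate(views):
        history = views[max(0, i - window):i]
        if len(history) < window:
            out.append(None)
            continue
        base = median(history)
        out.append(v / base if base else None)
    return out


def persistence(flags):
    """flags: per-month booleans (page in the top AGI decile that month).

    Returns (share of months flagged, longest consecutive flagged run).
    """
    longest = run = 0
    for flagged in flags:
        run = run + 1 if flagged else 0
        longest = max(longest, run)
    share = sum(flags) / len(flags) if flags else 0.0
    return share, longest
```

With the suggested rule of "top decile for ≥3 months", a page would be labeled persistent when the second value returned by `persistence` is at least 3.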
Deliverables
- Clean tables (pageviews.parquet, rev_monthly.parquet, talk_monthly.parquet, agi_monthly.parquet).
- Notebook + reports/attention_asymmetry.md.
- Streamlit/Altair dashboard with filters (topic, month, persistence threshold).
Caveats & limitations
- Watchers often unavailable; proxies become primary.
- Views can be news-driven and short-lived; report with spike-aware context.
- Revert detection via tags/comments is heuristic; document false-negative/false-positive risk.
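To make the revert caveat concrete, here is one possible tag-and-comment heuristic together with the monthly aggregation it feeds. The tag set and the regular expression are assumptions chosen for illustration; they will miss some reverts and flag some ordinary edits, which is exactly the risk to document. `is_revert` and `monthly_aggregates` are hypothetical names.

```python
import re
from collections import defaultdict

# Assumed heuristic inputs: MediaWiki change tags commonly attached to
# reverting edits, plus a loose pattern for edit summaries.
REVERT_TAGS = {"mw-undo", "mw-rollback", "mw-manual-revert"}
REVERT_COMMENT = re.compile(
    r"\b(revert(ed|ing|s)?|rvv?|undid revision)\b", re.IGNORECASE
)


def is_revert(rev):
    """True if the revision looks like a revert by tag or by comment."""
    if REVERT_TAGS & set(rev.get("tags", [])):
        return True
    return bool(REVERT_COMMENT.search(rev.get("comment", "")))


def monthly_aggregates(revisions):
    """Group revision dicts by month (YYYY-MM) and summarize them.

    Expects ISO 8601 UTC timestamps as returned with rvprop=timestamp.
    Revisions with a hidden user are counted as edits but not as editors.
    """
    buckets = defaultdict(list)
    for rev in revisions:
        buckets[rev["timestamp"][:7]].append(rev)
    result = {}
    for month, revs in sorted(buckets.items()):
        reverts = sum(is_revert(r) for r in revs)
        result[month] = {
            "edits": len(revs),
            "unique_editors": len({r["user"] for r in revs if r.get("user")}),
            "revert_ratio": reverts / len(revs),
        }
    return result
```

The same function works for talk-page revisions, which gives the talk edits/month series without extra code.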
Implementation notes
- Keys: (pageid, month); ensure UTC timestamps.
- Persist query manifests and raw JSON; use polite throttling and continue tokens.
- Use print() diagnostics on network errors; raise on dtype mismatches to avoid silent corruption.
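Continuation handling from the notes above can be sketched as a small generator. `query_all` and the injected `fetch` callable are hypothetical; the loop follows the usual Action API pattern of merging the returned `continue` object into the original parameters until no `continue` key comes back.

```python
def query_all(fetch, params):
    """Yield every response batch, following `continue` tokens.

    `fetch` is a hypothetical callable that sends the parameters to the
    Action API and returns the parsed JSON response as a dict.
    """
    request = dict(params)
    while True:
        response = fetch(request)
        if "error" in response:
            # Fail loudly instead of returning a partial result set.
            raise RuntimeError(f"API error: {response['error']}")
        yield response
        if "continue" not in response:
            break
        # Merge continuation values into the original parameters.
        request = {**params, **response["continue"]}
```

Each yielded batch can be written to disk as raw JSON before parsing, which also covers the "persist raw JSON" note.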