
MediaWiki API Project: External Links to Sensitive Industries


Overview

Map the external link economy on Wikipedia for sensitive industries (e.g., payday lenders, crypto exchanges, gig platforms) using the MediaWiki Action API. Quantify which domains get linked, where links appear, and how often links are added vs. removed over time.

Action Items

If this is the beginning (research & design)

  • Define scope & taxonomy: three focus buckets (payday lending, crypto exchanges, gig platforms). Create domains.yaml with canonical domains and regex variants (e.g., subdomains, tracking params); see the sketch after this list.
  • Coverage & windows: English Wikipedia, Namespace 0 (articles). Time horizon: monthly snapshots from 2019 → present (adjustable).
  • Metrics: link count, unique pages per domain, adds vs. removals, link “half-life,” top pages & categories linking to each domain, per-topic concentration.
  • Methods: use list=exturlusage to enumerate where a domain is linked (current state). For change over time, fetch revision history for those pages and diff URLs across monthly waypoints.
  • Tooling (pick pairs and keep consistent): requests or httpx; pandas or polars; duckdb or sqlite; altair or plotly; parsing via mwparserfromhell or robust URL regex.
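
A minimal sketch of what domains.yaml and its loader could look like, assuming a bucket → entries layout with canonical and variants keys; the bucket names, example domains, and patterns below are placeholders, not real targets.

# Hypothetical domains.yaml schema (embedded here as a string) and a loader that
# pre-compiles the regex variants per bucket. Example domains are placeholders.
import re
import yaml  # pip install pyyaml

DOMAINS_YAML = r"""
payday_lending:
  - canonical: example-payday.com
    variants: ['(^|\.)example-payday\.com$']
crypto_exchanges:
  - canonical: example-exchange.com
    variants: ['(^|\.)example-exchange\.com$']
gig_platforms:
  - canonical: example-gig.com
    variants: ['(^|\.)example-gig\.com$']
"""

def load_domains(text):
    """Return {bucket: [entry, ...]} with a 'compiled' list of regexes added to each entry."""
    buckets = yaml.safe_load(text)
    for entries in buckets.values():
        for entry in entries:
            entry["compiled"] = [re.compile(p, re.IGNORECASE) for p in entry.get("variants", [])]
    return buckets

# Usage: any(p.search("pay.example-payday.com") for p in load_domains(DOMAINS_YAML)["payday_lending"][0]["compiled"])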

If researched and ready (implementation steps)

  1. Ingest (current links)

    • For each domain pattern in domains.yaml, call list=exturlusage with euquery=<domain>, euprotocol=https, eunamespace=0, paginate with continue (a request sketch follows this list).
    • Persist: (domain, pageid, title, url, first_seen_ts=null for now).
  2. Historical change detection

    • For each (pageid, domain) pair, pull revision timestamps over the analysis window via prop=revisions&rvprop=ids|timestamp|comment|size&rvlimit=max&rvstart/rvend.
    • Sampling strategy to control load: oldest + newest revision per month (or every Nth revision). For each sampled rev, fetch content (rvslots=main&rvprop=content) and extract external URLs; mark presence/absence of the domain.
    • Derive add/remove events by comparing adjacent sampled revisions; estimate first_seen_ts and last_seen_ts per (pageid, domain).
  3. Topic/context enrichment

    • Fetch categories for pages (prop=categories&clshow=!hidden) to roll up into coarse topics (finance, labor, politics). Save page_topics.parquet.
  4. Feature engineering

    • Per domain and month: total links, unique pages, additions, removals, net change, survival % (share of links persisting ≥90 days), Gini/entropy of topic distribution.
  5. Analysis

    • Identify top-linked domains per bucket; pages driving the most additions; spikes/declines; domains with high churn vs. durable links.
  6. Deliver

    • Artifacts: extlinks_current.parquet, extlinks_events.parquet, domain_monthly.parquet, page_topics.parquet, metrics.csv.
    • Dashboard: domain trends, adds vs. removals, top pages, topic distribution.
    • Methods README with API params, sampling trade-offs, and limitations.
  7. Quality & Ops

    • Caching, retries with exponential backoff; store raw JSON.
    • Tests: URL extractor correctness, diff logic, pagination continuity.
    • Optional: monthly GitHub Action to refresh deltas.
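
A sketch of the ingest call in step 1, folding in the backoff and continuation points from step 7. The endpoint, User-Agent string, and helper name are assumptions; the eu* parameters mirror the ones listed above.

# Enumerate current external-link usage for one domain, with continuation and
# maxlag backoff. Endpoint and contact address are placeholders.
import time
import requests

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "extlink-economy-research/0.1 (contact: example@example.org)"}

def exturlusage(domain, max_retries=5):
    """Yield (pageid, title, url) for every article currently linking to `domain`."""
    params = {
        "action": "query",
        "list": "exturlusage",
        "euquery": domain,
        "euprotocol": "https",
        "eunamespace": 0,
        "eulimit": "max",
        "format": "json",
        "formatversion": 2,
        "maxlag": 5,  # API etiquette: back off when replication lags
    }
    while True:
        for attempt in range(max_retries):
            data = requests.get(API, params=params, headers=HEADERS, timeout=30).json()
            if data.get("error", {}).get("code") == "maxlag":
                time.sleep(2 ** attempt)  # exponential backoff before retrying
                continue
            break
        for row in data.get("query", {}).get("exturlusage", []):
            yield row["pageid"], row["title"], row["url"]
        if "continue" not in data:
            return
        params.update(data["continue"])  # persist continuation tokens between requests

# Usage: rows = list(exturlusage("example.com"))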

Resources/Instructions

API docs to pin in repo

  • Action API overview: API:Action_API
  • External link usage (sitewide by domain): API:Exturlusage
  • Revisions (timestamps, content): API:Revisions
  • Categories: API:Categories
  • Query continuation & etiquette (maxlag): API:Query

Suggested libraries (choose pairs)

  • HTTP: requests | httpx
  • DataFrames: pandas | polars
  • Storage: duckdb | sqlite
  • Viz: altair | plotly
  • Parsing: mwparserfromhell | regex + urllib.parse
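
Where the parsing choice above matters most is URL extraction from wikitext. A sketch combining mwparserfromhell's external-link filter with a regex sweep that also catches url= parameters inside citation templates; the regex is deliberately simple and the function names are illustrative.

# Extract external URLs from wikitext: bare links via mwparserfromhell,
# plus a regex sweep over the raw text for URLs nested in templates.
import re
from urllib.parse import urlparse
import mwparserfromhell  # pip install mwparserfromhell

URL_RE = re.compile(r"https?://[^\s|\]\}<>\"]+")

def extract_urls(wikitext):
    urls = set()
    code = mwparserfromhell.parse(wikitext)
    for link in code.filter_external_links():
        urls.add(str(link.url))
    urls.update(URL_RE.findall(wikitext))  # catches |url= inside citation templates
    return urls

def hosts(urls):
    """Reduce URLs to lowercase hostnames for matching against domains.yaml patterns."""
    return {urlparse(u).netloc.lower() for u in urls}

# Usage:
# hosts(extract_urls("See [https://example.com/x ref] {{cite web|url=https://example.com/y}}"))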

Sample queries (copy to notes)

# Where is a domain linked (current state)?
action=query&list=exturlusage&euquery=example.com&euprotocol=https&eunamespace=0&eulimit=max

# Revisions for a page in a window (ids + timestamps)
action=query&prop=revisions&rvprop=ids|timestamp|comment|size&rvlimit=max&rvstart=2025-09-01T00:00:00Z&rvend=2019-01-01T00:00:00Z&pageids=<PAGEID>

# Fetch content for a specific revision (to extract URLs)
action=query&prop=revisions&revids=<REVID>&rvslots=main&rvprop=content
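
The second query above, wrapped as a Python call with continuation handling. The endpoint and default window are taken from the example; the function name and timeout are illustrative.

# List revision ids and timestamps for one page within the analysis window.
import requests

API = "https://en.wikipedia.org/w/api.php"

def list_revisions(pageid, start="2025-09-01T00:00:00Z", end="2019-01-01T00:00:00Z"):
    """Yield (revid, timestamp), newest to oldest (the API's default direction)."""
    params = {
        "action": "query",
        "prop": "revisions",
        "pageids": pageid,
        "rvprop": "ids|timestamp|comment|size",
        "rvlimit": "max",
        "rvstart": start,  # newer bound; enumeration moves toward rvend
        "rvend": end,      # older bound
        "format": "json",
        "formatversion": 2,
    }
    while True:
        data = requests.get(API, params=params, timeout=30).json()
        for page in data["query"]["pages"]:
            for rev in page.get("revisions", []):
                yield rev["revid"], rev["timestamp"]
        if "continue" not in data:
            return
        params.update(data["continue"])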

Data handling & ethics

  • Aggregate reporting; no editor-level profiling; avoid naming individual contributors.

  • Be explicit that links ≠ endorsement and that some links originate in citations.

  • Respect rate limits; persist continue tokens; log parameters per request.

  • Record known gaps: some URLs may appear only inside citation templates or be obfuscated; sampling may miss short-lived links.

  • If this issue requires access to 311 data, please answer the following questions: Not applicable.

Project Outline (detailed plan for this idea):

Research question: Which sensitive-industry domains receive Wikipedia links, in which article topics, and how do link additions/removals evolve over time?

Data sources & modules

  • list=exturlusage to enumerate pages and URLs per target domain (current).
  • prop=revisions (+ content) to reconstruct add/remove events over time.
  • prop=categories to classify page topics.
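
A sketch of the categories module, assuming non-hidden categories are fetched for batches of pageids and rolled up into topic buckets downstream; the helper name is illustrative, and the 50-id batch size reflects the standard per-request cap for non-bot clients.

# Fetch non-hidden categories for a batch of pages (prop=categories, clshow=!hidden).
import requests

API = "https://en.wikipedia.org/w/api.php"

def page_categories(pageids):
    params = {
        "action": "query",
        "prop": "categories",
        "pageids": "|".join(map(str, pageids[:50])),  # max 50 ids per request for non-bot clients
        "clshow": "!hidden",
        "cllimit": "max",
        "format": "json",
        "formatversion": 2,
    }
    out = {}
    while True:
        data = requests.get(API, params=params, timeout=30).json()
        for page in data["query"]["pages"]:
            out.setdefault(page["pageid"], []).extend(
                c["title"] for c in page.get("categories", [])
            )
        if "continue" not in data:
            return out
        params.update(data["continue"])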

Method

  1. Build domains.yaml with canonical roots and regex variants for (a) payday lending, (b) crypto exchanges, (c) gig platforms.
  2. Use exturlusage to snapshot current links sitewide; deduplicate (pageid, url).
  3. For each linked page, sample revisions across months; fetch wikitext content and extract URLs; compute presence matrices per month; diff to detect adds/removes and estimate first/last seen times (see the diff sketch after this list).
  4. Join page categories → topic buckets; aggregate per domain × month: links, unique pages, adds, removes, net, churn, survival.
  5. Visualize domain trends and topic concentration; highlight pages and months with the largest swings.
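
A sketch of the diff in step 3, assuming each (pageid, domain) pair has already been reduced to a per-month presence flag from the sampled revisions; the input shape and function name are assumptions.

# Turn a per-month presence series for one (pageid, domain) pair into add/remove
# events and first/last seen estimates.
def diff_presence(presence):
    """presence maps 'YYYY-MM' -> True if the domain was linked in that month's sampled revision."""
    events, prev = [], False
    first_seen = last_seen = None
    for month in sorted(presence):
        cur = presence[month]
        if cur and not prev:
            events.append(("add", month))
            first_seen = first_seen or month
        elif prev and not cur:
            events.append(("remove", month))
        if cur:
            last_seen = month
        prev = cur
    return {"events": events, "first_seen": first_seen, "last_seen": last_seen}

# Usage:
# diff_presence({"2023-01": False, "2023-02": True, "2023-03": True, "2023-04": False})
# -> {'events': [('add', '2023-02'), ('remove', '2023-04')], 'first_seen': '2023-02', 'last_seen': '2023-03'}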

Key metrics

  • Links/month, unique pages/month, and net change.
  • Add/remove ratio (churn); median link survival time; survival @ 90 days.
  • Topic entropy (how concentrated links are across topics).
  • Top pages and domains by additions, removals, and persistence.
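
One possible computation of two of these metrics; the DataFrame column names (domain, topic, first_seen, last_seen) are illustrative assumptions.

# Topic entropy per domain and 90-day survival share, using pandas.
import numpy as np
import pandas as pd

def topic_entropy(df):
    """Shannon entropy of each domain's link distribution across topics."""
    counts = df.groupby(["domain", "topic"]).size().unstack(fill_value=0)
    p = counts.div(counts.sum(axis=1), axis=0)
    return -(p * np.log2(p.where(p > 0, 1))).sum(axis=1)

def survival_90d(df):
    """Share of links per domain that persisted at least 90 days."""
    lifespan = (pd.to_datetime(df["last_seen"]) - pd.to_datetime(df["first_seen"])).dt.days
    return df.assign(survived=lifespan >= 90).groupby("domain")["survived"].mean()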

Deliverables

  • Clean tables (extlinks_current.parquet, extlinks_events.parquet, domain_monthly.parquet, page_topics.parquet).
  • Reproducible notebook + reports/external_link_economy.md.
  • Streamlit/Altair dashboard (trend lines, adds/removals bars, top N lists, topic treemap).

Caveats & limitations

  • exturlusage reflects the current state; historical events require revision sampling and may miss very transient links.
  • URLs inside templates/citations can be harder to parse consistently; validate extractor on a labeled subset and report precision/recall.
  • Some domains use redirects/URL shorteners; expand where feasible and document assumptions.

Implementation notes

  • Keys: prefer pageid + normalized URL (scheme/host/path, with tracking query parameters stripped); see the normalization sketch below.
  • Error handling: try/except with clear error messages for missing files; fail fast with a full traceback on dtype mismatches.
  • Save run config, git commit hash, and seeds in artifact metadata; cache raw API responses and a query manifest for reproducibility.
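
One reading of the normalized-URL key above: lowercase scheme and host, trimmed path, tracking query parameters dropped; the TRACKING set is an illustrative assumption.

# Normalize a URL for use as a join key: keep scheme/host/path, drop common tracking params.
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

TRACKING = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content", "fbclid", "gclid"}

def normalize_url(url):
    parts = urlparse(url.strip())
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in TRACKING]
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",
        "",                 # drop the params component
        urlencode(query),   # keep only non-tracking query parameters
        "",                 # drop the fragment
    ))

# normalize_url("https://Example.com/page/?utm_source=x&id=7") -> "https://example.com/page?id=7"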
