
MediaWiki API Project: Geospatial Knowledge Coverage vs. Need

Open chinaexpert1 opened this issue 4 months ago • 0 comments

Overview

Measure geospatial knowledge coverage vs. need on Wikipedia by mapping where geotagged articles exist and normalizing by population and economic indicators. Deliver a reproducible dataset and dashboard that highlight under-documented regions and topic gaps.

Action Items

If this is the beginning (research & design)

  • Define spatial units (pick 1 to start): country-level, H3 grid (global at res 5–6), or city radii (e.g., 20 km around major cities).

  • Select normalization signals: population (required), GDP per capita (optional), poverty rate (optional). Prefer public sources (World Bank API) and/or Wikidata population (P1082) at country/city levels.

  • Topic taxonomy: settlements, natural features, culture/heritage, health/education, infrastructure (map via prop=categories and/or Wikidata P31).

  • Methods choice:

    • Track A (Action API first): tile world with H3 cells → list=geosearch sampling → dedupe pages → prop=coordinates and prop=categories.
    • Track B (hybrid with Wikidata): pull items with coordinates (P625), country (P17), population (P1082), instance of (P31) → map to enwiki sitelinks.
  • Tooling (choose pairs): requests or httpx; pandas or polars; duckdb or sqlite; H3 or geohash; altair or plotly. Optional: geopandas or shapely for country joins.
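Track A's entry point is one geosearch call per cell. A minimal sketch of the request construction, using only the standard library so it runs offline — in practice you would hand `params` to `requests.get(API, params=params)`. Parameter names follow API:Geosearch; the coordinates are placeholder values:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def geosearch_params(lat: float, lon: float, radius_m: int = 10000, limit: str = "max") -> dict:
    """Build query parameters for a list=geosearch call around one cell centroid."""
    return {
        "action": "query",
        "list": "geosearch",
        "gscoord": f"{lat}|{lon}",
        "gsradius": radius_m,   # metres; the API caps this at 10000
        "gslimit": limit,
        "format": "json",
        "formatversion": 2,
    }

params = geosearch_params(40.7128, -74.0060)
url = f"{API}?{urlencode(params)}"
print(url)
```

Note the 10 km ceiling on `gsradius`: it constrains how coarse the H3 grid can be while still covering the globe.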

If researched and ready (implementation steps)

  1. Seed spatial units

    • H3 grid: generate global cells at chosen resolution; store centroids + radius.
    • Country mode: list ISO countries (csv) for joins; optional Natural Earth boundaries if doing PIP offline.
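For the H3 option, a real implementation would enumerate cells with h3-py; as a dependency-free sketch of the same idea, here is a regular lat/lon grid stand-in that emits (centroid, radius) pairs, padding each radius so neighbouring circles overlap and nothing falls between sample points. The 10° step is for illustration only — with geosearch's 10 km radius cap you would use a much finer grid (or H3 res 5–6):

```python
import math

def seed_grid(step_deg: float = 2.0):
    """Generate (lat, lon, radius_m) centroids on a regular lat/lon grid.

    A stand-in for an H3 grid: the radius covers the cell's half-diagonal,
    shrunk east-west by cos(lat), padded 10% so adjacent circles overlap.
    """
    cells = []
    lat = -90 + step_deg / 2
    while lat < 90:
        lon = -180 + step_deg / 2
        while lon < 180:
            # half-diagonal of the cell in metres (~111 km per degree)
            half_diag = math.hypot(step_deg / 2,
                                   step_deg / 2 * math.cos(math.radians(lat)))
            radius_m = int(half_diag * 111_000 * 1.1)
            cells.append((round(lat, 4), round(lon, 4), radius_m))
            lon += step_deg
        lat += step_deg
    return cells

cells = seed_grid(step_deg=10.0)
print(len(cells))  # 648 cells at a 10° step (18 latitude rows x 36 longitude columns)
```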
  2. Collect geotagged pages

    • H3 sampling: for each cell center → list=geosearch (gscoord, gsradius, gslimit=max), dedupe by pageid.
    • Enrich: prop=coordinates (type, dim) and prop=categories&clshow=!hidden for topic tags.
    • Map to QIDs: prop=pageprops&ppprop=wikibase_item → Wikidata wbgetentities for P31 (topic) and P17/P131 (location) when available.
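Two mechanics in this step deserve a sketch: following the Action API's generic `continue` tokens, and deduplicating by pageid across overlapping cells. The fetch function below is a stub so the example runs offline; in practice it would be a thin requests/httpx wrapper, and whether a given list module actually paginates at `gslimit=max` varies, so the continuation handling is written generically:

```python
def paged_query(fetch, params):
    """Iterate a MediaWiki Action API query, following 'continue' tokens.

    `fetch` is any callable returning decoded JSON for given params
    (a requests/httpx wrapper in practice; stubbed here for testing).
    """
    params = dict(params)
    while True:
        data = fetch(params)
        yield data
        if "continue" not in data:
            break
        params.update(data["continue"])  # resume where the server left off

def collect_pages(fetch, cell_params_list):
    """Run geosearch over many cells, deduplicating results by pageid."""
    seen = {}
    for params in cell_params_list:
        for page in (p for d in paged_query(fetch, params)
                     for p in d.get("query", {}).get("geosearch", [])):
            seen.setdefault(page["pageid"], page)  # first hit wins
    return list(seen.values())

# Stubbed fetch: two overlapping cells, one of them paginated.
responses = {
    "cellA": [{"query": {"geosearch": [{"pageid": 1}, {"pageid": 2}]},
               "continue": {"gscontinue": "x"}},
              {"query": {"geosearch": [{"pageid": 3}]}}],
    "cellB": [{"query": {"geosearch": [{"pageid": 2}, {"pageid": 4}]}}],
}
def fake_fetch(params, _state={}):
    key = params["cell"]
    i = _state.get(key, 0)
    _state[key] = i + 1
    return responses[key][i]

pages = collect_pages(fake_fetch, [{"cell": "cellA"}, {"cell": "cellB"}])
print(sorted(p["pageid"] for p in pages))  # → [1, 2, 3, 4]
```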
  3. Normalization data

    • Country-level: population, GDP per capita from World Bank API or Wikidata; city-level: population via Wikidata P1082. Cache to norm_metrics.parquet.
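For the World Bank path, the Indicators API is addressed by indicator code and year; pinning the year keeps the normalization snapshot reproducible (see caveats below). A sketch that only builds the request URLs — `SP.POP.TOTL` (total population) and `NY.GDP.PCAP.CD` (GDP per capita, current US$) are the standard indicator codes:

```python
WB_BASE = "https://api.worldbank.org/v2/country/all/indicator"

# Well-known indicator codes: total population and GDP per capita (current US$)
INDICATORS = {"population": "SP.POP.TOTL", "gdp_per_capita": "NY.GDP.PCAP.CD"}

def wb_url(indicator: str, year: int) -> str:
    """Build a World Bank Indicators API request for one indicator-year.

    per_page=400 fits all countries/aggregates in a single response page.
    """
    return f"{WB_BASE}/{indicator}?format=json&date={year}&per_page=400"

urls = {name: wb_url(code, 2022) for name, code in INDICATORS.items()}
print(urls["population"])
```

Cache the decoded responses to `norm_metrics.parquet` as the step describes, keyed by (iso3, indicator, year).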
  4. Feature engineering

    • Per unit (country/H3/city): geotagged_article_count, per-topic counts, articles per 1M people, articles per 1k km², and coverage z-scores within region/income group.
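The densities and z-scores above reduce to a few lines; a pure-Python sketch (pandas/polars would vectorize the same arithmetic across units), with illustrative numbers:

```python
from statistics import mean, pstdev

def unit_metrics(count, population, area_km2):
    """Raw coverage densities for one spatial unit."""
    return {
        "per_1m_people": count * 1_000_000 / population,
        "per_1k_km2": count * 1_000 / area_km2,
    }

def group_zscores(values):
    """Z-scores of per-capita coverage within one peer group (e.g. income band)."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]

m = unit_metrics(count=120, population=4_000_000, area_km2=50_000)
print(m)  # {'per_1m_people': 30.0, 'per_1k_km2': 2.4}
zs = group_zscores([30.0, 10.0, 20.0])
print(zs[0] > 0 and zs[1] < 0)  # True: first unit above, second below group mean
```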
  5. Analysis

    • Identify under-documented areas (bottom decile per capita coverage), topic skews, and correlations with GDPpc/poverty. Produce hotspot maps and league tables.
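"Bottom decile" here means units whose per-capita coverage falls at or below the sample's 10th percentile. A sketch using the standard library's `statistics.quantiles` (its default exclusive method), on synthetic coverage values:

```python
from statistics import quantiles

def bottom_decile(units):
    """Flag units whose per-capita coverage is in the bottom 10% of the sample.

    `units` maps unit id -> articles per 1M people.
    """
    cut = quantiles(list(units.values()), n=10)[0]  # 10th-percentile cut point
    return sorted(u for u, v in units.items() if v <= cut)

coverage = {f"unit{i}": float(i) for i in range(1, 21)}  # 1.0 .. 20.0
print(bottom_decile(coverage))  # → ['unit1', 'unit2']
```

For the correlation side, `statistics.correlation` (Python 3.10+) or pandas `corr` against GDPpc works on the same per-unit table.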
  6. Deliver

    • Artifacts: pages_geo.parquet, unit_summary.parquet, topic_summary.parquet, norm_metrics.parquet, metrics.csv.
    • Dashboard: choropleth (country) or H3 hex map, topic mix bars, and “most under-covered” table.
    • Methods README with API params, sampling trade-offs, and limitations.
  7. Quality & Ops

    • Caching; retries with exponential backoff; persist continue tokens.
    • Tests: dedupe integrity, cell-to-page coverage rate, topic mapping precision on a labeled sample.
    • Error handling: catch missing input files and report a clear, actionable message; fail fast with a full traceback on dtype mismatches; log warnings for recoverable issues.
    • Optional scheduled monthly refresh (GitHub Actions).
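The retry policy above can be isolated in one helper. A sketch of exponential backoff with jitter; the `sleep` function is injectable so tests (and dry runs) don't actually wait, and the flaky callable is a stub standing in for an API request:

```python
import random
import time

def with_retries(call, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a callable with exponential backoff and jitter.

    `sleep` is injectable so tests can run without waiting.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # 1s, 2s, 4s, ... plus jitter to avoid synchronized retries
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))

# A stub that fails twice, then succeeds.
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient")
    return "ok"

delays = []
result = with_retries(flaky, sleep=delays.append)
print(result, state["calls"], len(delays))  # ok 3 2
```

Persisting continue tokens is orthogonal: write the latest `continue` dict to the query manifest after each response so an interrupted crawl resumes instead of restarting.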

Resources/Instructions

MediaWiki / Wikidata docs

  • Action API overview: API:Action_API
  • Geosearch: API:Geosearch
  • Coordinates: API:Coordinates
  • Categories: API:Categories
  • Pageprops (QID): API:Pageprops
  • Wikidata API: wbgetentities (for P31/P17/P1082)
  • World Bank API (optional normalization): Indicators for population and GDP per capita

Suggested libraries (choose pairs)

  • HTTP: requests | httpx
  • DataFrames: pandas | polars
  • Spatial index: h3 | pygeohash
  • Storage: duckdb | sqlite
  • Viz: altair | plotly
  • (Optional) Geospatial: geopandas | shapely

Sample queries (copy to notes)

# Geosearch around a cell centroid (10 km radius)
action=query&list=geosearch&gscoord=<LAT>|<LON>&gsradius=10000&gslimit=max

# Get coordinates for a batch of pages
action=query&prop=coordinates&titles=<TITLE1>|<TITLE2>|...&colimit=max

# Categories for topic mapping (exclude hidden)
action=query&prop=categories&clshow=!hidden&titles=<TITLE>

# Page → Wikidata QID
action=query&prop=pageprops&ppprop=wikibase_item&pageids=<PAGEID>

# Wikidata entity lookup (topic, country, population)
action=wbgetentities&ids=<QID>&props=claims|labels|sitelinks

Ethics & reporting

  • Focus on aggregate coverage; do not profile individual editors or communities.

  • Be explicit: article counts ≠ content quality; coverage ≠ endorsement.

  • Document sampling bias (H3 gaps, language bias) and missingness (uncoordinated pages).

  • If this issue requires access to 311 data, please answer the following questions:

    • Not applicable — this project uses no 311 data.

Project Outline (detailed plan for this idea):

Research question: Are locations in low-income or underserved regions under-documented on Wikipedia relative to population and geography, and if so, which topics are most lacking?

Data sources & modules

  • Action API: list=geosearch, prop=coordinates, prop=categories, prop=pageprops.
  • Wikidata: wbgetentities for P31 (instance of), P17 (country), P1082 (population).
  • Normalizers: World Bank population/GDPpc (or Wikidata where available).

Method

  1. Choose spatial unit (country or H3).
  2. Collect geotagged pages via geosearch sampling; dedupe by pageid.
  3. Enrich with categories and Wikidata P31/P17; assign each page to a topic bucket and a spatial unit.
  4. Join population/GDPpc; compute per-capita and per-area article densities and topic shares.
  5. Rank units by under-/over-coverage; compute correlations with GDPpc/poverty; visualize gaps.

Key metrics

  • Articles per 1M people; articles per 1k km².
  • Topic mix entropy by region; deficit index vs median of peer income group.
  • Top under-covered regions (overall and per topic).
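Two of these metrics warrant a concrete form: topic mix entropy (Shannon entropy of a unit's topic shares — higher means more even coverage across buckets) and a deficit index relative to the peer-group median. A small sketch with illustrative numbers:

```python
import math

def topic_entropy(counts):
    """Shannon entropy (bits) of a unit's topic mix; higher = more even coverage."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def deficit_index(per_capita, peer_median):
    """Coverage relative to the median of the unit's income-group peers.

    Negative values mean the unit is under-covered versus its peers.
    """
    return per_capita / peer_median - 1.0

even = topic_entropy({"settlements": 5, "culture": 5})
skew = topic_entropy({"settlements": 9, "culture": 1})
print(even > skew)                # True: a balanced topic mix has higher entropy
print(deficit_index(12.0, 20.0))  # -0.4: 40% below the peer-group median
```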

Deliverables

  • Clean tables (pages_geo.parquet, unit_summary.parquet, topic_summary.parquet, norm_metrics.parquet).
  • Reproducible notebook + reports/geo_coverage_vs_need.md.
  • Dashboard: map + tables + topic filters.

Caveats & limitations

  • Geosearch sampling may miss short-lived or uncategorized pages; language coverage is uneven.
  • Country joins via P17/P131 can be incomplete; document fallback logic.
  • Per-capita normalization depends on the chosen year/source; pin the version and cite.

Implementation notes

  • Keys: (pageid) for pages; (h3,res) or (iso3) for units.
  • Maintain a query manifest; cache raw responses.
  • Provide a small hand-labeled set to validate topic mapping rules; report precision/recall.

chinaexpert1 — Sep 13 '25 21:09