MediaWiki API Project: Geospatial Knowledge Coverage vs. Need
Overview
Measure geospatial knowledge coverage vs. need on Wikipedia by mapping where geotagged articles exist and normalizing by population and economic indicators. Deliver a reproducible dataset and dashboard that highlight under-documented regions and topic gaps.
Action Items
If this is the beginning (research & design)
- Define spatial units (pick 1 to start): country-level, H3 grid (global at res 5–6), or city radii (e.g., 20 km around major cities).
- Select normalization signals: population (required), GDP per capita (optional), poverty rate (optional). Prefer public sources (World Bank API) and/or Wikidata population (P1082) at country/city levels.
- Topic taxonomy: settlements, natural features, culture/heritage, health/education, infrastructure (map via `prop=categories` and/or Wikidata P31).
- Methods choice:
  - Track A (Action API first): tile world with H3 cells → `list=geosearch` sampling → dedupe pages → `prop=coordinates` and `prop=categories`.
  - Track B (hybrid with Wikidata): pull items with coordinates (P625), country (P17), population (P1082), instance of (P31) → map to enwiki sitelinks.
- Tooling (choose pairs): `requests` or `httpx`; `pandas` or `polars`; `duckdb` or `sqlite`; H3 or geohash; `altair` or `plotly`. Optional: `geopandas` or `shapely` for country joins.
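As a rough illustration of the topic taxonomy step, the sketch below assigns a page to a bucket by matching keywords in its category titles. The keyword lists and the `topic_bucket` helper are hypothetical placeholders, not a vetted rule set; a real mapping would be tuned against a labeled sample and could fall back to Wikidata P31.

```python
# Hypothetical keyword rules: bucket name -> substrings looked for in category titles.
# These lists are illustrative assumptions, not a validated taxonomy.
TOPIC_RULES = {
    "settlements": ("populated places", "villages", "towns", "cities"),
    "natural features": ("mountains", "rivers", "lakes", "islands"),
    "culture/heritage": ("museums", "monuments", "heritage", "temples"),
    "health/education": ("hospitals", "schools", "universities"),
    "infrastructure": ("airports", "bridges", "railway stations", "dams"),
}


def topic_bucket(categories):
    """Return the first bucket with a keyword in any category title, else 'other'."""
    lowered = [c.lower() for c in categories]
    for bucket, keywords in TOPIC_RULES.items():
        if any(k in cat for cat in lowered for k in keywords):
            return bucket
    return "other"
```

Because the first matching bucket wins, rule order matters; pages that match nothing land in `other` and are good candidates for the hand-labeled validation set.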
If researched and ready (implementation steps)
- Seed spatial units
  - H3 grid: generate global cells at chosen resolution; store centroids + radius.
  - Country mode: list ISO countries (CSV) for joins; optional Natural Earth boundaries if doing point-in-polygon (PIP) joins offline.
- Collect geotagged pages
  - H3 sampling: for each cell center → `list=geosearch` (`gscoord`, `gsradius`, `gslimit=max`), dedupe by `pageid`.
  - Enrich: `prop=coordinates` (type, dim; request them with `coprop=type|dim`) and `prop=categories&clshow=!hidden` for topic tags.
  - Map to QIDs: `prop=pageprops&ppprop=wikibase_item` → Wikidata `wbgetentities` for P31 (topic) and P17/P131 (location) when available.
- Normalization data
  - Country-level: population, GDP per capita from World Bank API or Wikidata; city-level: population via Wikidata P1082. Cache to `norm_metrics.parquet`.
- Feature engineering
  - Per unit (country/H3/city): `geotagged_article_count`, per-topic counts, articles per 1M people, articles per 1k km², and coverage z-scores within region/income group.
- Analysis
  - Identify under-documented areas (bottom decile per capita coverage), topic skews, and correlations with GDPpc/poverty. Produce hotspot maps and league tables.
- Deliver
  - Artifacts: `pages_geo.parquet`, `unit_summary.parquet`, `topic_summary.parquet`, `norm_metrics.parquet`, `metrics.csv`.
  - Dashboard: choropleth (country) or H3 hex map, topic mix bars, and “most under-covered” table.
  - Methods README with API params, sampling trade-offs, and limitations.
- Quality & Ops
  - Caching; retries with exponential backoff; persist `continue` tokens.
  - Tests: dedupe integrity, cell-to-page coverage rate, topic mapping precision on a labeled sample.
  - Error handling: `try/except` with clear `print()` on file-not-found; terminate with a trace on dtype mismatches; log warnings for recoverable issues.
  - Optional scheduled monthly refresh (GitHub Actions).
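The "retries with exponential backoff" item under Quality & Ops could look roughly like the sketch below. The `call_with_backoff` helper, its defaults, and the injectable `sleep` argument are assumptions for illustration; a real client would likely catch only network or rate-limit errors instead of every exception.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn() and retry on failure, doubling the wait each attempt (plus jitter).

    Hypothetical helper: the broad `except Exception` is for brevity only.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in production the default `time.sleep` applies.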
Resources/Instructions
MediaWiki / Wikidata docs
- Action API overview: `API:Action_API`
- Geosearch: `API:Geosearch`
- Coordinates: `API:Coordinates`
- Categories: `API:Categories`
- Pageprops (QID): `API:Pageprops`
- Wikidata API: `wbgetentities` (for P31/P17/P1082)
- World Bank API (optional normalization): Indicators for population and GDP per capita
Suggested libraries (choose pairs)
- HTTP: `requests` | `httpx`
- DataFrames: `pandas` | `polars`
- Spatial index: `h3` | `pygeohash`
- Storage: `duckdb` | `sqlite`
- Viz: `altair` | `plotly`
- (Optional) Geospatial: `geopandas` | `shapely`
Sample queries (copy to notes)
# Geosearch around a cell centroid (10 km radius)
action=query&list=geosearch&gscoord=<LAT>|<LON>&gsradius=10000&gslimit=max
# Get coordinates for a batch of pages
action=query&prop=coordinates&titles=<TITLE1>|<TITLE2>|...&colimit=max
# Categories for topic mapping (exclude hidden)
action=query&prop=categories&clshow=!hidden&titles=<TITLE>
# Page → Wikidata QID
action=query&prop=pageprops&ppprop=wikibase_item&pageids=<PAGEID>
# Wikidata entity lookup (topic, country, population)
action=wbgetentities&ids=<QID>&props=claims|labels|sitelinks
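The geosearch query above could be assembled and post-processed along these lines. This is a minimal sketch: `geosearch_params` and `dedupe_pages` are hypothetical helper names, and the response shape assumed here (`query.geosearch` as a list of hits carrying `pageid`) is what the Action API returns with `format=json`.

```python
def geosearch_params(lat, lon, radius_m=10000):
    """Build Action API parameters for one list=geosearch request."""
    return {
        "action": "query",
        "format": "json",
        "list": "geosearch",
        "gscoord": f"{lat}|{lon}",
        "gsradius": radius_m,
        "gslimit": "max",
    }


def dedupe_pages(responses):
    """Merge geosearch hits from many responses, keeping the first hit per pageid."""
    pages = {}
    for resp in responses:
        for hit in resp.get("query", {}).get("geosearch", []):
            pages.setdefault(hit["pageid"], hit)
    return pages
```

The parameter dict can be passed as `params=` to `requests.get` or `httpx.get` against a wiki's `api.php` endpoint (for English Wikipedia, `https://en.wikipedia.org/w/api.php`); overlapping cells will return the same page more than once, which is why the merge is keyed on `pageid`.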
Ethics & reporting
- Focus on aggregate coverage; do not profile individual editors or communities.
- Be explicit: article counts ≠ content quality; coverage ≠ endorsement.
- Document sampling bias (H3 gaps, language bias) and missingness (uncoordinated pages).
Project Outline (detailed plan for this idea)
Research question: Are locations in low-income or underserved regions under-documented on Wikipedia relative to population and geography, and which topics are most lacking?
Data sources & modules
- Action API: `list=geosearch`, `prop=coordinates`, `prop=categories`, `prop=pageprops`.
- Wikidata: `wbgetentities` for P31 (instance of), P17 (country), P1082 (population).
- Normalizers: World Bank population/GDPpc (or Wikidata where available).
Method
- Choose spatial unit (country or H3).
- Collect geotagged pages via geosearch sampling; dedupe by `pageid`.
- Enrich with categories and Wikidata P31/P17; assign each page to a topic bucket and a spatial unit.
- Join population/GDPpc; compute per-capita and per-area article densities and topic shares.
- Rank units by under-/over-coverage; compute correlations with GDPpc/poverty; visualize gaps.
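The join-and-density step in the method can be sketched in plain Python as follows. The `unit_summary` function and the input shapes are assumptions for illustration; the output field names follow the features listed in this issue where they exist (`geotagged_article_count`) and are otherwise made up.

```python
from collections import Counter


def unit_summary(page_units, norm):
    """Compute per-unit article counts and densities.

    page_units: {pageid: unit_key}, one entry per deduplicated page.
    norm: {unit_key: {"population": int, "area_km2": float}} (hypothetical shape).
    """
    counts = Counter(page_units.values())
    rows = {}
    for unit, metrics in norm.items():
        n = counts.get(unit, 0)  # units with no pages still get a row
        rows[unit] = {
            "geotagged_article_count": n,
            "articles_per_1m_people": n * 1_000_000 / metrics["population"],
            "articles_per_1k_km2": n * 1_000 / metrics["area_km2"],
        }
    return rows
```

Iterating over `norm` instead of the observed counts keeps zero-article units in the table, which matters here because those are exactly the units the analysis is looking for.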
Key metrics
- Articles per 1M people; articles per 1k km².
- Topic mix entropy by region; deficit index vs median of peer income group.
- Top under-covered regions (overall and per topic).
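The issue does not define the entropy or deficit metrics precisely, so the sketch below makes two assumptions: topic mix entropy is taken as Shannon entropy (in bits) of topic shares, and the deficit index is taken as one minus the ratio of a unit's value to the median of its peer group. Both function names are hypothetical.

```python
import math
from statistics import median


def topic_entropy(topic_counts):
    """Shannon entropy (bits) of topic shares; 0 when all pages share one topic."""
    total = sum(topic_counts.values())
    if total == 0:
        return 0.0
    shares = [c / total for c in topic_counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in shares)


def deficit_index(value, peer_values):
    """Assumed definition: 1 - value / peer median. Positive means below the median."""
    peer_median = median(peer_values)
    if peer_median == 0:
        return 0.0  # no meaningful baseline to compare against
    return 1 - value / peer_median
```

Under these definitions, a unit with half its peers' median per-capita coverage scores 0.5, and a unit above the median scores negative.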
Deliverables
- Clean tables (`pages_geo.parquet`, `unit_summary.parquet`, `topic_summary.parquet`, `norm_metrics.parquet`).
- Reproducible notebook + `reports/geo_coverage_vs_need.md`.
- Dashboard: map + tables + topic filters.
Caveats & limitations
- Geosearch sampling may miss short-lived or uncategorized pages, and each request is capped by `gslimit`, so dense cells can be truncated; language coverage is uneven.
- Country joins via P17/P131 can be incomplete; document fallback logic.
- Per-capita normalization depends on the chosen year/source; pin the version and cite.
Implementation notes
- Keys: `(pageid)` for pages; `(h3,res)` or `(iso3)` for units.
- Maintain a query manifest; cache raw responses.
- Provide a small hand-labeled set to validate topic mapping rules; report precision/recall.
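One possible shape for the query manifest and raw-response cache is sketched below. The key scheme (a truncated SHA-256 of the sorted, URL-encoded parameters), the `cache/` path, and the manifest field names are all assumptions chosen for illustration.

```python
import hashlib
import json
from urllib.parse import urlencode


def cache_key(params):
    """Stable key for a query: hash of the URL-encoded parameters in sorted order."""
    canonical = urlencode(sorted(params.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def manifest_entry(params, fetched_at):
    """One JSON Lines record pointing at the cached raw response (hypothetical layout)."""
    key = cache_key(params)
    record = {
        "key": key,
        "params": params,
        "fetched_at": fetched_at,
        "raw_path": f"cache/{key}.json",
    }
    return json.dumps(record, sort_keys=True)
```

Sorting the parameters before hashing means the same query always maps to the same cache file regardless of how the dict was built, so reruns can skip requests that already have a manifest entry.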