
MediaWiki API Project: Geospatial Knowledge Coverage vs. Need

Open chinaexpert1 opened this issue 4 months ago • 0 comments

Overview

Measure geospatial knowledge coverage vs. need on Wikipedia by mapping where geotagged articles exist and normalizing by population and economic indicators. Deliver a reproducible dataset and dashboard that highlight under-documented regions and topic gaps.

Action Items

If this is the beginning (research & design)

  • Define spatial units (pick 1 to start): country-level, H3 grid (global at res 5–6), or city radii (e.g., 20 km around major cities).

  • Select normalization signals: population (required), GDP per capita (optional), poverty rate (optional). Prefer public sources (World Bank API) and/or Wikidata population (P1082) at country/city levels.

  • Topic taxonomy: settlements, natural features, culture/heritage, health/education, infrastructure (map via prop=categories and/or Wikidata P31).

  • Methods choice:

    • Track A (Action API first): tile world with H3 cells → list=geosearch sampling → dedupe pages → prop=coordinates and prop=categories.
    • Track B (hybrid with Wikidata): pull items with coordinates (P625), country (P17), population (P1082), instance of (P31) → map to enwiki sitelinks.
  • Tooling (choose pairs): requests or httpx; pandas or polars; duckdb or sqlite; H3 or geohash; altair or plotly. Optional: geopandas or shapely for country joins.
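Track A's entry point is one geosearch call per cell. A minimal sketch of the request construction, using only the standard library so it runs offline — in practice you would hand `params` to `requests.get(API, params=params)`. Parameter names follow API:Geosearch; the coordinates are placeholder values:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def geosearch_params(lat: float, lon: float, radius_m: int = 10000, limit: str = "max") -> dict:
    """Build query parameters for a list=geosearch call around one cell centroid."""
    return {
        "action": "query",
        "list": "geosearch",
        "gscoord": f"{lat}|{lon}",
        "gsradius": radius_m,   # metres; the API caps this at 10000
        "gslimit": limit,
        "format": "json",
        "formatversion": 2,
    }

params = geosearch_params(40.7128, -74.0060)
url = f"{API}?{urlencode(params)}"
print(url)
```

Note the 10 km ceiling on `gsradius`: it constrains how coarse the H3 grid can be while still covering the globe.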

If researched and ready (implementation steps)

  1. Seed spatial units

    • H3 grid: generate global cells at chosen resolution; store centroids + radius.
    • Country mode: list ISO countries (csv) for joins; optional Natural Earth boundaries if doing PIP offline.
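For the H3 option, a real implementation would enumerate cells with h3-py; as a dependency-free sketch of the same idea, here is a regular lat/lon grid stand-in that emits (centroid, radius) pairs, padding each radius so neighbouring circles overlap and nothing falls between sample points. The 10° step is for illustration only — with geosearch's 10 km radius cap you would use a much finer grid (or H3 res 5–6):

```python
import math

def seed_grid(step_deg: float = 2.0):
    """Generate (lat, lon, radius_m) centroids on a regular lat/lon grid.

    A stand-in for an H3 grid: the radius covers the cell's half-diagonal,
    shrunk east-west by cos(lat), padded 10% so adjacent circles overlap.
    """
    cells = []
    lat = -90 + step_deg / 2
    while lat < 90:
        lon = -180 + step_deg / 2
        while lon < 180:
            # half-diagonal of the cell in metres (~111 km per degree)
            half_diag = math.hypot(step_deg / 2,
                                   step_deg / 2 * math.cos(math.radians(lat)))
            radius_m = int(half_diag * 111_000 * 1.1)
            cells.append((round(lat, 4), round(lon, 4), radius_m))
            lon += step_deg
        lat += step_deg
    return cells

cells = seed_grid(step_deg=10.0)
print(len(cells))  # 648 cells at a 10° step (18 latitude rows x 36 longitude columns)
```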
  2. Collect geotagged pages

    • H3 sampling: for each cell center → list=geosearch (gscoord, gsradius, gslimit=max), dedupe by pageid.
    • Enrich: prop=coordinates (type, dim) and prop=categories&clshow=!hidden for topic tags.
    • Map to QIDs: prop=pageprops&ppprop=wikibase_item → Wikidata wbgetentities for P31 (topic) and P17/P131 (location) when available.
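Two mechanics in this step deserve a sketch: following the Action API's generic `continue` tokens, and deduplicating by pageid across overlapping cells. The fetch function below is a stub so the example runs offline; in practice it would be a thin requests/httpx wrapper, and whether a given list module actually paginates at `gslimit=max` varies, so the continuation handling is written generically:

```python
def paged_query(fetch, params):
    """Iterate a MediaWiki Action API query, following 'continue' tokens.

    `fetch` is any callable returning decoded JSON for given params
    (a requests/httpx wrapper in practice; stubbed here for testing).
    """
    params = dict(params)
    while True:
        data = fetch(params)
        yield data
        if "continue" not in data:
            break
        params.update(data["continue"])  # resume where the server left off

def collect_pages(fetch, cell_params_list):
    """Run geosearch over many cells, deduplicating results by pageid."""
    seen = {}
    for params in cell_params_list:
        for page in (p for d in paged_query(fetch, params)
                     for p in d.get("query", {}).get("geosearch", [])):
            seen.setdefault(page["pageid"], page)  # first hit wins
    return list(seen.values())

# Stubbed fetch: two overlapping cells, one of them paginated.
responses = {
    "cellA": [{"query": {"geosearch": [{"pageid": 1}, {"pageid": 2}]},
               "continue": {"gscontinue": "x"}},
              {"query": {"geosearch": [{"pageid": 3}]}}],
    "cellB": [{"query": {"geosearch": [{"pageid": 2}, {"pageid": 4}]}}],
}
def fake_fetch(params, _state={}):
    key = params["cell"]
    i = _state.get(key, 0)
    _state[key] = i + 1
    return responses[key][i]

pages = collect_pages(fake_fetch, [{"cell": "cellA"}, {"cell": "cellB"}])
print(sorted(p["pageid"] for p in pages))  # → [1, 2, 3, 4]
```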
  3. Normalization data

    • Country-level: population, GDP per capita from World Bank API or Wikidata; city-level: population via Wikidata P1082. Cache to norm_metrics.parquet.
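For the World Bank path, the Indicators API is addressed by indicator code and year; pinning the year keeps the normalization snapshot reproducible (see caveats below). A sketch that only builds the request URLs — `SP.POP.TOTL` (total population) and `NY.GDP.PCAP.CD` (GDP per capita, current US$) are the standard indicator codes:

```python
WB_BASE = "https://api.worldbank.org/v2/country/all/indicator"

# Well-known indicator codes: total population and GDP per capita (current US$)
INDICATORS = {"population": "SP.POP.TOTL", "gdp_per_capita": "NY.GDP.PCAP.CD"}

def wb_url(indicator: str, year: int) -> str:
    """Build a World Bank Indicators API request for one indicator-year.

    per_page=400 fits all countries/aggregates in a single response page.
    """
    return f"{WB_BASE}/{indicator}?format=json&date={year}&per_page=400"

urls = {name: wb_url(code, 2022) for name, code in INDICATORS.items()}
print(urls["population"])
```

Cache the decoded responses to `norm_metrics.parquet` as the step describes, keyed by (iso3, indicator, year).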
  4. Feature engineering

    • Per unit (country/H3/city): geotagged_article_count, per-topic counts, articles per 1M people, articles per 1k km², and coverage z-scores within region/income group.
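The densities and z-scores above reduce to a few lines; a pure-Python sketch (pandas/polars would vectorize the same arithmetic across units), with illustrative numbers:

```python
from statistics import mean, pstdev

def unit_metrics(count, population, area_km2):
    """Raw coverage densities for one spatial unit."""
    return {
        "per_1m_people": count * 1_000_000 / population,
        "per_1k_km2": count * 1_000 / area_km2,
    }

def group_zscores(values):
    """Z-scores of per-capita coverage within one peer group (e.g. income band)."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]

m = unit_metrics(count=120, population=4_000_000, area_km2=50_000)
print(m)  # {'per_1m_people': 30.0, 'per_1k_km2': 2.4}
zs = group_zscores([30.0, 10.0, 20.0])
print(zs[0] > 0 and zs[1] < 0)  # True: first unit above, second below group mean
```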
  5. Analysis

    • Identify under-documented areas (bottom decile per capita coverage), topic skews, and correlations with GDPpc/poverty. Produce hotspot maps and league tables.
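"Bottom decile" here means units whose per-capita coverage falls at or below the sample's 10th percentile. A sketch using the standard library's `statistics.quantiles` (its default exclusive method), on synthetic coverage values:

```python
from statistics import quantiles

def bottom_decile(units):
    """Flag units whose per-capita coverage is in the bottom 10% of the sample.

    `units` maps unit id -> articles per 1M people.
    """
    cut = quantiles(list(units.values()), n=10)[0]  # 10th-percentile cut point
    return sorted(u for u, v in units.items() if v <= cut)

coverage = {f"unit{i}": float(i) for i in range(1, 21)}  # 1.0 .. 20.0
print(bottom_decile(coverage))  # → ['unit1', 'unit2']
```

For the correlation side, `statistics.correlation` (Python 3.10+) or pandas `corr` against GDPpc works on the same per-unit table.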
  6. Deliver

    • Artifacts: pages_geo.parquet, unit_summary.parquet, topic_summary.parquet, norm_metrics.parquet, metrics.csv.
    • Dashboard: choropleth (country) or H3 hex map, topic mix bars, and “most under-covered” table.
    • Methods README with API params, sampling trade-offs, and limitations.
  7. Quality & Ops

    • Caching; retries with exponential backoff; persist continue tokens.
    • Tests: dedupe integrity, cell-to-page coverage rate, topic mapping precision on a labeled sample.
    • Error handling: catch missing input files and report a clear, actionable message; fail fast with a full traceback on dtype mismatches; log warnings for recoverable issues.
    • Optional scheduled monthly refresh (GitHub Actions).
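The retry policy above can be isolated in one helper. A sketch of exponential backoff with jitter; the `sleep` function is injectable so tests (and dry runs) don't actually wait, and the flaky callable is a stub standing in for an API request:

```python
import random
import time

def with_retries(call, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a callable with exponential backoff and jitter.

    `sleep` is injectable so tests can run without waiting.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # 1s, 2s, 4s, ... plus jitter to avoid synchronized retries
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))

# A stub that fails twice, then succeeds.
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient")
    return "ok"

delays = []
result = with_retries(flaky, sleep=delays.append)
print(result, state["calls"], len(delays))  # ok 3 2
```

Persisting continue tokens is orthogonal: write the latest `continue` dict to the query manifest after each response so an interrupted crawl resumes instead of restarting.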

Resources/Instructions

MediaWiki / Wikidata docs

  • Action API overview: API:Action_API
  • Geosearch: API:Geosearch
  • Coordinates: API:Coordinates
  • Categories: API:Categories
  • Pageprops (QID): API:Pageprops
  • Wikidata API: wbgetentities (for P31/P17/P1082)
  • World Bank API (optional normalization): Indicators for population and GDP per capita

Suggested libraries (choose pairs)

  • HTTP: requests | httpx
  • DataFrames: pandas | polars
  • Spatial index: h3 | pygeohash
  • Storage: duckdb | sqlite
  • Viz: altair | plotly
  • (Optional) Geospatial: geopandas | shapely

Sample queries (copy to notes)

# Geosearch around a cell centroid (10 km radius)
action=query&list=geosearch&gscoord=<LAT>|<LON>&gsradius=10000&gslimit=max

# Get coordinates for a batch of pages
action=query&prop=coordinates&titles=<TITLE1>|<TITLE2>|...&colimit=max

# Categories for topic mapping (exclude hidden)
action=query&prop=categories&clshow=!hidden&titles=<TITLE>

# Page → Wikidata QID
action=query&prop=pageprops&ppprop=wikibase_item&pageids=<PAGEID>

# Wikidata entity lookup (topic, country, population)
action=wbgetentities&ids=<QID>&props=claims|labels|sitelinks

Ethics & reporting

  • Focus on aggregate coverage; do not profile individual editors or communities.

  • Be explicit: article counts ≠ content quality; coverage ≠ endorsement.

  • Document sampling bias (H3 gaps, language bias) and missingness (uncoordinated pages).

  • If this issue requires access to 311 data, please answer the following questions:

    • Not applicable — this project uses no 311 data.

Project Outline (detailed plan for this idea):

Research question: Are locations in low-income or underserved regions under-documented on Wikipedia relative to population and geography, and if so, which topics are most lacking?

Data sources & modules

  • Action API: list=geosearch, prop=coordinates, prop=categories, prop=pageprops.
  • Wikidata: wbgetentities for P31 (instance of), P17 (country), P1082 (population).
  • Normalizers: World Bank population/GDPpc (or Wikidata where available).

Method

  1. Choose spatial unit (country or H3).
  2. Collect geotagged pages via geosearch sampling; dedupe by pageid.
  3. Enrich with categories and Wikidata P31/P17; assign each page to a topic bucket and a spatial unit.
  4. Join population/GDPpc; compute per-capita and per-area article densities and topic shares.
  5. Rank units by under-/over-coverage; compute correlations with GDPpc/poverty; visualize gaps.

Key metrics

  • Articles per 1M people; articles per 1k km².
  • Topic mix entropy by region; deficit index vs median of peer income group.
  • Top under-covered regions (overall and per topic).
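Two of these metrics warrant a concrete form: topic mix entropy (Shannon entropy of a unit's topic shares — higher means more even coverage across buckets) and a deficit index relative to the peer-group median. A small sketch with illustrative numbers:

```python
import math

def topic_entropy(counts):
    """Shannon entropy (bits) of a unit's topic mix; higher = more even coverage."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def deficit_index(per_capita, peer_median):
    """Coverage relative to the median of the unit's income-group peers.

    Negative values mean the unit is under-covered versus its peers.
    """
    return per_capita / peer_median - 1.0

even = topic_entropy({"settlements": 5, "culture": 5})
skew = topic_entropy({"settlements": 9, "culture": 1})
print(even > skew)                # True: a balanced topic mix has higher entropy
print(deficit_index(12.0, 20.0))  # -0.4: 40% below the peer-group median
```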

Deliverables

  • Clean tables (pages_geo.parquet, unit_summary.parquet, topic_summary.parquet, norm_metrics.parquet).
  • Reproducible notebook + reports/geo_coverage_vs_need.md.
  • Dashboard: map + tables + topic filters.

Caveats & limitations

  • Geosearch sampling may miss short-lived or uncategorized pages; language coverage is uneven.
  • Country joins via P17/P131 can be incomplete; document fallback logic.
  • Per-capita normalization depends on the chosen year/source; pin the version and cite.

Implementation notes

  • Keys: (pageid) for pages; (h3,res) or (iso3) for units.
  • Maintain a query manifest; cache raw responses.
  • Provide a small hand-labeled set to validate topic mapping rules; report precision/recall.

chinaexpert1 — Sep 13 '25 21:09