Epic: MediaWiki Data Science Projects
Overview
I had GPT distill some interesting data science questions that can be answered using the MediaWiki API, particularly in culturally or economically sensitive areas. That response is in the collapsed section below. I have turned each one into a sub-issue for assignment to DS CoP members. -chinaexpert1
Action Items
- [x] Review MediaWiki API site here
- [x] Make MediaWiki Project Template
- [x] Make all the sub-issues
- [ ] Assign to new and existing members after onboarding
Resources/Instructions
15 MediaWiki questions proposed by GPT:
Great target API. The MediaWiki Action API (plus Wikibase on Wikidata) exposes edits, logs, links, categories, language links, coordinates, and more—perfect for sensitive, policy-relevant analyses. Here are concrete, non-toy questions you can answer with specific modules and an analysis sketch:
- Representation gaps in biographies • Question: How does the share of biographies by gender/region/occupation evolve over time? • How: Seed pages via `list=categorymembers`, get revision timestamps via `prop=revisions`, and enrich with Wikidata entity attributes (gender, country, occupation) via `action=wbgetentities`. Compare time trends and category coverage. ([MediaWiki]1)
- Language equity on sensitive topics • Question: Do key public-health, migration, or conflict pages exist and get updated across low-resource languages as quickly as in English? • How: For a topic set, pull interlanguage links with `prop=langlinks`, then per language fetch the latest `prop=revisions` (timestamps/size). Compute "time-to-translation" and update lag (see the Python sketch after this list). ([MediaWiki]2)
- Controversy & protection dynamics • Question: Which policy-sensitive pages (e.g., policing, elections) see page protection spikes? • How: Use `prop=info&inprop=protection` to snapshot protection; use `list=logevents` (letype=protect/unprotect) for change history; correlate with edit/revert bursts from `prop=revisions`. ([MediaWiki]3)
- Deletion patterns and notability bias • Question: Are articles about marginalized communities more likely to be nominated or deleted? • How: Mine `list=logevents` (letype=delete, move, restore) and, where permitted, `prop=deletedrevisions` to mark outcomes; stratify by topic/region from categories/Wikidata. ([MediaWiki]4)
- External link economy (payday lenders, crypto exchanges, gig platforms) • Question: Which sensitive industries get linked most, and where? • How: Use `list=exturlusage` for target domains; map to pages and track link addition/removal via page revisions. ([MediaWiki]5)
- Edit-war & revert intensity on culturally charged pages • Question: Which pages have the highest revert ratios, and by whom (anonymous vs. registered)? • How: Parse `prop=revisions` for users and SHA1 hashes; compute revert chains; optionally add ORES "damaging/goodfaith" scores for edit quality. ([MediaWiki]6)
- Gendered or respectful language drift over time • Question: Did terms like "illegal immigrant" decline in favor of "undocumented immigrant" across articles? • How: Pull a page's revision text via `prop=revisions` (`rvslots=main&rvprop=content`) and run n-gram time series; compare pre/post guidance changes. ([MediaWiki]6)
- Geospatial knowledge coverage vs. need • Question: Are locations in low-income regions under-documented? • How: Use GeoData (`prop=coordinates` / `list=geosearch`) to map article density per area; normalize by population/economic indicators. ([MediaWiki]7)
- Topic-network bias and agenda setting • Question: Do sensitive pages link mostly to a narrow set of sources or topics? • How: Build link graphs with `list=backlinks` and `prop=links`; compute centrality and modularity; flag echo chambers. ([MediaWiki]8)
- Watcher/attention asymmetry • Question: Do sensitive topics have unusually high/low watcher counts or strict protection? • How: `prop=info` (watchers, when allowed) plus `inprop=protection`; compare with edit volumes and revert rates. ([MediaWiki]3)
- Newcomer burden and gatekeeping • Question: Are newcomer edits on sensitive pages more likely to be reverted? • How: Join `list=usercontribs` for newcomer cohorts with page-level `prop=revisions`; add ORES damaging/goodfaith probabilities to quantify triage behavior. ([MediaWiki]9)
- Category-level under/over-coverage • Question: Which sensitive categories (e.g., Indigenous issues) are thin relative to peers? • How: `list=categorymembers` counts over time; compare article length, quality assessments (PageAssessments extension), and interlanguage presence. ([MediaWiki]1)
- Corporate pages around layoffs/scandals • Question: Do protection levels and revert rates spike around major corporate events? • How: For a company list, pull `prop=revisions` time series and `list=logevents` (protect/unprotect); test for structural breaks. ([MediaWiki]6)
- Cross-language topic presence for crisis terms • Question: Are emergency-response articles present and updated during disasters in smaller Wikipedias? • How: Use `prop=langlinks` to enumerate languages per key page; monitor `prop=revisions` timestamps per language. ([MediaWiki]2)
- Source-type auditing on sensitive pages • Question: Are certain outlet types (state media, tabloids) over-represented? • How: Extract references from wikitext in `prop=revisions` and join with `list=exturlusage` to classify URLs. ([MediaWiki]6)
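To make these concrete, here is a minimal Python sketch for the language-equity and cross-language items: it pulls interlanguage links for one seed page with `prop=langlinks`, then fetches the latest revision timestamp per language with `prop=revisions`, which is the raw material for an update-lag metric. The endpoint pattern, the `requests` dependency, the User-Agent string, and the example topic page are illustrative assumptions, not part of the GPT proposal.

```python
import requests

# Hypothetical User-Agent; replace with a real project contact per Wikimedia API etiquette.
HEADERS = {"User-Agent": "DS-CoP-mediawiki-sketch/0.1 (research)"}

def api_get(lang, **params):
    """Call the MediaWiki Action API for one language wiki and return parsed JSON."""
    params.update({"format": "json", "formatversion": 2})
    url = f"https://{lang}.wikipedia.org/w/api.php"
    resp = requests.get(url, params=params, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()

def langlinks(title, source_lang="en"):
    """Return {lang: local_title} interlanguage links for one page via prop=langlinks."""
    data = api_get(source_lang, action="query", prop="langlinks",
                   titles=title, lllimit="max")
    page = data["query"]["pages"][0]
    return {ll["lang"]: ll["title"] for ll in page.get("langlinks", [])}

def latest_revision_timestamp(lang, title):
    """Return the timestamp of the most recent revision via prop=revisions."""
    data = api_get(lang, action="query", prop="revisions", titles=title,
                   rvprop="timestamp", rvlimit=1)
    revs = data["query"]["pages"][0].get("revisions", [])
    return revs[0]["timestamp"] if revs else None

if __name__ == "__main__":
    topic = "Measles"  # example seed page; swap in the project's topic set
    for lang, local_title in sorted(langlinks(topic).items()):
        print(lang, local_title, latest_revision_timestamp(lang, local_title))
```

The same `api_get` helper should work for most of the other modules listed above (`list=categorymembers`, `list=logevents`, `list=exturlusage`, and so on) by swapping the query parameters.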
Notes & guardrails
• Many of these benefit from adding Wikidata attributes via the Wikibase API (`action=wbgetentities`); a minimal sketch is included below. ([MediaWiki]10)
• Use protection and deletion data responsibly; some items are redacted/suppressed. Start with aggregated reporting and avoid profiling individual editors (follow Wikimedia norms). ([MediaWiki]11)
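For that Wikidata enrichment step, a sketch along these lines (assuming the standard Wikidata endpoint, the `requests` library, and the P21 "sex or gender" property) batches item IDs through `action=wbgetentities` and extracts the claimed value; verify the property choice and claim parsing against the Wikidata data model before relying on it.

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
HEADERS = {"User-Agent": "DS-CoP-wikidata-sketch/0.1 (research)"}  # hypothetical UA

def get_gender_claims(qids):
    """Fetch items via action=wbgetentities and return {QID: gender item QID or None}.

    P21 is Wikidata's "sex or gender" property; its value is itself an item
    (e.g., Q6581072 for "female"), so a second lookup is needed for labels.
    """
    resp = requests.get(WIKIDATA_API, params={
        "action": "wbgetentities",
        "ids": "|".join(qids),   # up to 50 IDs per request
        "props": "claims",
        "format": "json",
    }, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    entities = resp.json().get("entities", {})

    out = {}
    for qid, ent in entities.items():
        claims = ent.get("claims", {}).get("P21", [])
        value = None
        if claims:
            snak = claims[0]["mainsnak"]
            if snak.get("snaktype") == "value":
                value = snak["datavalue"]["value"]["id"]
        out[qid] = value
    return out

# Example: print(get_gender_claims(["Q7259"]))  # Q7259 = Ada Lovelace
```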
GPT can turn any of the above into a scoped GitHub Epic with tasks (data pull, ETL, metrics, dashboards) and ready-to-run notebooks - this might be particularly useful.
Additional Topics Proposed (if needed):
- POV/Dispute Template Lifecycles: Track when pages gain/lose templates like `{{POV}}`, `{{Disputed}}`, `{{Advert}}`, and `{{COI}}`; measure dwell time and recurrence by topic (see the sketch after this list). Modules: `list=embeddedin`, `prop=revisions` (content), `prop=info`.
- Redirect & Naming Neutrality: Map redirect graphs for controversial terms (e.g., deprecated → neutral phrasing) and quantify which names "win" over time. Modules: `prop=redirects`, `list=backlinks` (redirects only), `prop=revisions`.
- Image/Media License Equity (Commons): For sensitive pages, analyze how many images exist, where they come from, and the license mix (CC BY/SA, ND/NC, PD). Modules: page `prop=images` (enwiki) → Commons `prop=imageinfo` with `iiprop=extmetadata`.
- Citation Freshness & Archive Coverage: Measure the share of dead or archived links, the age of citations, and adoption of `|archive-url=` / `|archive-date=` in references. Modules: `prop=revisions` (content); optionally `list=exturlusage` for present-day links.
- Protection Scope vs. Topic Size: Normalize protection prevalence by topic size: what share of pages in each sensitive topic are currently semi- or fully protected? Modules: `list=categorymembers`, `prop=info&inprop=protection`.
- Infobox Completeness Gaps: Within topic-specific infoboxes (e.g., health orgs, municipalities), quantify missing key fields across pages and regions. Modules: `prop=revisions` (content), template parsing.
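For the template-lifecycle item above, a reasonable first pass is simply enumerating the articles that currently transclude a cleanup template via `list=embeddedin`; dwell time then comes from walking each page's revision history. The sketch below follows the Action API's standard `continue` paging protocol; the template name, namespace, and User-Agent are example choices.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "DS-CoP-template-sketch/0.1 (research)"}  # hypothetical UA

def pages_embedding(template_title, namespace=0):
    """Yield titles of pages that transclude a template, via list=embeddedin.

    Follows the API's 'continue' tokens so large result sets are paged through.
    """
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": template_title,   # e.g. "Template:POV"
        "einamespace": namespace,    # 0 = article namespace
        "eilimit": "max",
        "format": "json",
        "formatversion": 2,
    }
    while True:
        resp = requests.get(API, params=params, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        for page in data["query"]["embeddedin"]:
            yield page["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # carry eicontinue/continue forward

if __name__ == "__main__":
    flagged = list(pages_embedding("Template:POV"))
    print(f"{len(flagged)} articles currently transclude {{{{POV}}}}")
```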
@salice @KarinaLopez19 @akhaleghi Please provide an update:
- Progress: "What is the current status of your project? What have you completed and what is left to do?"
- Blockers: "Difficulties or errors encountered."
- Availability: "How much time will you have this week to work on this issue?"
- ETA: "When do you expect this issue to be completed?"
- Pictures or links (if necessary): "Add any pictures or links that will help illustrate what you are working on."
- Remember to add links to the top of the issue if they will be needed again.