MediaWiki API Project: Underrepresented Groups
Overview
Measure representation gaps in Wikipedia biographies by gender, region, and occupation using the MediaWiki Action API (plus Wikidata), and track how those shares change over time. The output is a reproducible dataset, metrics, and a small dashboard highlighting trends and gaps.
Action Items
If this is the beginning (research & design)
- Define scope: start with English Wikipedia biographies; attributes via Wikidata properties (gender P21, country of citizenship P27, occupation P106; avoid inferring sensitive attributes).
- Time windows: monthly counts from 2015 → present (adjustable).
- Sampling plan: gather page lists from biography categories; map each page to its Wikidata QID; fetch attributes; collect creation date and selected revision timestamps.
- Schema: per-article row (pageid, title, qid, first_edit_ts, latest_edit_ts, gender, country, occupation).
- Tooling (pick one per pair): requests or httpx; pandas or polars; mwparserfromhell or wikitextparser; duckdb or sqlite; altair or matplotlib.
- Ethics & governance: aggregate reporting only; no editor-level analysis; document data caveats (missing/ambiguous attributes).
If researched and ready (implementation steps)
Data acquisition
- Enumerate biography pages via list=categorymembers on seed categories (e.g., Category:Living people, Category:20th-century births, and WikiProject Biography subcats).
- For each pageid, fetch QID via prop=pageprops (wikibase_item), then call Wikidata wbgetentities to retrieve P21, P27, P106.
- Fetch page creation date via prop=revisions&rvlimit=1&rvdir=newer and (optionally) monthly snapshots via rvstart/rvend.
Transform
- Normalize Wikidata values to readable labels; map countries to regions; standardize occupations to a small set (e.g., tech, arts, politics, sports, science).
- Build monthly aggregates: counts and shares by gender, region, occupation; growth rates and net-new pages.
Analyze
- KPIs: share by attribute over time; YoY change; Gini/entropy of occupation mix; coverage by region vs global population (optional).
- Confidence: bootstrap intervals on shares; flag small-n months.
Deliver
- Artifacts: biographies_raw.parquet, biographies_monthly.parquet, metrics.csv.
- Dashboard (Altair/Streamlit): time series, small-multiple by attribute; top deltas.
- Methods README with API queries, caveats, and reproducibility notes.
Quality & Ops
- Add retry/backoff; cache responses; save query manifests.
- Tests: schema checks, null-rate thresholds, and attribute mapping tests.
- Optional monthly refresh via GitHub Actions (scheduled workflow).
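The retry/backoff item under Quality & Ops could be sketched roughly as below. This is only an illustration, not the project's actual code: the helper name retry_with_backoff, the default of 5 attempts, and the 10% jitter are all assumptions. The wrapped call is passed in as a function so the helper works with either requests or httpx.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_tries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on an exception, wait and retry with exponentially growing delays."""
    for attempt in range(max_tries):
        try:
            return call()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of attempts: surface the last error
            # Delays of 1s, 2s, 4s, ... plus up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise RuntimeError("max_tries must be at least 1")


# Hypothetical usage with requests (session, API_URL and params are placeholders):
# data = retry_with_backoff(lambda: session.get(API_URL, params=params, timeout=30).json())
```

In a real pipeline it would probably be better to catch only network and HTTP errors rather than every exception; the broad except is kept here for brevity.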
Resources/Instructions
Docs & endpoints (put these in your repo README)
MediaWiki Action API (overview): https://www.mediawiki.org/wiki/API:Action_API
Category members: https://www.mediawiki.org/wiki/API:Categorymembers
Revisions: https://www.mediawiki.org/wiki/API:Revisions
Pageprops (get Wikidata QID): https://www.mediawiki.org/wiki/API:Pageprops
English Wikipedia endpoint: https://en.wikipedia.org/w/api.php
Wikidata (Wikibase API): https://www.wikidata.org/w/api.php
wbgetentities: https://www.wikidata.org/wiki/Special:ApiHelp/wbgetentities
Suggested libraries (choose pairs where helpful)
- HTTP: requests | httpx
- Frames: pandas | polars
- Parsing: mwparserfromhell | wikitextparser
- Storage: duckdb | sqlite
- Viz: altair | matplotlib
Sample queries to copy into your notes
# List category members (biographies seed)
action=query&list=categorymembers&cmtitle=Category:Living_people&cmlimit=500
# Page → Wikidata QID
action=query&prop=pageprops&ppprop=wikibase_item&pageids=<PAGEID>
# Page creation timestamp
action=query&prop=revisions&rvprop=timestamp&rvlimit=1&rvdir=newer&pageids=<PAGEID>
# Wikidata entity fetch
action=wbgetentities&ids=Q42&props=claims|labels
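As a rough sketch, the "Page → Wikidata QID" query above could be turned into Python like this. The helper names (chunked, pageprops_params, extract_qids) are made up for illustration. The batch size of 50 is the usual per-request ID limit for clients without bot-level limits, and maxlag=5 is the commonly recommended value for the Action API's replication-lag parameter. The parser assumes the default JSON layout (formatversion=1), where pages is a dict keyed by page ID.

```python
from typing import Dict, Iterable, Iterator, List

ENWIKI_API = "https://en.wikipedia.org/w/api.php"  # endpoint listed above


def chunked(ids: List[int], size: int = 50) -> Iterator[List[int]]:
    """Yield successive batches of page IDs (50 per request for non-bot clients)."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def pageprops_params(pageids: Iterable[int]) -> Dict[str, str]:
    """Build parameters for the pageprops query for one batch of page IDs."""
    return {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "pageids": "|".join(str(p) for p in pageids),  # IDs are pipe-separated
        "format": "json",
        "maxlag": "5",
    }


def extract_qids(response: dict) -> Dict[int, str]:
    """Map pageid -> QID; pages without a wikibase_item are skipped."""
    pages = response.get("query", {}).get("pages", {})
    qids: Dict[int, str] = {}
    for page in pages.values():
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid is not None:
            qids[page["pageid"]] = qid
    return qids
```

Each batch's parameters would then be sent to ENWIKI_API with requests or httpx; pages that come back without a QID are the ones that later surface as missing attributes.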
Data handling notes
- Respect API etiquette: batching, maxlag, and backoff on errors.
- Store raw JSON responses for reprocessing; log all parameters per request.
- Avoid person-level conclusions; report only aggregates (≥20 items per cell recommended).
- If this issue requires access to 311 data, please answer the following questions:
- Not applicable.
- N/A
- N/A
- N/A
Project Outline (detailed plan for this idea) is in the collapsed details below:
Research question: How does the share of Wikipedia biographies by gender, region, and occupation evolve over time, and where are the largest representation gaps?
Data sources & modules
- list=categorymembers to seed biography pages.
- prop=pageprops to get wikibase_item (QID).
- action=wbgetentities to fetch attributes (P21 gender, P27 country, P106 occupation).
- prop=revisions for creation and monthly timestamps (optional).
Method
- Build the article universe from biography categories (document your exact seed lists).
- Map each article to QID; pull attributes; normalize labels and regions.
- Compute monthly panel: counts and shares by gender, region, occupation; growth and churn.
- Visualize trends; identify statistically significant shifts (bootstrap CIs).
- Publish CSVs and a dashboard; write a short narrative with key insights and caveats.
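The bootstrap CIs mentioned in the Method list could be computed along these lines. This is a minimal percentile-bootstrap sketch using only the standard library. The function name, the 2,000 resamples, and the fixed seed are assumptions for illustration, not the project's settings.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_share_ci(
    labels: Sequence[str],
    target: str,
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (share, lower, upper) for target's share of labels (percentile bootstrap)."""
    if not labels:
        raise ValueError("labels must be non-empty")
    rng = random.Random(seed)  # seeded so reruns give the same interval
    n = len(labels)
    hits = [1 if lab == target else 0 for lab in labels]
    point = sum(hits) / n
    shares: List[float] = []
    for _ in range(n_boot):
        resample = rng.choices(hits, k=n)  # sample n items with replacement
        shares.append(sum(resample) / n)
    shares.sort()
    lower = shares[int((alpha / 2) * n_boot)]
    upper = shares[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lower, upper
```

Run once per month and attribute value, a wide interval is also a practical way to flag small-n months.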
Key metrics
- Share by gender/region/occupation per month.
- Month-over-month and YoY deltas; top rising/falling occupations.
- Concentration (entropy/Gini) of occupation mix.
- (Optional) Coverage vs external baseline (e.g., world region population), clearly sourced.
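For the concentration metric, one possible reading of "entropy/Gini of occupation mix" is Shannon entropy of the bucket shares and a Gini coefficient over the bucket counts. The sketch below takes that reading; the function names are illustrative and the input is assumed to be a dict of occupation bucket → article count.

```python
import math
from typing import Dict


def shannon_entropy(counts: Dict[str, int]) -> float:
    """Entropy in bits of the category mix; higher means a more even mix."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
    )


def gini(counts: Dict[str, int]) -> float:
    """Gini coefficient of category counts: 0 = perfectly even, (n-1)/n = all in one."""
    values = sorted(counts.values())
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0:
        raise ValueError("counts must be non-empty with a positive sum")
    # Standard rank-weighted formula on ascending-sorted values.
    weighted = sum((rank + 1) * v for rank, v in enumerate(values))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

Both numbers depend on how many buckets exist, so they are only comparable across months if the occupation bucketing is held fixed.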
Deliverables
- Clean datasets (raw, entities, monthly, metrics).
- Reproducible notebook + reports/representation_gaps.md.
- Streamlit/Altair dashboard with time series and filters.
Ethics & limitations
- Use Wikidata-stated attributes only; do not infer sensitive traits.
- Aggregate reporting; suppress small cells; document missingness and ambiguity.
- Clearly state that Wikipedia coverage reflects contributor behavior and not ground truth.
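Small-cell suppression could be as simple as the sketch below, which blanks any aggregate cell under the ≥20-item threshold recommended in the data handling notes. The function name and the (month, attribute value) key shape are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple

MIN_CELL = 20  # threshold recommended in the data handling notes

Cell = Tuple[str, str]  # assumed key shape: (month, attribute value)


def suppress_small_cells(
    cells: Dict[Cell, int], min_count: int = MIN_CELL
) -> Dict[Cell, Optional[int]]:
    """Replace counts below min_count with None so they are not published."""
    return {key: (n if n >= min_count else None) for key, n in cells.items()}
```

Suppressed cells stay in the output as None, so the missingness remains visible instead of silently disappearing.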
- Progress:
Got the pipeline up and running: pulling Wikipedia bios, grabbing Wikidata info (gender + occupation), and saving everything into a clean dataset. Tested it on a small batch just to make sure all the steps work end-to-end. Data looks good so far: almost no missing genders, only a couple missing occupations. Focus for this stage has been prototype validation rather than scale
- Blockers:
The dataset’s still small, so I can’t really do much analysis yet. Need to scale up. Occupations are very music-heavy (lots of rappers/singers). That might just be because of the categories I used, or because the sample’s small. Will need to dig into this more once I’ve got more data. A few occupations are still missing (2–3%), so I’ll need to beef up the enrichment process later.
- Availability: 40
- ETA:
Plan for next week is to grab way more bios (1,000+), and maybe add extra categories if needed (athletes, scientists, etc.) so it’s not just full of musicians. Also going to start working on grouping occupations — so things like “rapper/singer/songwriter” get bucketed together under something like “musician/artist.” Once I’ve got a bigger and more balanced sample, I’ll start looking at trends over time and eventually move into building dashboards
- Pictures:
Screenshot of the pipeline output so far. Looks like the data pipeline is working end-to-end.
- Progress:
- Scaled it way up — we’re now at ~1.1M bios all enriched and cleaned.
- Gender, occupation, and country normalization are in place, and I added buckets (like grouping all sports/arts/politics, etc.), so it’s much easier to see patterns.
- Pulled in all the countries and fixed up the region mapping, so the dataset isn’t drowning in ‘unknowns’.
- Did some patching on edge cases (historic states, odd occupations), for more consistency.
tldr; last week: does the pipeline even work? Today: here’s a proper dataset we can analyze.
- Blockers:
- A chunk of entries don’t have any citizenship/country in Wikidata at all (no QIDs, empty lists), so they show up as Unknown even after patching for historic labels/territories and micro-regions. It’s ~18% of the dataset right now, which is a little more than ideal but doesn’t necessarily block analysis.
- Options: try inferring the country/region from other properties (place of birth, work location, citizenship, or place of death as a last resort) to reduce unknowns, OR keep Unknown visible as its own category and move on.
- Availability: 30
- ETA:
- This week will add the time series — pulling in creation dates so we can actually see how these gaps shift over time.
- Overall, I’ll start running more detailed breakdowns/analysis and then move on to visualizations.
- Tentative completion date: 10/13/2025
- Initial insights: Occupation category: Sports ~45%, Arts & Entertainment ~21%, Politics & Law ~13%. Gender overall: male 73.9%, female 25.7%, and 0.4% NB/trans categories/unknown.
Picture:
- Progress:
- Spent a lot of time debugging the API calls. I found and fixed two major bugs that were stopping the script from getting the QIDs and timestamps correctly
- The timestamp collection is currently running (it's taking a while; have to request every page one-by-one)
- Blockers
- No real blockers right now. The main issue I ran into was when I tried to mess with the pipeline to add the logic for inferring missing country data (from place of birth, etc.). It messed things up and forced a restart of the data collection.
- Went back and forth with Gemini for help this time to get the process rebuilt correctly, and now it's running smoothly.
- Availability: 25
- ETA:
- Once the timestamp collection is done, enriching and normalizing should not take too long, as I am a lot more familiar with the process now.
- Tentative full completion date: 10/20/2025
- Progress:
- Timestamp collection, enriching, normalizing, and the initial dashboard are done. Need some feedback on the dashboard for improvements if needed.
- Implemented a working monthly incremental refresh pipeline using recentchanges to pull newly created biographies. Set up checkpointing so only biographies created after the last run are fetched.
- README with: project overview, monthly refresh instructions, data source documentation, and clear caveats & methodological notes (occupation bucketing, gender grouping, country-region mapping, filtering logic, etc.).
- Blockers: will need to learn how to set up a scheduled workflow using GitHub Actions to run the refresh automatically at the start of each month.
- Availability: 20
- ETA:
- Finalize cleanups/documentation/dashboard
- Set up a scheduled workflow using GitHub Actions to run the refresh automatically at the start of each month.
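For reference, the checkpointed recentchanges refresh described in this update might look something like the sketch below. It is not the actual pipeline code: the checkpoint file layout ({"last_run": ...}), the function names, and the default start date are assumptions. The parameters shown (rctype=new, rcnamespace=0, rcstart, rcdir=newer) are standard list=recentchanges options.

```python
import json
from pathlib import Path
from typing import Dict


def load_checkpoint(path: Path, default: str = "2015-01-01T00:00:00Z") -> str:
    """Return the ISO 8601 timestamp of the last successful run, or a default."""
    if path.exists():
        return json.loads(path.read_text())["last_run"]
    return default


def save_checkpoint(path: Path, timestamp: str) -> None:
    """Record the timestamp up to which new pages have been fetched."""
    path.write_text(json.dumps({"last_run": timestamp}))


def new_pages_params(since: str) -> Dict[str, str]:
    """Parameters for article-namespace page creations at or after `since`."""
    return {
        "action": "query",
        "list": "recentchanges",
        "rctype": "new",      # page creations only
        "rcnamespace": "0",   # main/article namespace
        "rcstart": since,     # with rcdir=newer this is the oldest timestamp
        "rcdir": "newer",
        "rcprop": "title|ids|timestamp",
        "rclimit": "500",
        "format": "json",
    }
```

Two caveats: recentchanges returns all new articles, so biographies still have to be filtered downstream (for example via the Wikidata attributes), and the feed only keeps a limited window of history, so a run that is skipped for too long may need a category-based backfill.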
Hi Ashik,
One idea for your report would be to connect the underrepresentation of women on Wikipedia to broader trends of chauvinism or misogyny in America over time. Check out that angle.
Progress: Snapshot of reports/representation_gaps.md
2. Gender Representation
A modest improvement since 2015 is visible. Between 2015 and 2025, the male share declined from ≈ 72% to 65% (a 7 percentage point, or pp, drop), which was almost entirely absorbed by a corresponding rise in the female share from ≈ 28% to 34%. (A percentage point is the simple arithmetic difference between two percentages; a drop from 72% to 65% is a 7pp change). Non-binary representation, while still below 1%, has tripled since 2018.
This 7pp improvement coincides with peak #MeToo awareness (2017-2019) and overlaps with Hillary Clinton's 2016 presidential campaign and Kamala Harris's 2020 vice-presidential election—suggesting Wikipedia responds to, but doesn't lead, cultural shifts in valuing women's contributions. However, this slow narrowing of the gap also highlights the persistence of the underlying asymmetry. The disparity remains largest in historically male-centric domains such as sports, politics, and the military, where definitions of notability are rigidly tied to professional achievements, competitive rankings, or high office—domains from which women were long excluded, resulting in a profoundly skewed source record.
3. Wikipedia Bias as a Mirror of American Misogyny
Wikipedia's gender gaps don't exist in isolation—they reflect and reinforce broader patterns of American cultural chauvinism over the past decade.
The 2016 Presidential Campaign & Initial Backlash
Hillary Clinton's historic 2016 presidential run coincided with the start of our data window. Despite being the first woman nominated by a major party, female biography share remained at only 28% (2015-2016). This suggests that even high-visibility political milestones don't automatically translate to improved representation—the structural barriers remain intact.
The #MeToo Effect (2017-2019)
- Female biography share increased from 28% (2015) to 32% (2019)—a 4pp gain in just 4 years
- This aligns with peak #MeToo activism (October 2017 onward) when women's stories gained mainstream visibility
- Arts & Culture showed particularly sharp gains during this period, reflecting increased media attention to women's contributions in entertainment and creative fields
The Backlash Era (2020-2025)
- Progress stalled: Female share plateaued at ~34% (only 2pp gain in 6 years)
- Despite Kamala Harris becoming the first female, Black, and South Asian Vice President (2021), the momentum from 2017-2019 dissipated
- This mirrors:
- Rise of anti-"woke" rhetoric (2020-present)
- Attacks on DEI initiatives (2022-2024)
- Post-Dobbs rollback of reproductive rights (2022)
- Conservative redefinition of women's roles in public discourse
Key Finding: The gap narrowed fastest during peak feminist activism, then stabilized during cultural backlash—suggesting Wikipedia representation is reactive to, not independent of, broader gender politics. Even historic "firsts" like Harris's vice presidency didn't reverse the trend, indicating that symbolic victories without sustained cultural momentum have limited impact on systemic representation.
4. Occupational Composition and Gender Gaps

The "Notability" Double Standard
These occupational gaps expose how Wikipedia's supposedly neutral "notability" criteria encode historical chauvinism:
Military (95% male): Combat exclusion kept women out of military leadership until 2015. Wikipedia now documents this male-dominated past—but treats it as neutral history rather than systematic exclusion. The result: decades of all-male military leadership are codified as evidence of greater male "notability" rather than evidence of discrimination.
Sports (90% male): Despite Title IX (1972), women's sports remain underfunded and undercovered by media. Wikipedia's gap mirrors media bias: if ESPN doesn't cover women's sports, there are fewer "reliable sources" to cite. The platform then treats this media neglect as proof that women's athletic achievements are less notable.
Religion (85% male): Major world religions restrict women from leadership. Wikipedia documents this status quo without questioning whether male religious figures are inherently more "notable" than the women systematically excluded from power. Structural misogyny becomes encoded as theological fact.
Politics & Law (75% male): Despite record numbers of women running for office (2018 "Year of the Woman," 2020 Harris VP win), the gap barely moved (–4 pp). This suggests that even when women achieve political prominence, they face higher bars for Wikipedia inclusion—a reflection of broader "likability" penalties and electability concerns that plague women candidates.
The Pattern: Fields where women were formally or informally barred show the widest gaps. Wikipedia treats historical exclusion as evidence of lower "notability" rather than evidence of discrimination. This is structural chauvinism masquerading as objectivity.
5. Geographic Representation and Continental Gaps

Wikipedia's geographic footprint is sharply uneven. Five countries — United States, United Kingdom, India, France, and Germany — generate over 50% of all new biographies since 2015. This illustrates a structural under-coverage tied to language, editor demographics, and editorial accessibility.
At the continental level, the imbalance is stark:
- Europe + North America: ≈ 60 % of biographies
- Asia: ≈ 25 % (but ≈ 60 % of world population)
- Africa: ≈ 8 %
- Oceania + South America: ≈ 7 %
This geographic bias compounds the gender gap. A female subject from an under-represented region (e.g., a politician in Africa or an academic in Southeast Asia) faces a "double gap," requiring a far higher threshold of notability and source availability than a male counterpart in Europe or North America.
American Exceptionalism and Gender
The US dominates biographical coverage (19.6% of all articles), but American women face a double bind:
- Domestic bias: American culture's own gender hierarchies (pay gaps, political underrepresentation, "likability" penalties for women leaders) mean fewer women reach the visibility threshold for Wikipedia coverage. The 2016 and 2020 elections showed that even women reaching the highest levels of American politics (Clinton's nomination, Harris's vice presidency) face intense scrutiny and media negativity that their male counterparts don't, resulting in fewer "positive" reliable sources.
- Export of bias: As the largest Wikipedia language community, English Wikipedia's American-centric notability standards become global gatekeepers. A female Indian scientist must meet American media's definition of "importance", a standard that already undervalues women. If The New York Times or BBC don't cover her work, she likely won't meet notability criteria, regardless of her impact in India.
This is cultural imperialism compounding gender bias: America exports its own chauvinistic notability standards worldwide.
To visualize this proportional bias, a representation-gap index was computed (Biography % – Population %). This "pp" value shows how many percentage points a continent's share of biographies is above (a positive value) or below (a negative value) its share of the world population.
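A small sketch of the index as defined above (Biography % minus Population %). The function name is illustrative. The example call plugs in the rounded Asia figures from the continental list (≈25% of biographies, ≈60% of world population), so the report's exact inputs and result may differ.

```python
from typing import Dict


def representation_gap(
    bio_share_pct: Dict[str, float], pop_share_pct: Dict[str, float]
) -> Dict[str, float]:
    """Gap in percentage points: positive = over-represented, negative = under-represented."""
    return {
        region: round(bio_share_pct[region] - pop_share_pct[region], 1)
        for region in bio_share_pct
        if region in pop_share_pct  # skip regions with no population baseline
    }


# Rounded figures from the list above; yields a gap of -35.0 pp for Asia.
example = representation_gap({"Asia": 25.0}, {"Asia": 60.0})
```

Regions without a population baseline are left out of the result rather than guessed.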
7. Summary of Key Insights
- Gender bias reflects cultural misogyny: The 2:1 male-to-female ratio persists because Wikipedia's "neutral" policies encode historical exclusion. Notability standards privilege fields (military, sports, politics) where women were systematically barred, then treat that male dominance as evidence of greater importance. This is structural chauvinism masquerading as objectivity.
- Gaps are "sticky": The largest gender deltas are in Sports (+82 pp) and Military (+91 pp), and these gaps have barely changed. The most progress is in Arts & Culture (–6 pp) and Agriculture (–8 pp).
- Occupational dominance: Four fields (Sports, Arts, Politics, STEM) monopolize ≈ 98% of biographical attention, marginalizing other human endeavors.
- Bias is intersectional: Geographic and gender biases compound each other. A non-male subject from the Global South faces a "double barrier" to inclusion.
- Geographic imbalance is severe: Europe and North America account for ~60% of entries. Asia is under-represented by a staggering –40 pp relative to its population.
- Gaps are independent of volume: Fluctuations in article creation (like the 2020-2022 decline) had no meaningful effect on the proportions of representation. Equity requires intent, not just volume.
- Timeline mirrors American gender politics: Progress accelerated during #MeToo (2017-2019), coinciding with peak awareness of women's issues. It then stalled during the anti-feminist backlash (2020-2025), even as Kamala Harris broke barriers. Wikipedia doesn't just document history; it absorbs and amplifies contemporary gender battles.
9. Conclusion
From 2015 to 2025, Wikipedia's biography corpus expanded but failed to diversify in a meaningful way. The fundamental distribution of visibility has changed very little: Men, Western professions, and Euro-American regions still dominate the historical record.
The issue is not quantitative; it is qualitative and structural. Achieving representational parity will require a fundamental shift away from passive, quantitative growth toward active, qualitative editorial diversification. This must involve interrogating the very systems that define who counts as "notable," addressing the demographic skew of the editor community, and proactively surfacing and translating voices from the Global South.
The Misogyny of "Neutrality"
Wikipedia's most insidious bias isn't overt sexism—it's the claim of objectivity. By treating historical male dominance as neutral fact rather than the product of systematic exclusion, Wikipedia naturalizes gender inequality. When notability criteria favor fields women were barred from entering, that's not neutral—that's laundering misogyny through bureaucratic process.
The American dimension matters because English Wikipedia's scale makes US cultural biases—about whose lives matter, which achievements count—into global defaults. America's unfinished reckoning with gender inequality doesn't just shape domestic Wikipedia coverage; it exports a template of chauvinism that marginalizes women worldwide.
The data shows a clear pattern: representation improved during moments of feminist cultural prominence (Clinton's campaign, #MeToo, Harris's election), then stagnated when cultural attention shifted elsewhere. This suggests Wikipedia is not a neutral archive but a live wire connected to American political currents. When the culture wages war on "wokeness" and dismantles DEI, Wikipedia's representation gaps stop narrowing.
True equity requires naming this bias for what it is: not a gap to be slowly closed through "more articles," but a structural commitment to valuing men's lives and achievements above women's. Until Wikipedia interrogates its own complicity in perpetuating these hierarchies, representation will remain symbolic at best.
There will be individual static charts here for each section (just placeholders for now);
Updated the dashboard;
- Blockers: None
- Availability ~ 10
- ETA
- Dashboard png:
- Progress: Included statistical analysis and the intersectional analysis. The statistical one digs into things like interrupted time series and changepoint detection to see where we're actually seeing significant shifts in representation. The intersectional piece looks at how different demographic factors interact with each other. I updated the representation_gaps.md file with the findings from both of these, so everything's documented there now. Also started putting together the presentation deck to share what we found.
New findings: The gender gap is stuck at around 68.6% male despite the whole #MeToo movement. We actually found that Wikipedia was improving faster before #MeToo (+3.2 percentage points per year, statistically significant), and our changepoint detection flagged 2017 and 2023 as major shifts in the data. The "pipeline problem" excuse doesn't hold up - we looked at birth cohorts and found that people born in the 1990s-2000s have basically the same 47pp gender gap as people born in the 1970s-1980s. So this isn't a generational thing that'll fix itself. Occupational gaps are extreme - Military is 95% male, Sports 90% male, Politics 75% male. And when you look at intersections, it gets worse: female military subjects in Europe (the most favorable conditions) are still 10.5× less likely to have a Wikipedia page than their male counterparts. Geographic concentration actually quadrupled over the decade. Europe is about 4× over-represented relative to population, while Asia is 66% under-represented. More content didn't fix the bias - it made it worse.
- Blockers: None
- Availability: ~10
- ETA: PowerPoint presentation should be ready by next week.
- Progress: Presented today (11/17).
- Blockers: None
- Availability: ~10
- Next step: Create a wiki page for the project overview (detailed article page, with a link to the PowerPoint), update the PowerPoint with the logo, and upload it to Gdrive.
- Progress: Wiki page: https://github.com/hackforla/data-science/wiki/Wikipedia-Representation-Gaps (needs review)
- Blockers: None
- Availability: 3-5