mwparserfromhell
Has trouble telling sections apart on "Barack Obama"
To reproduce:
import mwparserfromhell
import requests

# Fetch the raw wikitext of the article
obama = requests.get("https://en.wikipedia.org/wiki/Barack_Obama?action=raw").text
parsed = mwparserfromhell.parse(obama)

# Ask for level-2 sections only, then list every heading found inside each one
sections = parsed.get_sections(levels=[2])
for section in sections:
    print(section.filter_headings())
This results in:
['==Early life and career==', '===Education===', '===Family and personal life===', '===Religious views===']
['==Legal career==', '===Civil rights attorney===']
['==Legislative career==', '===Illinois Senate (1997–2004)===', '===2004 U.S. Senate campaign in Illinois===', '===U.S. Senate (2005–2008)===']
['==Presidential campaigns==', '===2008===', '===2012===']
['==Presidency (2009–2017)==', '===First 100 days===', '===Domestic policy===', '====Racial issues====', '====LGBT rights====', '===== Same-sex marriage =====', '====Economic policy====', '====Environmental policy====', '====Health care reform====', '===Foreign policy===', '====War in Iraq====', '====Afghanistan and Pakistan====', '=====Killing of Osama bin Laden=====', '====Relations with Cuba====', '====Israel====', '====Libya====', '====Syrian civil war====', '====Iran nuclear talks====', '====Russia====']
['==Cultural and political image==', '=== Job approval ===', '===Foreign perceptions===', '=== Thanks, Obama ===', '==Post-presidency (2017–present)==', '==Legacy and recognition ==', '===Presidential library===', '=== Awards and honors ===', '===Eponymy===', '==Bibliography==', '===Books===', '===Audiobooks===', '===Articles===', '==See also==', '===Politics===', '===Other===', '===Lists===', '==Notes==', '==References==', '===Bibliography===', '==Further reading==', '==External links==', '===Official===', '===Other===']
There are more level-2 headings in the article, but get_sections() stops splitting after "Cultural and political image" and lumps the rest of the article into that section.
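For comparison, here is a rough way to list the level-2 headings straight from the raw text, without going through the parser. This is only a sketch: the regex and the level2_titles name are my own and not part of mwparserfromhell, it assumes each heading sits alone on its line, and the sample is just a few of the headings from the output above.

```python
import re

# A level-2 heading has exactly two '=' on each side and sits alone on its line.
# The lookarounds reject '===' and deeper headings.
LEVEL2_HEADING = re.compile(r"^==(?!=)\s*(.+?)\s*(?<!=)==[ \t]*$", re.MULTILINE)


def level2_titles(wikitext):
    """Return the titles of all level-2 headings found in the raw wikitext."""
    return LEVEL2_HEADING.findall(wikitext)


# A few headings copied from the output above (hypothetical stand-in for the article)
sample = "\n".join([
    "==Cultural and political image==",
    "=== Job approval ===",
    "==Post-presidency (2017–present)==",
    "==Legacy and recognition ==",
])

print(level2_titles(sample))
```

If this list is longer than len(sections), the parser has merged some sections.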
This edit doesn't fix the issue, so it's not quite that basic.
There is an unbalanced ''Gallup. Inc''' (two apostrophes to open, three to close) on the line starting with "Obama's approval rating fell to 38 percent". If you remove the extra apostrophe, the sections are split correctly. The underlying cause is #40.
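If anyone wants to hunt for similar spots in other articles, something like the following might help. It is a heuristic sketch, not how mwparserfromhell tokenizes: the find_unbalanced_quote_lines name is made up, the sample lines are invented for illustration (not quoted from the article), and runs of four or more than five apostrophes are simply ignored.

```python
import re

# Runs of two or more apostrophes are wikitext quote markup:
# '' toggles italics, ''' toggles bold, ''''' toggles both.
APOSTROPHE_RUN = re.compile(r"'{2,}")


def find_unbalanced_quote_lines(wikitext):
    """Return (line_number, line) pairs with an odd number of italic or bold toggles."""
    suspicious = []
    for number, line in enumerate(wikitext.splitlines(), start=1):
        italic = bold = 0
        for run in APOSTROPHE_RUN.findall(line):
            if len(run) in (2, 5):
                italic += 1
            if len(run) in (3, 5):
                bold += 1
        if italic % 2 or bold % 2:
            suspicious.append((number, line))
    return suspicious


# Invented sample text, only the second line has mismatched markup
sample = "\n".join([
    "== Job approval ==",
    "According to ''Gallup. Inc''', the rating fell.",
    "According to ''Gallup, Inc'', the rating fell.",
    "A '''bold''' and ''italic'' line.",
])

for number, line in find_unbalanced_quote_lines(sample):
    print(number, line)
```

Single apostrophes (as in "Obama's") are not counted, so ordinary possessives do not trigger a false positive.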