mwparserfromhell

Has trouble telling sections apart on "Barack Obama"

Open harej opened this issue 1 year ago • 2 comments

To reproduce:

import mwparserfromhell
import requests

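# Fetch the raw wikitext of the article
obama = requests.get("https://en.wikipedia.org/wiki/Barack_Obama?action=raw").text
parsed = mwparserfromhell.parse(obama)
# Split the article into its level-2 sections
sections = parsed.get_sections(levels=[2])

# Each section should start with exactly one level-2 heading
for section in sections:
    print(section.filter_headings())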

This results in:

['==Early life and career==', '===Education===', '===Family and personal life===', '===Religious views===']
['==Legal career==', '===Civil rights attorney===']
['==Legislative career==', '===Illinois Senate (1997–2004)===', '===2004 U.S. Senate campaign in Illinois===', '===U.S. Senate (2005–2008)===']
['==Presidential campaigns==', '===2008===', '===2012===']
['==Presidency (2009–2017)==', '===First 100 days===', '===Domestic policy===', '====Racial issues====', '====LGBT rights====', '===== Same-sex marriage =====', '====Economic policy====', '====Environmental policy====', '====Health care reform====', '===Foreign policy===', '====War in Iraq====', '====Afghanistan and Pakistan====', '=====Killing of Osama bin Laden=====', '====Relations with Cuba====', '====Israel====', '====Libya====', '====Syrian civil war====', '====Iran nuclear talks====', '====Russia====']
['==Cultural and political image==', '=== Job approval ===', '===Foreign perceptions===', '=== Thanks, Obama ===', '==Post-presidency (2017–present)==', '==Legacy and recognition ==', '===Presidential library===', '=== Awards and honors ===', '===Eponymy===', '==Bibliography==', '===Books===', '===Audiobooks===', '===Articles===', '==See also==', '===Politics===', '===Other===', '===Lists===', '==Notes==', '==References==', '===Bibliography===', '==Further reading==', '==External links==', '===Official===', '===Other===']

There are more level-2 headers in the article, but get_sections() stops splitting after "Cultural and political image" and lumps the rest of the article into that section.
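One way to confirm the mismatch without relying on the parser is to count level-2 headings directly in the raw wikitext. The following is only a rough sketch: the helper name `level2_headings` and the regular expression are assumptions made for illustration, not part of mwparserfromhell, and the pattern ignores edge cases such as headings inside comments or `<nowiki>` blocks.

```python
import re

# Hypothetical helper: a line counts as a level-2 heading when it starts
# and ends with exactly two "=" characters.
LEVEL2 = re.compile(r"^==(?!=)(.+?)(?<!=)==[ \t]*$", re.MULTILINE)

def level2_headings(wikitext):
    """Return the titles of all level-2 headings found by the regex."""
    return [title.strip() for title in LEVEL2.findall(wikitext)]

# Small made-up sample, not taken from the article
sample = "==A==\nfoo\n===B===\n== C ==\n"
print(level2_headings(sample))
```

Comparing `len(level2_headings(obama))` with `len(sections)` from the reproduction script would show how many level-2 sections were merged.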

harej avatar Sep 10 '24 23:09 harej

This edit doesn't fix the issue, so it's not quite that basic.

harej avatar Sep 10 '24 23:09 harej

There is ''Gallup. Inc''' (two opening apostrophes but three closing ones) on the line starting with "Obama's approval rating fell to 38 percent". If you remove the extra apostrophe, it will work. The cause is #40
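To locate this kind of markup in a page before parsing, a simple heuristic is to flag lines whose runs of two or more apostrophes do not pair up. This is a minimal sketch under stated assumptions: the function name `suspicious_quote_lines` is hypothetical, the sample text is made up, and MediaWiki's real apostrophe handling is more involved, so false positives are possible.

```python
import re

def suspicious_quote_lines(wikitext):
    """Return (line_number, line) pairs whose apostrophe runs do not pair up."""
    flagged = []
    for number, line in enumerate(wikitext.splitlines(), start=1):
        # Lengths of every run of two or more apostrophes on this line
        runs = [len(run) for run in re.findall(r"'{2,}", line)]
        # Heuristic: runs should come in consecutive pairs of equal length
        pairs_match = len(runs) % 2 == 0 and all(
            runs[i] == runs[i + 1] for i in range(0, len(runs), 2)
        )
        if not pairs_match:
            flagged.append((number, line))
    return flagged

# Made-up sample text; only the second line has mismatched markup
sample = (
    "'''Bold''' and ''italic''\n"
    "rated by ''Gallup. Inc''' in 2014\n"
    "plain text\n"
)
for number, line in suspicious_quote_lines(sample):
    print(number, line)
```

Running it over the raw wikitext from the reproduction script narrows down which lines to inspect by hand. It does not fix the parser behavior itself.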

lahwaacz avatar Nov 15 '24 11:11 lahwaacz