mwparserfromhell

Parser misses the infobox on the current Lombardy page (possibly because of a comment in the name?)

Open ramayer opened this issue 4 years ago • 2 comments

The mwparserfromhell parser is missing some infoboxes, such as the one on the current Lombardy page (https://en.wikipedia.org/wiki/Lombardy).

I suspect it's because someone put a comment in the infobox's first field, like this:

{{Infobox settlement 
 <!-- See Template:Infobox settlement for additional fields and descriptions -->| name                            = Lombardy 
 | official_name                   =  
 | native_name                     = {{native name|it|Lombardia}}<br/>{{lang|lmo|Lombardia}} 
 | native_name_lang                =  
 | settlement_type                 = [[Region of Italy]] 
 ...
}}

This is the code I used. The table tmp_wikipedia contains just the original title and body from last week's Wikipedia dump.

import mwparserfromhell

lombardy = spark.sql('''select body from tmp_wikipedia where title = 'Lombardy' limit 1''').take(1)[0].asDict(True)
parsed = mwparserfromhell.parse(lombardy['body'])
parsed.filter_templates()

The result is every template on the page except the infobox (which is arguably the most interesting template on the page).

ramayer avatar Apr 17 '21 22:04 ramayer

I see other issues where "skip_style_tags=True" is a workaround - but it didn't help in this case.

I modified my code to try:

 parsed = mwparserfromhell.parse(lombardy['body'], skip_style_tags=True)
 parsed.filter_templates()

and still don't see the infobox from Lombardy's page.

ramayer avatar Apr 17 '21 22:04 ramayer

What version of mwparserfromhell are you using, and what revision ID of Lombardy are you trying to load? I don't have any problem parsing the infobox with the latest parser version on the current revision of that page.

earwig avatar Apr 19 '21 06:04 earwig

Thanks. It took me a while to try again, but it's working for me now.

ramayer avatar Aug 17 '22 17:08 ramayer