Parser misses the infobox on the current Lombardy page (possibly because of a comment in the name?)
The mwparserfromhell parser is missing some infoboxes, such as the one on the current Lombardy page (https://en.wikipedia.org/wiki/Lombardy).
I suspect it's because someone put a comment in the infobox's first field, like this:
{{Infobox settlement
<!-- See Template:Infobox settlement for additional fields and descriptions --> | name = Lombardy
| official_name =
| native_name = {{native name|it|Lombardia}}<br/>{{lang|lmo|Lombardia}}
| native_name_lang =
| settlement_type = [[Region of Italy]]
...
}}
This is the code I used. The table tmp_wikipedia contains just the original title and body from last week's Wikipedia dump.
import mwparserfromhell

# `spark` is assumed to be an existing SparkSession
lombardy = spark.sql('''select body from tmp_wikipedia where title = 'Lombardy' limit 1''').take(1)[0].asDict(True)
parsed = mwparserfromhell.parse(lombardy['body'])
parsed.filter_templates()
and the result is all templates on the page except the Infobox (which is arguably the most interesting template on the page).
I see other issues where "skip_style_tags=True" is a workaround - but it didn't help in this case.
I modified my code to try:
parsed = mwparserfromhell.parse(lombardy['body'], skip_style_tags=True)
parsed.filter_templates()
and still don't see the infobox from Lombardy's page.
What version of mwparserfromhell are you using, and what revision ID of Lombardy are you trying to load? I don't have any problem parsing the infobox with the latest parser version on the current revision of that page.
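For anyone hitting the same thing, here is a rough sketch of how the version could be checked and the comment theory tested in isolation. The helper names `installed_version` and `infobox_names` and the reduced `SAMPLE` text are made up for illustration; they are not part of mwparserfromhell, and the sample is only an approximation of the page's wikitext, not the actual revision.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Hypothetical reduced test case: an infobox whose name is followed by a comment.
SAMPLE = (
    "{{Infobox settlement\n"
    "<!-- See Template:Infobox settlement for additional fields and descriptions -->"
    " | name = Lombardy\n"
    "}}"
)


def infobox_names(wikitext):
    """List the names of parsed templates that start with 'Infobox'."""
    import mwparserfromhell  # imported lazily so the version check works without it

    templates = mwparserfromhell.parse(wikitext).filter_templates()
    # strip_code() drops comments, leaving only the visible template name
    names = [t.name.strip_code().strip() for t in templates]
    return [n for n in names if n.lower().startswith("infobox")]


if __name__ == "__main__":
    found = installed_version("mwparserfromhell")
    print("mwparserfromhell version:", found)
    if found is not None:
        print("infoboxes found:", infobox_names(SAMPLE))
```

If the list comes back empty on the installed version, that would point at the parser rather than at the dump contents; if it contains the infobox, the problem is more likely in the stored page body.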
Thanks - it took a while for me to try again, but it's working for me now.