filter_external_links truncates colon at the end of URL

Open harej opened this issue 1 year ago • 2 comments

Test case:

import mwparserfromhell

wikitext = """
<ref>{{cite news | first=109th Congress, 1st Session | last=U.S. Senate |  title= S. 1033, Secure America and Orderly Immigration Act | date=[[May 12]] [[2005]] |  url =http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033: | work =Thomas |  accessdate = 2007-09-30 | }}</ref>
"""

parsed = mwparserfromhell.parse(wikitext)
print(parsed.filter_external_links())

What I get: ['http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033']

What I should get: ['http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033:'] (with the colon at the end)

harej, Sep 11 '24 05:09

Yes, that's a valid URL, or at least it was nearly 20 years ago. https://web.archive.org/web/20080918055001/http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033:

(You may need to copy that URL with the colon into the address bar manually)

harej, Sep 11 '24 05:09

See the difference here:

import mwparserfromhell

wikitext1 = "http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033:"
wikitext2 = "[http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033: foo]"

parsed1 = mwparserfromhell.parse(wikitext1)
parsed2 = mwparserfromhell.parse(wikitext2)
print(parsed1.filter_external_links())
print(parsed2.filter_external_links())

Which gives

['http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033']
['[http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033: foo]']

Note that this is consistent with how MediaWiki behaves :shrug:
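To make the bare-URL behavior concrete, here is a rough pure-Python sketch of the trailing-punctuation rule as I understand it from MediaWiki's free-link handling. It is a simplified approximation, not the actual code of MediaWiki or mwparserfromhell; the function name `split_trailing_punctuation` and the exact character set are assumptions made for illustration.

```python
def split_trailing_punctuation(url):
    """Split a bare URL into (link, trail), approximating MediaWiki's rule.

    Hypothetical helper: trailing punctuation is moved out of a free
    (unbracketed) link and treated as ordinary text after it.
    """
    # Assumed separator set; the real parsers may differ in detail.
    separators = ",;.:!?"
    # A closing parenthesis is only stripped when the URL has no opening one.
    if "(" not in url:
        separators += ")"
    link = url.rstrip(separators)
    trail = url[len(link):]
    return link, trail


link, trail = split_trailing_punctuation(
    "http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033:"
)
print(link)   # URL without the final colon
print(trail)  # the colon, treated as plain text
```

A bracketed link such as `[http://... foo]` does not go through this stripping, because the URL ends at the first space inside the brackets rather than at a guessed boundary.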

For your snippet, the issue is that mwparserfromhell does not expand templates, so it cannot know that the url parameter ends up inside square brackets once the template is rendered.

lahwaacz, Nov 15 '24 11:11