
Text extraction from File: wikilinks has an issue

Open ikuyamada opened this issue 11 years ago • 5 comments

mwparserfromhell seems to have trouble extracting text from "File:" wikilinks that carry additional attributes.

In [1]: import mwparserfromhell
In [2]: w = "[[File:test.jpg|thumb|Label text]]"
In [3]: mwparserfromhell.parse(w).nodes[0].text
Out[3]: u'thumb|Label text'

I think the desired output is not "thumb|Label text" but "Label text".

ikuyamada, Nov 30 '14 09:11

@ikuyamada I would actually expect it to spit out an array containing ("thumb", "Label text"). I'm guessing that it just hasn't evolved to that point yet, and without that kind of support, "thumb|Label text" seems correct to me.

Technical-13, Nov 30 '14 14:11
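
The array-style output suggested above can be approximated outside the parser. This is a minimal sketch, not part of mwparserfromhell: `split_link_text` is a hypothetical helper that takes the string form of a wikilink's text (for example `str(node.text)`) and splits it on pipes, skipping pipes nested inside `[[...]]` or `{{...}}` so that a link or template in the caption stays intact.

```python
def split_link_text(text):
    """Split wikilink text on top-level pipes only (hypothetical helper)."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("[[", "{{"):
            # Entering a nested link or template: pipes inside are not separators.
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("]]", "}}") and depth > 0:
            depth -= 1
            current.append(pair)
            i += 2
        elif text[i] == "|" and depth == 0:
            # Top-level pipe: close the current segment.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


print(split_link_text("thumb|Label text"))  # ['thumb', 'Label text']
```

Taking the last element as the caption works for the example in this issue, but it is only a heuristic: a file link with no caption would return its last option instead.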

"thumb|Label text" is correct, since the parser treats all wikilink-like things the same way. Ideally, we would understand what a file is and treat its caption specially (so you could do node.caption instead of node.text, which would give the entire chunk), but this is problematic since we don't have a reliable way to determine what is a file link and what isn't, due to site- and language-specific namespace aliases. I suppose we could just have .caption exist for all links, but this would entail new parsing rules. I'm willing to add this since it's been requested before.

earwig, Nov 30 '14 15:11
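
To illustrate the namespace problem described above, here is a rough sketch of how a caller might decide whether a link title is a file link. Nothing here is mwparserfromhell API: `is_file_title` and `FILE_NAMESPACES` are hypothetical names, and the default set only covers the English `File` and `Image` prefixes. A caller working with another wiki would have to supply that site's own aliases, which is exactly the information the parser does not have.

```python
# Assumption: English-only defaults; other wikis need their own aliases.
FILE_NAMESPACES = frozenset({"file", "image"})


def is_file_title(title, namespaces=FILE_NAMESPACES):
    """Return True if the title's namespace prefix is a known file namespace."""
    prefix, sep, _ = title.partition(":")
    if not sep:
        return False  # no namespace prefix at all
    # Namespace names are case-insensitive and treat "_" like a space.
    normalized = prefix.strip().replace("_", " ").lower()
    return normalized in namespaces


print(is_file_title("File:test.jpg"))                          # True
print(is_file_title("Datei:test.jpg"))                         # False
print(is_file_title("Datei:test.jpg", namespaces={"datei"}))   # True
```

A title with a leading colon such as `:File:test.jpg` is reported as not a file link here, since the prefix before the first colon is empty.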

Feel free to :fish: me if it is already in there, but does this mean that you are going to parse the whole string and expose node.height, node.width, node.align, node.valign, node.mode (thumb, frameless, etc.), and node.link? If you are going to parse out each chunk, then you might as well put them in their own places.

Technical-13, Nov 30 '14 15:11
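
As a sketch of what per-attribute access could look like, the following classifies already-split file link options into separate fields. Everything here is an assumption for illustration: `FileOptions` and `classify_file_options` are hypothetical names, the keyword sets cover only a subset of the English option names, and localized keywords are not handled. Anything unrecognized is treated as the caption.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: English keywords only, and not an exhaustive list.
MODES = {"thumb", "thumbnail", "frame", "framed", "frameless"}
ALIGNS = {"left", "right", "center", "none"}
VALIGNS = {"baseline", "middle", "sub", "super",
           "text-top", "text-bottom", "top", "bottom"}
SIZE_RE = re.compile(r"^(\d*)(?:x(\d+))?px$")  # 200px, x100px, 200x100px


@dataclass
class FileOptions:
    mode: Optional[str] = None
    align: Optional[str] = None
    valign: Optional[str] = None
    width: Optional[int] = None
    height: Optional[int] = None
    link: Optional[str] = None
    caption: Optional[str] = None


def classify_file_options(options):
    """Sort a list of option strings into FileOptions fields."""
    result = FileOptions()
    for raw in options:
        opt = raw.strip()
        key = opt.lower()
        size = SIZE_RE.match(key)
        if key in MODES:
            result.mode = key
        elif key in ALIGNS:
            result.align = key
        elif key in VALIGNS:
            result.valign = key
        elif size and (size.group(1) or size.group(2)):
            result.width = int(size.group(1)) if size.group(1) else None
            result.height = int(size.group(2)) if size.group(2) else None
        elif key.startswith("link="):
            result.link = opt[len("link="):]
        else:
            result.caption = opt  # unrecognized option: assume caption
    return result


opts = classify_file_options(["thumb", "200x100px", "left", "Label text"])
print(opts.mode, opts.width, opts.height, opts.align, opts.caption)
```

Passing the keyword sets in as parameters instead of module constants would be one way to accommodate per-wiki keyword variants.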

Hm... that's a bit clunky, but I suppose it's better than having a dictionary or some other alternative I can't think of right now.

earwig, Jan 14 '15 06:01

Many arguments for file links can also have localized forms...

ricordisamoa, Jan 15 '15 01:01