python-xextract

Allow Elements to be passed to parse_*()

Open levic opened this issue 3 years ago • 1 comments

Addresses #10

@Mimino666 There's no documentation here yet (I wasn't going to add it until you're happy with what I've done)

Handling parse() was an unexpected quirk: if we only have an Element, there doesn't seem to be any way to know whether the document was parsed as HTML or XML, so we don't know whether to use an XML or an HTML extractor.

We can guess based on the presence (or absence) of a namespace on the Element, but XML snippets can be parsed without a namespace, so that could still lead to unexpected results. Guessing also has the side effect of casting the Element back to a string as part of the XML header snooping, which is what we were trying to avoid in the first place (although a check for this could be added).
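To illustrate why the namespace guess is unreliable, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra dependencies (both spell namespaced tags as '{uri}localname'), and guess_is_xml is a hypothetical helper, not something in xextract:

```python
import xml.etree.ElementTree as ET


def guess_is_xml(element):
    """Hypothetical heuristic: treat an element as XML if its tag is namespaced."""
    # Namespaced tags are spelled '{uri}localname'.
    return isinstance(element.tag, str) and element.tag.startswith('{')


namespaced = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom"><title>t</title></feed>'
)
plain = ET.fromstring('<feed><title>t</title></feed>')

print(guess_is_xml(namespaced))  # True
print(guess_is_xml(plain))       # False, although this snippet was parsed as XML too
```

The second snippet is perfectly valid XML, yet the heuristic would hand it to an HTML extractor.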

I've opted to force the caller to be explicit: if you want to pass an Element to parse() then you must use parse_html() or parse_xml() instead.
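Roughly, the rule could be sketched like this. ExplicitParser is a hypothetical stand-in, the header check is simplified to a leading `<?xml`, and raising TypeError is my assumption for the sketch rather than necessarily what the patch does:

```python
import xml.etree.ElementTree as ET


class ExplicitParser:
    """Hypothetical stand-in for a parser; not xextract's actual class."""

    def parse(self, body):
        # Strings can be sniffed for an XML header, so auto-detection works.
        if isinstance(body, (str, bytes)):
            text = body.decode() if isinstance(body, bytes) else body
            is_xml = text.lstrip().startswith('<?xml')
            return self.parse_xml(body) if is_xml else self.parse_html(body)
        # An already-parsed element carries no record of how it was parsed.
        raise TypeError(
            'parse() needs a string; use parse_html() or parse_xml() for elements'
        )

    def parse_html(self, body):
        return ('html', body)  # placeholder result for the sketch

    def parse_xml(self, body):
        return ('xml', body)   # placeholder result for the sketch


parser = ExplicitParser()
print(parser.parse('<?xml version="1.0"?><a/>')[0])  # xml
print(parser.parse('<div></div>')[0])                # html
```

Passing an element to parse() fails loudly, while parse_html() and parse_xml() accept it because the caller has already said which extractor to use.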

levic, Feb 06 '23 16:02

Calling code would now look like:

    def test_element_as_parser(self):
        """
        we can pass an Element as the extractor to parse_*()
        """
        html = '''
            <div><span>Hello world!</span></div>
            <div></div>
            <div><span>Hello mars!</span></div>
        '''

        # take only the third container so we can verify that the correct descendant is chosen
        container = Element(css='div', count=3).parse(html)[2]

        val = Element(css='span', count=1).parse_html(container)
        self.assertEqual(val.tag, 'span')
        self.assertEqual(val.text, 'Hello mars!')

The important line is val = Element(css='span', count=1).parse_html(container). Instead of re-parsing the tree, the container Element passed to parse_html() is simply wrapped in a new HtmlXPathExtractor.
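The wrapping idea, as a minimal sketch. SketchExtractor is a hypothetical class that only stands in for the concept (it is not HtmlXPathExtractor), and it uses the standard library's xml.etree.ElementTree instead of lxml so it runs on its own:

```python
import xml.etree.ElementTree as ET


class SketchExtractor:
    """Hypothetical extractor; illustrates wrapping, not the real implementation."""

    def __init__(self, body):
        if isinstance(body, str):
            self.root = ET.fromstring(body)  # string input: parse it
        else:
            self.root = body                 # element input: keep the reference as-is

    def select(self, path):
        # Each match is wrapped in a new extractor; nothing is serialised or re-parsed.
        return [SketchExtractor(el) for el in self.root.findall(path)]


doc = ET.fromstring(
    '<body>'
    '<div><span>Hello world!</span></div>'
    '<div></div>'
    '<div><span>Hello mars!</span></div>'
    '</body>'
)
container = doc.findall('div')[2]

extractor = SketchExtractor(container)
span = extractor.select('span')[0].root
print(span.text)                    # Hello mars!
print(extractor.root is container)  # True
```

The identity check at the end is the point: the extractor holds the very same element object, so there is no round trip through a string.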

levic, Feb 06 '23 17:02