Prefix handling in xpath queries does not resolve namespaces
Compare
package main

import (
	"fmt"

	"github.com/beevik/etree"
)

const xmlData = `<root xmlns:b='foo'><a /><b:b /><b:b xmlns:b='bar' /></root>`

func main() {
	doc := etree.NewDocument()
	doc.ReadFromString(xmlData)
	fmt.Printf("%+v\n", doc.FindElements("//b"))
	fmt.Printf("%+v\n", doc.FindElements("//b:b"))
}
which produces
$ go run showcase.go
[0xc0000b4240 0xc0000b42a0]
[0xc0000b4240 0xc0000b42a0]
to
import xml.etree.ElementTree as ET
import io
XML_DATA = "<root xmlns:b='foo'><a /><b:b /><b:b xmlns:b='bar' /></root>"
doc = ET.parse(io.StringIO(XML_DATA))
print(doc.findall('./b'))
# Fails, prefix b not defined
# print(doc.findall('.//b:b'))
print(doc.findall('./b:b', {'b': 'foo'}))
print(doc.findall('./b:b', {'b': 'bar'}))
# different prefix, still finds the same element!
print(doc.findall('./c:b', {'c': 'bar'}))
which results in
$ python showcase.py
[]
[<Element '{foo}b' at 0x7f0701e43e90>]
[<Element '{bar}b' at 0x7f0701e43ef0>]
[<Element '{bar}b' at 0x7f0701e43ef0>]
Note that in the Go version, both queries return both elements whose local name is b, and prefixes are compared only as literal strings. The Python version handles namespaces correctly, since:
- the unnamed namespace does not match any other namespace
- prefixes are resolved to namespace URIs. This implies that prefixes in XPath expressions have to be defined first. After that, the actual prefix does not matter, only the backing namespace URI
It would be nice if your etree package offered similar features. Searching by prefix only is a blocker when receiving XML documents whose prefixes are unknown (like the output of Go's XML encoder, which uses strange, but correct, prefix names and placement). How would you search for an XML element by namespace at all?
Does the namespace-uri feature not work in this case? For example:
fmt.Printf("%+v\n", doc.FindElements("//b[namespace-uri()='foo']"))
fmt.Printf("%+v\n", doc.FindElements("//b[namespace-uri()='bar']"))
Thanks a lot for looking into this!
That does work for me, the output is
[0xc0000b4240]
[0xc0000b42a0]
so, two different nodes, that looks correct.
Testing it in Python was hard, as Python's etree does not support namespace-uri(). I had to switch to lxml here and call its full XPath implementation. So in Python I did
print(doc.xpath("//b[namespace-uri()='foo']"))
print(doc.xpath("//b[namespace-uri()='bar']"))
where I had to do import lxml.etree as ET. I got [] both times, probably because the unprefixed b in the path is matched against the empty namespace, which contradicts the namespace-uri() predicate.
To sum up, I think using namespace-uri() is a workaround, but not an actual fix. The behaviour differs from both the XPath spec and Python's etree implementation, which might confuse users of a module that was inspired by that Python module.
I also don't see how namespace-uri() will ever work with attribute namespaces in the current implementation. (Given <root xmlns:ns1='...' xmlns:ns2='...' ns1:a='x' ns2:a='y'>, how would one select elements based on the value of ns1:a?)
Below is a slightly modified version of your example:
<root xmlns:x='foo'>
<x:child/>
<x:child xmlns:x='bar'/>
</root>
The namespace prefix x is defined as foo at the root element's scope. It is then redefined in the second child element's scope as bar.
I've tried this XML with several online XPath testers using the query //x:child, and they give different results. Some return both x:child elements. Others return only the second x:child element. After checking the XPath standard, I'm not surprised that everyone does it differently. To me, the standard isn't clear on how to deal with this situation of nested, redefined namespace declarations.
The python ElementTree module appears to support a namespace dictionary parameter, which you pass as an optional parameter when searching the XML document. That's certainly one way to resolve this problem. Unfortunately, this etree package does not (currently) support the passing of namespace dictionaries to path queries.
As a last remark, I'd like to point out that this package's path searching functionality is not intended to be fully compliant with XPath (or, for that matter, with the python ElementTree module). It merely claims to be inspired by python's ElementTree and to support "XPath-like" search features.
Can you give me an example of a type of query you're finding impossible to carry out? It's possible there are bugs in my code.
The second example you gave was this one:
<root xmlns:ns1='foo' xmlns:ns2='bar' ns1:a='x' ns2:a='y'>
You then asked how to select elements based on the value of ns1:a. I believe the following would work:
//*[@ns1:a='x']
Granted, there is currently no way to select an attribute belonging to a particular namespace URI, but I'm not sure if that's what you were asking how to do.
Thanks for taking a deep look into this. XML standards are hard. Your reply has basically two parts, let me answer both of them below.
Disclaimer: I understand that this is a 1% use case and most people are probably fine with the current library features (see https://github.com/golang/go/issues/9519 for an example of how other people use namespaces with Go). If you control the exact input XML format, matching the literal prefix is certainly a viable and easy solution.
XML standards
Which online sites did you try? In my experience, the Java XML implementation and everything based on libxml2 are pretty accurate.
For nested names, I think the relevant part is in XML Names 1.0, Sec. 6.1:
The scope of a namespace declaration declaring a prefix extends from the beginning of the start-tag in which it appears to the end of the corresponding end-tag, excluding the scope of any inner declarations with the same NSAttName part. In the case of an empty tag, the scope is the tag itself.
So the scope of an inner redeclaration is excluded from the scope of the outer namespace declaration.
For comparing elements, XPath 1.0 is indeed the place to look. Section 2.3 has:
A node test that is a QName is true if and only if [...] and has an expanded-name equal to the expanded-name specified by the QName. [...] A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). It is an error if the QName has a prefix for which there is no namespace declaration in the expression context.
So, according to that, a name is always expanded and prefixes must be declared. This looks pretty clear to me, but I am by no means an expert on XML standards.
My use case
My use case is basically parsing XML where I have no control over the namespace prefixes, but I know the namespaces from the definition of the XML structure (think XML schema). An example would even be Go's included XML writer (in encoding/xml), which seems to reuse the default namespace wherever possible but randomly uses '__' or '__1' for attributes if conflicts arise. This seems to be new, since cl/116056. IIRC the previous behaviour was just using ns0, ns1,.... So in light of changing prefix-handling behaviour, it seems important not to rely on the literal prefix.
While I can use the library in its current form using namespace-uri(), it feels cumbersome.
For attributes, you are right. I want to select attributes based on the namespace URI, not on the prefix (remember, I don't know the attribute's prefix). For now, hand-written code is probably needed.
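For illustration, such hand-written code could look roughly like the following when built on the standard encoding/xml decoder rather than on etree. It is a sketch under that assumption; elementsWithAttr is a hypothetical helper, and the document is the ns1/ns2 example from above.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// elementsWithAttr (hypothetical helper) returns the names of all elements
// that carry an attribute with the given namespace URI, local name and value.
// The literal prefix used in the document is never consulted.
func elementsWithAttr(doc, uri, local, value string) ([]xml.Name, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var found []xml.Name
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return found, nil
		}
		if err != nil {
			return nil, err
		}
		start, ok := tok.(xml.StartElement)
		if !ok {
			continue
		}
		for _, a := range start.Attr {
			// a.Name.Space holds the resolved namespace URI of the attribute.
			if a.Name.Space == uri && a.Name.Local == local && a.Value == value {
				found = append(found, start.Name)
				break
			}
		}
	}
}

func main() {
	const doc = `<root xmlns:ns1='foo' xmlns:ns2='bar' ns1:a='x' ns2:a='y'/>`
	hits, err := elementsWithAttr(doc, "foo", "a", "x")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(hits)) // number of elements whose {foo}a attribute equals "x"
}
```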
Random idea
Add a variant of FindElement (and possibly other functions), that takes an object that resolves namespaces, maybe call it FindElementNs. An optional second argument might be something like
type PrefixResolver interface {
	ResolvePrefix(string) string
}
One could have a type wrapping a map[string]string that implements it with a plain map lookup.
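To make the idea concrete, here is a rough, standalone sketch of the map-backed resolver. MapResolver and expandQName are hypothetical names that do not exist in etree, and treating an empty result as "prefix not declared" is my own assumption, since the interface returns only a string.

```go
package main

import (
	"fmt"
	"strings"
)

// PrefixResolver is the interface proposed above.
type PrefixResolver interface {
	ResolvePrefix(string) string
}

// MapResolver (hypothetical) wraps a prefix-to-URI map.
type MapResolver map[string]string

// ResolvePrefix returns the URI bound to prefix, or "" if there is none.
func (m MapResolver) ResolvePrefix(prefix string) string { return m[prefix] }

// expandQName (hypothetical) turns a QName from a path expression into an
// expanded name, following XPath 1.0, Sec. 2.3: no prefix means a null
// namespace URI, and an undeclared prefix is an error.
func expandQName(qname string, r PrefixResolver) (string, string, error) {
	prefix, local, hasPrefix := strings.Cut(qname, ":")
	if !hasPrefix {
		return "", qname, nil
	}
	uri := r.ResolvePrefix(prefix)
	if uri == "" { // assumption: empty result means "not declared"
		return "", "", fmt.Errorf("undeclared namespace prefix %q", prefix)
	}
	return uri, local, nil
}

func main() {
	// Two different prefixes bound to the same URI expand to the same name.
	u1, l1, _ := expandQName("b:b", MapResolver{"b": "bar"})
	u2, l2, _ := expandQName("c:b", MapResolver{"c": "bar"})
	fmt.Println(u1 == u2 && l1 == l2)

	// An undeclared prefix is rejected instead of being matched as text.
	_, _, err := expandQName("x:b", MapResolver{})
	fmt.Println(err != nil)
}
```

A path matcher could then compare these expanded names against the resolved names of elements and attributes, so the prefix used in the query would not have to agree with the one used in the document.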
WDYT?
I have decided to close this issue, as I believe it is an edge case, and I would prefer to keep the etree API as simple as possible.