Text for elements containing child nodes is corrupt or incomplete
Background
I currently have a requirement in which the text of multiple child nodes must be concatenated.
Let's say I have the following HTML:
<h2>
  Excerpt from <em>Cré na Cille</em>
</h2>
<div id="div-with-children">
  <p>
    <strong>Ní mé</strong> an ar Áit <em>an Phuint</em> nó <em>na gCúig Déag</em> atá mé curtha?
  </p>
</div>
When selecting the div above, I'd like to get the text of the div and all of its descendant nodes as one concatenated string. This JSFiddle demonstrates the behaviour using jQuery.
However, as of 0.4.0, what I get is only the text found directly at the element's own level, with gaps wherever the child nodes begin and end in the original layout.
Reproducible Scenario
Create and run this main.rs file.
extern crate rquery;

use rquery::Document;

fn new_document() -> Document {
    Document::new_from_xml_string(r"
<html>
  <head></head>
  <body>
    <h2>
      Excerpt from <em>Cré na Cille</em>
    </h2>
    <div id='div-with-children'>
      <p>
        <strong>Ní mé</strong> an ar Áit <em>an Phuint</em> nó <em>na gCúig Déag</em> atá mé curtha?
      </p>
    </div>
  </body>
</html>
").unwrap()
}

fn main() {
    let document = new_document();
    let element = document.select("div").unwrap();
    println!("{:?}", element);
}
Actual Behaviour
element.text() is nothing but whitespace: "\n \n"
Element {
  node_index: 6,
  tag_name: "div",
  children: Some([
    Element {
      node_index: 7,
      tag_name: "p",
      children: Some([
        Element {
          node_index: 8,
          tag_name: "strong",
          children: None,
          attr_map: {},
          text: "Ní mé"
        }, Element {
          node_index: 9,
          tag_name: "em",
          children: None,
          attr_map: {},
          text: "an Phuint"
        }, Element {
          node_index: 10,
          tag_name: "em",
          children: None,
          attr_map: {},
          text: "na gCúig Déag"
        }
      ]),
      attr_map: {},
      text: "\n an ar Áit nó atá mé curtha?\n "
    }
  ]),
  attr_map: {"id": "div-with-children"},
  text: "\n \n"
}
Desired Behaviour
element.text() should be a concatenation of all text, including the text of descendant nodes (e.g. "Ní mé an ar Áit an Phuint nó na gCúig Déag atá mé curtha?").
Proposed Solution
Add child contents to the parent's text while streaming the document: https://github.com/yggie/rquery/pull/8
Potential Impact
If you go with the changes in my PR (pushing child text to the parent while streaming XmlEvent variants), I guess the main concern is memory. Every ancestor ends up holding its own copy of its descendants' text, so depending on the size of a document, the text for its outermost nodes could be quite large.
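To make the memory trade-off concrete, here is a minimal sketch of the "push child text to the parent while streaming" idea. It is not the code from the PR: `Event`, `Finished`, and `stream_with_child_text` are hypothetical names, and `Event` is only a simplified stand-in for the XmlEvent variants mentioned above.

```rust
// Hypothetical, simplified stand-in for the parser's events (an assumption,
// not the real XmlEvent type).
enum Event {
    Start(String), // opening tag, carrying the tag name
    Text(String),  // character data
    End,           // closing tag of the innermost open element
}

// Hypothetical record of an element once its closing tag has been seen.
struct Finished {
    tag_name: String,
    text: String,
}

// Walks the events once. Each open element owns a text buffer; when an
// element closes, its complete text is appended to its parent's buffer.
// Elements are returned in the order in which they were closed.
fn stream_with_child_text(events: &[Event]) -> Vec<Finished> {
    let mut open: Vec<(String, String)> = Vec::new(); // (tag name, text so far)
    let mut done = Vec::new();

    for event in events {
        match event {
            Event::Start(tag) => open.push((tag.clone(), String::new())),
            Event::Text(chars) => {
                if let Some((_, buffer)) = open.last_mut() {
                    buffer.push_str(chars);
                }
            }
            Event::End => {
                if let Some((tag_name, text)) = open.pop() {
                    // This copy is the source of the memory concern: the
                    // same characters are stored again at every ancestor.
                    if let Some((_, parent_buffer)) = open.last_mut() {
                        parent_buffer.push_str(&text);
                    }
                    done.push(Finished { tag_name, text });
                }
            }
        }
    }
    done
}
```

With this approach a string nested N levels deep is stored N times, once per enclosing element, which is why the outermost nodes of a large document could become expensive.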
In future, perhaps placeholders could be added to the text field while streaming the document; the Element text() function would then be responsible for walking the node tree, merging child text into the parent text, and returning the final value.
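A rough sketch of what the placeholder idea might look like follows. All of the names here (`Segment`, `LazyElement`, `LazyDocument`) are hypothetical and do not reflect rquery's actual types; the only assumption borrowed from the output above is that elements can be addressed by an index, much like node_index.

```rust
// One piece of an element's content: either literal character data or a
// placeholder pointing at a child element by its index.
enum Segment {
    Literal(String),
    Child(usize),
}

// Hypothetical element that stores segments instead of a flattened string.
#[allow(dead_code)]
struct LazyElement {
    tag_name: String,
    segments: Vec<Segment>,
}

// Hypothetical document holding all elements in a flat list, so that a
// placeholder only needs to store an index.
struct LazyDocument {
    nodes: Vec<LazyElement>,
}

impl LazyDocument {
    // Builds the full text of the element at `index` on demand, in
    // document order, without storing a copy at every ancestor.
    fn text(&self, index: usize) -> String {
        let mut out = String::new();
        self.write_text(index, &mut out);
        out
    }

    // Recursive helper: literals are appended as-is, placeholders are
    // resolved by descending into the referenced child.
    fn write_text(&self, index: usize, out: &mut String) {
        for segment in &self.nodes[index].segments {
            match segment {
                Segment::Literal(chars) => out.push_str(chars),
                Segment::Child(child_index) => self.write_text(*child_index, out),
            }
        }
    }
}
```

Under these assumptions each character is stored once, and the cost of concatenation is paid only when text() is called on a particular element. The trade-off is that repeated calls redo the walk unless the result is cached.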