Text for elements containing child nodes is corrupt or incomplete

Open ancamcheachta opened this issue 8 years ago • 0 comments

Background

I currently have a requirement in which the text of multiple child nodes must be concatenated.

Let's say I have the following HTML:

<h2>
	Excerpt from <em>Cré na Cille</em>
</h2>
<div id="div-with-children">
	<p>
		<strong>Ní mé</strong> an ar Áit <em>an Phuint</em> nó <em>na gCúig Déag</em> atá mé curtha?
	</p>
</div>

When selecting the div above, I'd like to get the text of the div together with the text of all its child nodes as one single, concatenated string. This JSFiddle demonstrates the behaviour using jQuery.

However, as of 0.4.0, each element's text is only a concatenation of the characters found directly at that element's level, with gaps wherever child nodes begin or end in the original layout.

Reproducible Scenario

Create and run this main.rs file.

extern crate rquery;
use rquery::Document;

fn new_document() -> Document {
    Document::new_from_xml_string(r"
<html>
<head></head>
<body>
<h2>
  Excerpt from <em>Cré na Cille</em>
</h2>
<div id='div-with-children'>
  <p>
  <strong>Ní mé</strong> an ar Áit <em>an Phuint</em> nó <em>na gCúig Déag</em> atá mé curtha?
  </p>
</div>
</body>
</html>
").unwrap()
}

fn main() {
    let document = new_document();
    let element = document.select("div").unwrap();
    
    println!("{:?}", element);
}

Actual Behaviour

element.text() is nothing but whitespace: "\n  \n"

Element {
	node_index: 6,
	tag_name: "div",
	children: Some([
		Element {
			node_index: 7,
			tag_name: "p",
			children: Some([
				Element {
					node_index: 8,
					tag_name: "strong",
					children: None,
					attr_map: {},
					text: "Ní mé"
				}, Element {
					node_index: 9,
					tag_name: "em",
					children: None,
					attr_map: {},
					text: "an Phuint"
				}, Element {
					node_index: 10,
					tag_name: "em",
					children: None,
					attr_map: {},
					text: "na gCúig Déag"
				}
			]),
			attr_map: {},
			text: "\n   an ar Áit  nó  atá mé curtha?\n  "
		}
	]),
	attr_map: {"id": "div-with-children"},
	text: "\n  \n"
}

Desired Behaviour

element.text() should be a concatenation of all text, including that of child nodes (e.g. "Ní mé an ar Áit an Phuint nó na gCúig Déag atá mé curtha?")
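To make the desired result concrete, here is a minimal sketch of jQuery-style text concatenation. It does not use rquery: `Node`, `text_content`, `collapse_whitespace` and `sample_div` are hypothetical names, and the tree keeps text and child elements interleaved in document order, which is an assumption that the current `Element` layout (one `text` field plus `children`) does not satisfy.

```rust
/// Hypothetical node type (not part of rquery): text and child elements are
/// kept interleaved in document order.
enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

/// Concatenates the text of a node and all of its descendants, depth first.
fn text_content(node: &Node) -> String {
    match node {
        Node::Text(chunk) => chunk.clone(),
        Node::Element { children, .. } => children.iter().map(text_content).collect(),
    }
}

/// Collapses every run of whitespace into one space and trims both ends.
fn collapse_whitespace(raw: &str) -> String {
    raw.split_whitespace().collect::<Vec<_>>().join(" ")
}

// Small constructors to keep the sample tree readable.
fn text(chunk: &str) -> Node {
    Node::Text(chunk.to_string())
}

fn el(tag: &str, children: Vec<Node>) -> Node {
    Node::Element { tag: tag.to_string(), children }
}

/// The `div` from the reproducible scenario, built by hand.
fn sample_div() -> Node {
    el("div", vec![
        text("\n  "),
        el("p", vec![
            text("\n  "),
            el("strong", vec![text("Ní mé")]),
            text(" an ar Áit "),
            el("em", vec![text("an Phuint")]),
            text(" nó "),
            el("em", vec![text("na gCúig Déag")]),
            text(" atá mé curtha?\n  "),
        ]),
        text("\n"),
    ])
}
```

Calling `collapse_whitespace(&text_content(&sample_div()))` gives the string quoted above. Without the whitespace step, the result would keep the line breaks and indentation of the source layout.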

Proposed Solution

Add child contents to the parent's text while streaming the document: https://github.com/yggie/rquery/pull/8

Potential Impact

If you go with the changes in my PR (pushing child text to the parent while streaming XmlEvent variants), I guess the main concern is memory: every piece of text is stored again in each ancestor, so depending on the size of a document, the text for its outermost nodes could be quite large.
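The streaming idea and its memory cost can be illustrated roughly as follows. This is not the code from the PR: `Event` is a hypothetical, simplified stand-in for the parser's event type, and `accumulate_text` is an illustrative name.

```rust
/// Hypothetical, simplified stand-in for a streaming XML parser's events.
enum Event {
    Start(String),
    Text(String),
    End,
}

/// Returns `(tag, full text)` for every element, in the order the elements close.
/// Each text chunk is appended to the buffer of every element that is currently
/// open, so a chunk nested `d` levels deep is stored `d` times.
fn accumulate_text(events: &[Event]) -> Vec<(String, String)> {
    let mut open: Vec<(String, String)> = Vec::new();
    let mut closed = Vec::new();
    for event in events {
        match event {
            Event::Start(tag) => open.push((tag.clone(), String::new())),
            Event::Text(chunk) => {
                // Push the text to the innermost element and all of its ancestors.
                for (_, buffer) in open.iter_mut() {
                    buffer.push_str(chunk);
                }
            }
            Event::End => {
                if let Some(done) = open.pop() {
                    closed.push(done);
                }
            }
        }
    }
    closed
}
```

In this sketch the root element's buffer ends up holding the text of the entire document, which is the memory concern described above.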

In future, perhaps placeholders could be added to the text field while streaming the document. The Element text() function would then be responsible for walking the node tree, merging child text into the parent text, and returning the final value.
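One possible shape for the placeholder idea is sketched below. All names here (`Segment`, `segments`, `append_text`) are hypothetical, and the `Document` and `Element` types are simplified stand-ins rather than rquery's own. Each element stores its own text chunks once, with a placeholder marking where each child sits, and the full text is only assembled on demand.

```rust
/// Hypothetical: one piece of an element's content, in document order.
enum Segment {
    /// Text found directly inside the element.
    Text(String),
    /// Placeholder for a child element, identified by its node index.
    Child(usize),
}

/// Simplified stand-in for an element: only its ordered content is kept.
struct Element {
    segments: Vec<Segment>,
}

/// Simplified stand-in for a document: elements addressed by node index.
struct Document {
    elements: Vec<Element>,
}

impl Document {
    /// Builds the full text of an element lazily by walking its subtree.
    fn text(&self, node_index: usize) -> String {
        let mut out = String::new();
        self.append_text(node_index, &mut out);
        out
    }

    fn append_text(&self, node_index: usize, out: &mut String) {
        for segment in &self.elements[node_index].segments {
            match segment {
                Segment::Text(chunk) => out.push_str(chunk),
                // Resolve the placeholder by recursing into the child.
                Segment::Child(child_index) => self.append_text(*child_index, out),
            }
        }
    }
}
```

The trade-off in this sketch is that each call to `text()` costs time proportional to the size of the subtree, in exchange for storing every character only once.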

ancamcheachta, Nov 01 '17 22:11