scraper
Incorrect parsing of documents with multiple self-closing nodes
A document like the following
<data>
<a />
<b />
</data>
is parsed as
...
<data>
<a>
<b>
</b>
</a>
</data>
...
I investigated this issue, and it appears to be the expected behavior after all.
Chromium does the same.
Here is what I found on the topic:
- Self-closing syntax creation (tokenization): “Self-closing start tag state”
  - Location: WHATWG HTML Standard → Parsing → Tokenization
  - Effect: When the tokenizer sees “/>”, it sets the token’s self-closing flag; it does not create an end tag.
- What happens to that flag (tree construction): “The ‘in body’ insertion mode” (and other HTML insertion modes)
  - Location: WHATWG HTML Standard → Parsing → Tree construction
  - Rule: For ordinary HTML start tags, the algorithm inserts the element as usual and does not acknowledge the self-closing flag. The element stays open.
  - If a start tag’s self-closing flag is set and is not acknowledged during tree construction, it is a parse error (the standard's error code is “non-void-html-element-start-tag-with-trailing-solidus”). This matches html5ever’s “Unacknowledged self-closing tag”.
- Void elements list (the only HTML elements that are “empty” by definition)
  - Location: WHATWG HTML Standard → The HTML syntax → Elements → “Void elements”
  - Examples: area, base, br, col, embed, hr, img, input, link, meta, param, source, track, wbr.
  - For these, a start tag inserts the element and there’s nothing to close; using “/>” is allowed but not required.
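The rules above can be condensed into a small decision function. This is only a sketch of the behavior described, not html5ever's or the standard's actual algorithm: `process_start_tag`, `StartTagOutcome`, and `is_void` are hypothetical names, the void list is the one quoted above, and special cases such as foreign (SVG/MathML) content are ignored.

```rust
/// Hypothetical helper: true for the void elements listed above.
fn is_void(name: &str) -> bool {
    matches!(
        name,
        "area" | "base" | "br" | "col" | "embed" | "hr" | "img" | "input"
            | "link" | "meta" | "param" | "source" | "track" | "wbr"
    )
}

/// Hypothetical summary of what tree construction does with one start tag.
#[derive(Debug, PartialEq)]
struct StartTagOutcome {
    /// The element remains on the stack of open elements.
    stays_open: bool,
    /// The self-closing flag was set but never acknowledged.
    parse_error: bool,
}

/// Simplified model: only void elements acknowledge the self-closing flag.
fn process_start_tag(name: &str, self_closing: bool) -> StartTagOutcome {
    let void = is_void(name);
    StartTagOutcome {
        stays_open: !void,
        parse_error: self_closing && !void,
    }
}

fn main() {
    for (name, self_closing) in [("a", true), ("a", false), ("br", true), ("br", false)] {
        let slash = if self_closing { " /" } else { "" };
        println!("<{}{}> -> {:?}", name, slash, process_start_tag(name, self_closing));
    }
}
```

Under this model, only a non-void tag written with “/>” yields both an open element and a parse error.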
What this means for <a /> then <b />
- <a /> is a normal HTML start tag with the self-closing flag set; the tree construction algorithm does not acknowledge that flag in HTML content, so <a> remains open.
- When <b /> is parsed, the current node is still <a>. The appropriate insertion location is LastChild(<a>), so <b> is inserted inside <a>.
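To make the nesting concrete, here is a toy tree builder using only the standard library. It is a rough sketch under simplifying assumptions, not how html5ever or scraper is implemented: `Token` and `build` are hypothetical names, text nodes and the implied html/head/body elements are omitted, and the special handling real parsers apply to formatting elements is left out.

```rust
/// Simplified tokens; `self_closing` mirrors the tokenizer's self-closing flag.
#[derive(Clone, Copy)]
enum Token<'a> {
    Start { name: &'a str, self_closing: bool },
    End { name: &'a str },
}

/// Hypothetical helper: true for the void elements listed in the issue.
fn is_void(name: &str) -> bool {
    matches!(
        name,
        "area" | "base" | "br" | "col" | "embed" | "hr" | "img" | "input"
            | "link" | "meta" | "param" | "source" | "track" | "wbr"
    )
}

/// Serializes the tree implied by `tokens`, tracking a stack of open elements.
fn build<'a>(tokens: &[Token<'a>]) -> String {
    let mut out = String::new();
    let mut open: Vec<&'a str> = Vec::new();
    for token in tokens {
        match *token {
            // The self-closing flag is deliberately ignored here:
            // only void elements are left off the stack.
            Token::Start { name, .. } => {
                out.push_str(&format!("<{}>", name));
                if !is_void(name) {
                    open.push(name);
                }
            }
            // Close everything up to and including the matching open element.
            Token::End { name } => {
                if open.contains(&name) {
                    while let Some(top) = open.pop() {
                        out.push_str(&format!("</{}>", top));
                        if top == name {
                            break;
                        }
                    }
                }
            }
        }
    }
    // Close whatever is still open at end of input.
    while let Some(top) = open.pop() {
        out.push_str(&format!("</{}>", top));
    }
    out
}

fn main() {
    let doc = [
        Token::Start { name: "data", self_closing: false },
        Token::Start { name: "a", self_closing: true },
        Token::Start { name: "b", self_closing: true },
        Token::End { name: "data" },
    ];
    println!("{}", build(&doc)); // prints <data><a><b></b></a></data>
}
```

Replacing the `a` token with `br` in this sketch gives `<data><br><b></b></data>`, because `br` never goes onto the stack of open elements.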
If the tag is a void element instead, the self-closing syntax behaves as expected:
extern crate scraper;

fn main() {
    // <br> is a void element; <b> is not.
    let data = r#"
<data>
<br />
<b />
</data>"#;
    let html = scraper::Html::parse_document(data);
    println!("{}", html.html());
}
This results in:
<html><head></head><body><data>
<br>
<b>
</b></data></body></html>