Incorrect parsing doc with multiple self-closing nodes

Open traceflight opened this issue 1 year ago • 1 comments

A document like the following

<data>
    <a />
    <b />
</data>

is parsed as

...
<data>
    <a>
      <b>
      </b>
  </a>
</data>
...

playground

Jan 20 '25 09:01 traceflight

I investigated this issue, and it appears to be the expected behavior after all.

Chromium does the same.

Here is what I found on this topic:

Self-closing syntax creation (tokenization): “Self-closing start tag state”
- Location: WHATWG HTML Standard → Parsing → Tokenization
- Effect: When the lexer sees “/>”, it sets the token’s self-closing flag; it does not create an end tag.
What happens to that flag (tree construction): “The ‘in body’ insertion mode” (and other HTML insertion modes)
- Location: WHATWG HTML Standard → Parsing → Tree construction
- Rule: For ordinary HTML start tags, the algorithm inserts the element as usual and does not acknowledge the self-closing flag. The element stays open.
- If a start tag’s self-closing flag is set and is not acknowledged during tree construction, it is a parse error (“self-closing flag not acknowledged”). This matches html5ever’s “Unacknowledged self-closing tag”.
Void elements list (the only HTML elements that are “empty” by definition)
- Location: WHATWG HTML Standard → The HTML syntax → Elements → “Void elements”
- Examples: area, base, br, col, embed, hr, img, input, link, meta, param, source, track, wbr.
- For these, a start tag inserts the element and there’s nothing to close; using “/>” is allowed but not required.

What this means for <a /> then <b />

<a /> is a normal HTML start tag with the self-closing flag set; the tree construction algorithm does not acknowledge that flag in HTML content, so <a> remains open.
When <b /> is parsed, the current node is still <a>. The appropriate insertion location is LastChild(<a>), so <b> is inserted inside <a>.
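
To make the rule above concrete, here is a minimal toy model of the stack of open elements. It is only a sketch under simplifying assumptions, not html5ever's or scraper's actual implementation: the `Token` type, the `VOID` list and the `build` function are hypothetical names made up for this illustration, and end-tag handling is reduced to "pop up to the nearest matching open element".

```rust
/// A token as a tokenizer might emit it (toy type, not html5ever's).
#[derive(Debug, Clone, Copy)]
enum Token {
    Start { name: &'static str, self_closing: bool },
    End(&'static str),
}

/// Void elements, taken from the list quoted earlier in this comment.
const VOID: &[&str] = &[
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
];

/// Returns the (element, parent) pairs in insertion order, plus the
/// parse errors raised for self-closing flags that were not acknowledged.
fn build(tokens: &[Token]) -> (Vec<(&'static str, Option<&'static str>)>, Vec<String>) {
    let mut open: Vec<&'static str> = Vec::new(); // stack of open elements
    let mut inserted = Vec::new();
    let mut errors = Vec::new();

    for &token in tokens {
        match token {
            Token::Start { name, self_closing } => {
                // The new element always becomes the last child of the current node.
                inserted.push((name, open.last().copied()));
                if VOID.contains(&name) {
                    // Void element: never pushed, so it cannot receive children.
                    // This is where the self-closing flag counts as acknowledged.
                } else {
                    // Ordinary element: the flag is ignored and the element stays open.
                    if self_closing {
                        errors.push(format!("self-closing flag not acknowledged: <{name} />"));
                    }
                    open.push(name);
                }
            }
            Token::End(name) => {
                // Simplified: close everything up to and including the nearest
                // matching open element; ignore the end tag if nothing matches.
                if let Some(pos) = open.iter().rposition(|&n| n == name) {
                    open.truncate(pos);
                }
            }
        }
    }
    (inserted, errors)
}

fn main() {
    // Token stream for <data><a /><b /></data>.
    let doc = [
        Token::Start { name: "data", self_closing: false },
        Token::Start { name: "a", self_closing: true },
        Token::Start { name: "b", self_closing: true },
        Token::End("data"),
    ];
    let (parents, errors) = build(&doc);
    for (child, parent) in &parents {
        println!("<{child}> has parent {parent:?}");
    }
    for error in &errors {
        println!("parse error: {error}");
    }
}
```

In this model <b> ends up with <a> as its parent, because <a> was never popped from the stack. Replacing <a /> with <br /> makes <b> a direct child of <data>.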

If the tag is a void element instead, it is closed immediately, which is also the expected behavior:

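use scraper::Html;

fn main() {
    // <br> is a void element; <b> is an ordinary element.
    let data = r#"
    <data>
    <br />
    <b />
    </data>"#;
    // `data` is already a &str, so it can be passed directly.
    let html = Html::parse_document(data);
    // Serialize the parsed tree back to HTML.
    println!("{}", html.html());
}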

This prints:

<html><head></head><body><data>
    <br>
    <b>
    </b></data></body></html>

Oct 07 '25 10:10 sandrohanea