
Incorrect parsing doc with multiple self-closing nodes

Open • traceflight opened this issue 1 year ago • 1 comment

A document like the following

<data>
    <a />
    <b />
</data>

is parsed to

...
<data>
    <a>
      <b>
      </b>
  </a>
</data>
...

playground

traceflight avatar Jan 20 '25 09:01 traceflight

I investigated this issue, and it appears to be the expected behavior after all.

Chromium does the same:

Image

And here is what I found about this topic:

  • Self-closing syntax creation (tokenization): “Self-closing start tag state”
  • What happens to that flag (tree construction): “The ‘in body’ insertion mode” (and other HTML insertion modes)
    • Location: WHATWG HTML Standard → Parsing → Tree construction
    • Rule: For ordinary HTML start tags, the algorithm inserts the element as usual and does not acknowledge the self-closing flag. The element stays open.
    • If a start tag’s self-closing flag is set and is not acknowledged during tree construction, it is a parse error (“self-closing flag not acknowledged”). This matches html5ever’s “Unacknowledged self-closing tag”.
  • Void elements list (the only HTML elements that are “empty” by definition)

What this means for <a /> and then <b />

  • <a /> is a normal HTML start tag with the self-closing flag set; the tree construction algorithm does not acknowledge that flag in HTML content, so <a> remains open.
  • When <b /> is parsed, the current node is still <a>. The appropriate insertion location is LastChild(<a>), so <b> is inserted inside <a>.

If the tag is a void element instead, it is closed immediately, which is also the expected behavior:

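// `extern crate` is not needed on the 2018 edition or later.
fn main() {
    // <br> is a void element; <b> is an ordinary element.
    let data = r#"
    <data>
    <br />
    <b />
    </data>"#;
    // `data` is already a `&str`, so it can be passed directly.
    let html = scraper::Html::parse_document(data);
    println!("{}", html.html());
}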

Will result in:

<html><head></head><body><data>
    <br>
    <b>
    </b></data></body></html>
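
The two cases above (an ordinary element such as <a /> versus a void element such as <br />) can be illustrated with a toy model of the stack of open elements. This is only a minimal sketch under simplifying assumptions: it is not html5ever's or scraper's code, it ignores text nodes and insertion modes, and every name in it (`Token`, `Node`, `VOID`, `build`, `render`) is hypothetical.

```rust
// Toy model of the "self-closing flag is not acknowledged" rule.
// NOT html5ever's implementation; all names here are hypothetical.

enum Token {
    Start { name: &'static str, self_closing: bool },
    End { name: &'static str },
}

struct Node {
    name: String,
    children: Vec<usize>, // indices into the arena
}

// A subset of the HTML void elements, enough for the demo.
const VOID: &[&str] = &["br", "hr", "img", "input", "meta", "link"];

/// Builds a tree from tokens; returns the node arena and the parse errors.
fn build(tokens: &[Token]) -> (Vec<Node>, Vec<String>) {
    let mut arena = vec![Node { name: "#root".into(), children: vec![] }];
    let mut open = vec![0usize]; // stack of open elements; last = current node
    let mut errors = Vec::new();
    for tok in tokens {
        match tok {
            Token::Start { name, self_closing } => {
                // Always insert as the last child of the current node.
                let id = arena.len();
                arena.push(Node { name: name.to_string(), children: vec![] });
                let parent = *open.last().unwrap();
                arena[parent].children.push(id);
                if !VOID.contains(name) {
                    // Ordinary element: the flag is ignored, the element stays open.
                    if *self_closing {
                        errors.push(format!("unacknowledged self-closing tag <{}/>", name));
                    }
                    open.push(id);
                }
                // Void element: never pushed, so it cannot receive children.
            }
            Token::End { name } => {
                // Pop up to and including the nearest open element with this name.
                let pos = open.iter().rposition(|&i| arena[i].name == *name);
                if let Some(pos) = pos {
                    open.truncate(pos);
                }
            }
        }
    }
    (arena, errors)
}

/// Serializes the subtree rooted at `id` (whitespace-free).
fn render(arena: &[Node], id: usize) -> String {
    let node = &arena[id];
    let inner: String = node.children.iter().map(|&c| render(arena, c)).collect();
    if id == 0 {
        inner
    } else if VOID.contains(&node.name.as_str()) {
        format!("<{}>", node.name)
    } else {
        format!("<{0}>{1}</{0}>", node.name, inner)
    }
}

fn main() {
    // Token stream for: <data><a /><b /></data>
    let tokens = [
        Token::Start { name: "data", self_closing: false },
        Token::Start { name: "a", self_closing: true },
        Token::Start { name: "b", self_closing: true },
        Token::End { name: "data" },
    ];
    let (arena, errors) = build(&tokens);
    println!("{}", render(&arena, 0)); // <data><a><b></b></a></data>
    for e in &errors {
        println!("parse error: {e}");
    }
}
```

In this model the nesting falls out of a single decision: only void elements skip the push onto the stack of open elements, so a trailing slash on any other tag changes nothing except the reported parse error.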

sandrohanea avatar Oct 07 '25 10:10 sandrohanea