compiler icon indicating copy to clipboard operation
compiler copied to clipboard

fix: parse `select` correctly when it contains HTML element

Open lbrooney opened this issue 1 year ago • 7 comments

Closes #1069. Select was the problem not options, but that issue was prior to me looking at the WHATWG documentation and the Astro parser.

Changes

  • This fixes parsing of select elements when they contain HTML elements.
  • The issue is specifically with how End Tags were being handle when inSelect mode
    • Only option, optgroup, select, and template end tags were being handled.
    • Others were unhandled, this PR just changes so that if the end tag is not any of the ones from the previous bullet point, the parser should treat the tag as if it were inBody.
    • Start tag handles it this way already.
  • Previously the compiler could correctly produce a select containing a single HTML element, which I'll refer to as X, or a select containing an option containing a single X.
  • However if a HTML element, Y, is added as the next sibling to either the select, option, or X, Y will cannibalized by X becoming its child. Each subsequent element(child or sibling) will have whatever element was preceding it as its parent.

ie:

<select id="select-span-sibling">
  <span>Lemon</span>
  <option>Lime</option>
</select>
<div><span>Orange</span></div>
<div>Melon</div>

produces (formatted for readability):

<select id="select-span-sibling">
  <span>
    Lemon
    <option>
      Lime
    </option>
    <div>
      <span>
        Orange
        <div>
          Melon
        </div>
      </span>
    </div>
  </span>
</select>
  • This change fixes it so that the previous example now produces the expected output:
<select id="select-span-sibling">
  <span>Lemon</span> 
  <option>Lime</option>
</select>
<div><span>Orange</span></div>
<div>Melon</div>
  • I briefly tested the compiled WASM in a local Astro project and the generated HTML from the .astro was matching the expected/"correct" output.
  • I don't know if there are additional changes needed to be made to handle special cases or better match the spec or if this somehow broke how people were previously using it, but I believe this PR produces a more correct output than before.

Testing

I wasn't sure where best to do testing so I just added to test_printer where I could most immediately observe outputs. Additionally I wasn't sure what was the appropriate amount of tests I should add or how verbose to be with the naming, so I added only a few to cover some limited scenarios and it's far from exhaustive. I can add, remove, or change them if it is desired.

Docs

I did not change any documentation as I believe it's only a bugfix.

lbrooney avatar Apr 24 '25 20:04 lbrooney

🦋 Changeset detected

Latest commit: f7af8186f432a7b6dcb21fd76eb66f48a289640a

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@astrojs/compiler Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

changeset-bot[bot] avatar Apr 24 '25 20:04 changeset-bot[bot]

@lbrooney are you sure about the validity of the bug? The w3c validator returns an error when span is a child of select

ematipico avatar Apr 28 '25 15:04 ematipico

@lbrooney are you sure about the validity of the bug? The w3c validator returns an error when span is a child of select

So I believe you are correct in that a span is not a valid child of a select(if my understanding is correct the current spec says anything not explicity defined is an error).

To my knowledge, currently a span in an option is also not spec compliant either, but considering as of Chrome 135, with the introduction of customizable selects and the ability for them to contain rich HTML content that is changing. There is an active spec pull request here: whatwg/html#9799. Additional details can also be found in the previously linked Chrome dev blog as well.

Admittedly my example in the initial PR is bad as I'm not sure that even within the new spec changes whether that would be valid, but as of right now Chrome doesn't raise any concerns about it, it exists within the DOM, but it not visible/presented(maybe with some CSS you can make it visible). Apologies, however, I do think it demonstrated how broken the output produced could be.

I probably should have included another example in my opening issue, but for instance:

<select>
  <option><span>Lemon</span></option>
  <option><span>Lime</span></option>
</select>

currently produces(formatted):

<select>
  <option>
    <span>Lemon
      <option>
        <span>Lime</span>
      </option>
    </span>
  </option>
</select>

Taking advantage of this new feature is currently impossible within an Astro component.

Blink currently has this behind a runtime flag (the function above it does the actual check). Checking the references you can see how they've changed parsing, with html_tree_builder.cc I believe being most relevent in this case.

lbrooney avatar Apr 28 '25 16:04 lbrooney

Thanks for contributing!

Looking at the updated spec and this https://chromium-review.googlesource.com/c/chromium/src/+/5118452/10/third_party/blink/renderer/core/html/parser/html_tree_builder.cc the change looks a bit more involved than parsing with the inBody insertion mode.

Wondering if we should get rid of HTML error correction. Browsers are already optimized to do that, and it would avoid having to keep up with the HTML spec.

MoustaphaDev avatar May 03 '25 01:05 MoustaphaDev

I posted a proposal to discuss that!

https://github.com/withastro/roadmap/discussions/1166

MoustaphaDev avatar May 05 '25 16:05 MoustaphaDev

Thanks for contributing!

For why my solution is rather terse is that while, to my understanding, the new select spec is largely finalized, I didn't want to engineer a spec-for-spec implementation only for it to change and have it now be incorrect (long way to say I was lazy 😅). The current changes introduced by the PR solved my problems so I was satisfied with it. To your second point, if the produced HTML with this solution was not spec compliant, the browser would handle it, which while not ideal is well within their capabilities.

I can push some additional snapshot tests, but with the current PR if a select tag (properly closed/opened or not) is a child of an an element within a select, it is does not exist in the produced output.

<!-- any of these inputs -->
<select><option><span><select>Lemon</select></span></option></select>
<select><option><span><select></select>Lemon</span></option></select>
<select><option><span><select>Lemon</span></option></select>
<select><option><span></select>Lemon</span></option></select>
<!-- same output by astro -->
<select><option><span>Lemon</span></option></select>`

The behaviour on handled cases within inSelect.

<!-- select tag in option any of these inputs -->
<select><option><select></select><span>Lemon</span></option></select>
<select><option><select><span>Lemon</span></option></select>
<select><option></select><span>Lemon</span></option></select>
<!-- same output by astro -->
<select><option></option></select><span>Lemon</span>

I posted a proposal to discuss that!

withastro/roadmap#1166

In regards to your proposal, I was having similar thoughts when I initally came across this bug and found that I didn't encounter it with React TSX. I believe that's essentially how the current Astro TSX parser workers by just ignoring parse errors, and I may be incorrect in this, I believe React TSX also doesn't "correct" improper HTML.

lbrooney avatar May 05 '25 19:05 lbrooney

I believe React TSX also doesn't "correct" improper HTML.

How / why should it? All the DOM knowledge is in react-dom and JSX is constructed via simple rules - which do not follow HTML(5) rules. The tree is constructed just by using methods such as createElement and appendChild, not via the HTML(5) parser (which would take a string to be parsed as a fragment, i.e., be much slower and not synced with the VDOM).

In short React / JSX does not deal at all with HTML - just with DOM nodes. The only time innerHTML is used is via the special dangerouslySetInnerHTML prop on "native" elements (i.e., handled at runtime via react-dom).

FlorianRappl avatar May 25 '25 10:05 FlorianRappl