Nested lists get sanitised incorrectly
An invalid list, still parsed correctly by browsers and email clients of the sort:
<ul>
<li>Topic 1</li>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<ul>
<li>Sub 1 </li>
<li>Sub 2</li>
<li>Sub 3</li>
</ul>
<li>Item 3</li>
</ul>
</ul>
Gets parsed as:
<ul>
<li>Topic 1</li>
<li>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>
<ul>
<li>Sub 1 </li>
<li>Sub 2</li>
<li>Sub 3</li>
</ul>
</li>
<li>Item 3</li>
</ul>
</li>
</ul>
which is definitely not the intention of the original html. The proper desired output would be:
<ul>
<li>Topic 1
<ul>
<li>Item 1</li>
<li>Item 2
<ul>
<li>Sub 1 </li>
<li>Sub 2</li>
<li>Sub 3</li>
</ul>
</li>
<li>Item 3</li>
</ul>
</li>
</ul>
Another proper desired solution would be to allow a change in this tag balancing behaviour.
I see HtmlElementTables.impliedElements handles this logic and there is even this comment
// If we require above that all <li>s appear in a <ul> or <ol> then
// for symmetry, we require here that all content of a <ul> or <ol> appear
// nested in a <li>.
// This does not have the same security implications as the above, but is
// symmetric.
But there is no way of overriding this behaviour
If the sanitizer produces structurally invalid markup, there's a greater risk of HTML parsers getting confused so I can't do that.
If I allow overriding umpteen behaviors, then testing with a handful of configurations doesn't give confidence in most of the possible configurations.
If I were to add a class to elements synthesized in this way, then you could try to fix it up with some CSS.
<style>
li.synthesized { list-style: none }
</style>
<ul>
<li>Topic 1</li>
<li class="synthesized"><ul>
<li>Item 1</li>
<li>Item 2</li>
<li class="synthesized"><ul>
<li>Sub 1 </li>
<li>Sub 2</li>
<li>Sub 3</li>
</ul></li>
<li>Item 3</li>
</ul></li>
</ul>
Unfortunately, my CSS isn't strong enough to know whether there's a way to disable list-style when the only child of an <li> is a <ul> or <ol>. Something like the below would work, but isn't supported in CSS IIRC:
li:has(> ul:only-child), li:has(> ol:only-child) {
list-style: none
}
It seems to me that arguing that an invalid meaning has a precise meaning is twisting the meaning of "invalid". Within a <ul> or <ol> all non-whitespace content has to be inside a <li> tag. Upon encountering content outside of a <li> there are four possibilities:
- Discard it
- Surround it in a
<li> - Add it to the previous
<li>if there is one, otherwise do what? - Add it to the succeeding
<li>if there is one, otherwise do what?
Obviously none of these are definitively correct, because the input is invalid. Furthermore, assuming a tag may extends past its explicit end as @pcelentano argues for seems a perverse treatment of a valid tag.
In this case option 2 is repeatable everywhere and does not discard potentially important content. I believe the current implementation provides the best solution possible.