java-html-sanitizer icon indicating copy to clipboard operation
java-html-sanitizer copied to clipboard

Nested lists get sanitised incorrectly

Open pcelentano opened this issue 5 years ago • 2 comments

An invalid list, still parsed correctly by browsers and email clients of the sort:

<ul>
 <li>Topic 1</li>
 <ul>
  <li>Item 1</li>
  <li>Item 2</li>
  <ul>
   <li>Sub 1 </li>
   <li>Sub 2</li>
   <li>Sub 3</li>
  </ul>
  <li>Item 3</li>
 </ul>
</ul>

Gets parsed as:

<ul>
  <li>Topic 1</li>
  <li>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>
        <ul>
          <li>Sub 1 </li>
          <li>Sub 2</li>
          <li>Sub 3</li>
        </ul>
      </li>
      <li>Item 3</li>
    </ul>
  </li>
</ul>

which is definitely not the intention of the original html. The proper desired output would be:

<ul>
  <li>Topic 1
    <ul>
      <li>Item 1</li>
      <li>Item 2
        <ul>
          <li>Sub 1 </li>
          <li>Sub 2</li>
          <li>Sub 3</li>
        </ul>
      </li>
      <li>Item 3</li>
    </ul>
  </li>
</ul>

Another proper desired solution would be to allow a change in this tag balancing behaviour. I see HtmlElementTables.impliedElements handles this logic and there is even this comment

    // If we require above that all <li>s appear in a <ul> or <ol> then
    // for symmetry, we require here that all content of a <ul> or <ol> appear
    // nested in a <li>.
    // This does not have the same security implications as the above, but is
    // symmetric.

But there is no way of overriding this behaviour

pcelentano avatar Sep 22 '20 13:09 pcelentano

If the sanitizer produces structurally invalid markup, there's a greater risk of HTML parsers getting confused so I can't do that.

If I allow overriding umpteen behaviors, then testing with a handful of configurations doesn't give confidence in most of the possible configurations.

If I were to add a class to elements synthesized in this way, then you could try to fix it up with some CSS.

<style>
  li.synthesized { list-style: none }
</style>

<ul>
 <li>Topic 1</li>
 <li class="synthesized"><ul>
  <li>Item 1</li>
  <li>Item 2</li>
  <li class="synthesized"><ul>
   <li>Sub 1 </li>
   <li>Sub 2</li>
   <li>Sub 3</li>
  </ul></li>
  <li>Item 3</li>
 </ul></li>
</ul>

Unfortunately, my CSS isn't strong enough to know whether there's a way to disable list-style when the only child of an <li> is a <ul> or <ol>. Something like the below would work, but isn't supported in CSS IIRC:

li:has(> ul:only-child), li:has(> ol:only-child) {
  list-style: none
}

mikesamuel avatar Sep 22 '20 14:09 mikesamuel

It seems to me that arguing that an invalid meaning has a precise meaning is twisting the meaning of "invalid". Within a <ul> or <ol> all non-whitespace content has to be inside a <li> tag. Upon encountering content outside of a <li> there are four possibilities:

  1. Discard it
  2. Surround it in a <li>
  3. Add it to the previous <li> if there is one, otherwise do what?
  4. Add it to the succeeding <li> if there is one, otherwise do what?

Obviously none of these are definitively correct, because the input is invalid. Furthermore, assuming a tag may extends past its explicit end as @pcelentano argues for seems a perverse treatment of a valid tag.

In this case option 2 is repeatable everywhere and does not discard potentially important content. I believe the current implementation provides the best solution possible.

simon-greatrix avatar Feb 06 '21 15:02 simon-greatrix