php-htmldiff icon indicating copy to clipboard operation
php-htmldiff copied to clipboard

HTML entities are not kept intact, leading to invalid HTML

Open aboks opened this issue 2 years ago • 0 comments

The diff algorithm does not always keep HTML entity references intact, which can be problematic when later loading the resulting diff as HTML (e.g. when diffing a list entry). This is best illustrated by the following test:

<options></options>

<oldText>
    <ol><li>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non justo &amp; sapien;</li></ol>
</oldText>

<newText>
    <ol><li>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non sapien et justo;</li></ol>
</newText>

<expected>
    <ol class="diff-list"><li class="normal">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non <ins class="diffins">sapien et </ins>justo<del class="diffdel"> &amp; sapien</del>;</li></ol>
</expected>

This test currently crashes with an error message DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 1 from ListDiffLines.php:409. This is because it tries to load the string

<body>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non <ins class="diffins">sapien et </ins>justo<del class="diffdel"> &amp</del>;<del class="diffdel"> sapien;</del></body>

You can see that the diff algorithm broke up the &amp from its terminating ;, which leads to invalid HTML.

The solution might be to consider HTML entities a special case in the regex in AbstractDiff.php line 457 (e.g. adding an extra &[a-zA-Z0-9]+; case there). That fixes the given test without breaking the others, but I cannot oversee what further possible impact that may have.

aboks avatar Feb 07 '23 14:02 aboks