fluent.js HTML parser used for markup in translations breaks self-closing elements

Both fluent-dom and fluent-react currently use the <template> element to safely parse markup in translations. The <template> element internally uses the HTML parser. This is similar in effect to using DOMParser.parseFromString with text/html as supported type.

The HTML parser has rather strict rules on parsing void elements, unlike the XML parser. It only allows the elements known to be void (e.g. input) to be self-closing (<element/>). Non-void elements are parsed as if they weren't properly closed.

This means that the following markup:

A <span/> B

…is parsed as:

A <span> B</span>

The consequence for overlays is that localizers can't use the self-closing form with non-void elements. Admittedly, they're not valid HTML markup, but the difference is rather subtle:

# This inserts "!" into the span.
hello = Hello, <span data-l10n-name="user" />!

# This works as expected.
hello = Hello, <span data-l10n-name="user"></span>!

Apr 17 '18 16:04 stasm

One way to fix this would be to use DOMParser with the application/xml type for parsing markup in translations. XML has stricter parsing rules in general but it also allows any elements to be self-closing and it still produces well-formed documents.

We'd need to add the DTD for at least some HTML entities that localizers might expect to work. The full DTD is huge, clocking in at over 11KB (preview). Perhaps we could only cherry pick and support whitespace characters, like   and ? In any case, it would be useful to strictly define the list of entities that we support.

Apr 17 '18 16:04 stasm

I think the practical impact of not supporting this is very limited.

From my experience reviewing sign-offs, I don't think I've ever seen localizers changing the markup used in the reference string. If it's <span></span>, it remains unchanged.

In some cases I've seen <img id="foo" /> losing the space before the closing slash, but that's as far as changes go.

Apr 18 '18 09:04 flodolo

I think I ran into this, I tried to do this:

<Localized
  id="comments-userBoxAuthenticated-signedInAs"
  username={<Username />}
>
  <Typography variant="bodyCopy" container="div"> 
    {"Signed in as <username />."}
  </Typography>
</Localized>

comments-userBoxAuthenticated-signedInAs =
  Signed in as <username/>.

But the . is swallowed at the end while this works:

<Localized
  id="comments-userBoxAuthenticated-signedInAs"
  username={<Username />}
>
  <Typography variant="bodyCopy" container="div"> 
    {"Signed in as <username></username>."}
  </Typography>
</Localized>

comments-userBoxAuthenticated-signedInAs =
  Signed in as <username></username>.

Aug 27 '18 12:08 cvle

Yes, this looks familiar. I wrote a test to document the current (broken) behavior: https://github.com/projectfluent/fluent.js/blob/6e1ab3a4982ef6d5f385c77d15a16b27abc82a2e/fluent-react/test/localized_overlay_test.js#L618-L643

Do you think that using <username></username> is an acceptable work-around?

Aug 27 '18 12:08 stasm

Do you think that using is an acceptable work-around?

It's not pretty, and it goes against user expectations. The workaround is simple and functioning though, however I'd like to see this as a temporary solution only.

Aug 27 '18 23:08 cvle

I'm not fond of it either. Perhaps using DOMParser and treating translations as XML rather than HTML, would work better? I'd need to check the performance impact, too.

Aug 28 '18 14:08 stasm

More importantly, <template> creates an inert DocumentFragment, which is also good for security.

Aug 28 '18 14:08 stasm

Other solutions include using one of the available HTML-to-React parsers, like html-react-parser or react-html-parser. They add quite much to the size of fluent-react, however (20KB and 70KB respectively).

I suspect they ship with so much code to support a lot of HTML's edge-cases. There might be a limited subset of safe markup which we could write our own parser for. I'll be away for a few weeks, but I can take a closer look when I'm back.

Aug 28 '18 14:08 stasm

I'm not fond of it either. Perhaps using DOMParser and treating translations as XML rather than HTML, would work better? I'd need to check the performance impact, too.

I think that would be a nice solution.

We're using fluent-react in Firefox DevTools, and the document can be an xhtml one, and in such case custom tags are ignored:

temp =  document.createElement("template")
temp.innerHTML = "<a>test</a>|<custom>testCustom</custom>"
Array.from(temp.content.childNodes) // [ a, #text, #text ]

whereas with DOMParser it works as expected:

parser = new DOMParser()
doc = parser.parseFromString("<a>test</a>|<custom>testCustom</custom>", "text/html")
Array.from(doc.body.childNodes) // [ a, #text, custom ]

And sometimes we don't even have a document reference, which makes some of panel to fail loading when trying to update our version of fluent-react to 0.16.1

Feb 05 '21 08:02 nchevobbe

Changing the parser logic can introduce subtle regressions, the HTML fixup is probably different between document and element. And it needs a security review, as Stas mentioned.

I still like the lack of XML complexity in our DOM overlay stories.

Feb 05 '21 11:02 Pike