Make attribute value and text content escaping more conforming
Fixes #216 Supersedes and closes #217
The HTML spec defines which characters to escape when serializing HTML. This got changed recently to also serialize < and > in attribute values.
This aligns with the DOM Parsing and Serialization spec which defines its own rule for escaping attribute values and for escaping text nodes, both requiring these two characters to always be escaped.
HTML also requires non-breaking spaces to be escaped (in HTML). Also browsers de facto serialize tabs, line feeds and carriage returns in attribute values in XML (there is no spec for this, but there is a web platform test). JSDOM does this too.
This PR implements those requirements.
This PR also brings another fix about <title> elements which used to be considered a special case when serializing in this codebase, while i couldn't find anything in the spec about this (the <title> element is not in this list). Instead, we customize its innerHTML setter to escape everything, which prevents child elements to be added to them through this property (for HTML documents only).
I think this is a breaking change … and title doesn’t need escaping or it results as title escaped … if you’re after JSDOM compliance I think you should stop trying to make LinkeDOM JSDOM compliant but if you need for real world, not testing related reasons, these changes, please be careful with suggested changes because this library is not just a mock target, it can render in workers and edge cloud scenarios
to clarify, this is a perfectly valid document.title every browser will understand:
<title>1 <and>&</and> 2</title>
it's a special node that acts like style or script (a data node) and I am not sure it needs fixing as it is right now ... true that if html-escape should consider non breaking spaces too I need to fix this library too and these changes will be welcome but it's weird nobody ever complained to date about that library not catching the explicitly which is why I think this might be a breaking thing in here ... do you have concrete examples where the current state breaks the Web? Thank you!
My general use case for the three PRs i have opened is actually the same one. I'm doing SSR of XHTML and SVG documents and until now i was indeed using JSDOM for this. However i'm starting to face performance issues, as JSDOM implements many things i don't need. Basically all i need is parsing, basic DOM manipulation (setting attributes and adding nodes) and serializing the whole thing, and LinkeDOM seems to fit my needs. Thanks for making it!
The data i generate is versioned though, so while this not about testing, i indeed want to minimize the diffs, and together, the PRs i have opened achieve avoiding having any diff on my data.
That said, the particular issues i'm trying to fix with this PR are:
-
generating malformed XHTML documents that browsers can't open, due to
<and>not being escaped in attribute values. -
attribute values containing tabs and newlines not eventually roundtripping, resulting in a data loss. XML parsers in browsers normalize unescaped tabs and line feeds as spaces. I'm displaying attributes with CSS, so i want to keep these characters and that requires escaping them, as specced.
-
some XSS issues with the current
<title>implementation, that i realized while looking at this test:const {document} = parseHTML("<html><head><title>Title</title></head></html>"); document.title = "</title><script>alert(1)</script>"; // or document.querySelector("title").textContent = "</title><script>alert(1)</script>"; // or document.querySelector("title").innerHTML = "</title><script>alert(1)</script>"; document.toString(); // <html><head><title></title><script>alert(1)</script></title></head></html>This PR fixes the three XSS issues above by:
- not overriding
titleElement.toString() - not delegating
titleElement.innerHTMLtotitleElement.textContent - escaping values provided to
titleElement.innerHTML's setter
That last item is precisely the special-case you're talking about in the HTML parser: most of the time you don't have to (double) escape the title content yourself as the engine (LinkeDOM here) is supposed to do it for you. Should i add this example as testcases?
- not overriding
While looking at the spec, i also fixed other escaping cases i encountered. is just one of these so I don't have any use case for it to be honest (as i'm focusing on XHTML rather than on HTML).
JSDOM implements many things i don't need
this is the key and reason this project is fast ... so let's talk about this: why would you need attributes insertion order, as example or why should I add complexity for things nobody practically cares about?
I am afraid I'll be a push back for anything not really interesting, relevant, solving anything useful and so on ... there is a bug? PR welcome ... there is a not 100% strict behavior nobody cares about or use in the real world? PR not welcome as I don't care maintaining those things and neither should you.
I think there are different things to consider here:
- XSS on the server feels like FUD to me ... you produce the output and if you do that after your client side users pass arbitrary strings to produce such output you have way more issues than yourself creating XSS to your own site ... it's like stating that if you have secrets in your server these might leak ... well, of course if you don't know what library you are using, otherwise that's just not how XSS works as XSS requires foreign unknown embedding of code, here, on the backend, there's not such thing unless you implement that
- attributes should be escaped or not breaking ... this is the only fix worth it to me
- title might sanitizaiton but I need tests that confirm current scenario won't break
So if we could simplify and split the MR in different precise fixes with tests, that'd be awesome ... otherwise it'll stale