Add support in regular expressions for UTF-8 whitespace detection
We hit a nasty bug at Braze: a customer supplied a Liquid template containing non-breaking space characters (U+00A0), and it took a very long time to work out why it was not parsing correctly. The current regular expressions do not count the non-breaking space as whitespace: in Ruby, \s matches only ASCII whitespace, while [[:space:]] matches both ASCII and Unicode whitespace characters.
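The difference is easy to reproduce in plain Ruby. The following is a minimal sketch, not Liquid's actual parsing code: the two patterns are simplified stand-ins for a tag-matching expression, and the template string is made up for illustration.

```ruby
# U+00A0 NO-BREAK SPACE, written as an escape so it is visible in the source.
nbsp = "\u00A0"

# Hypothetical template text: the "space" after `{{` is a no-break space.
template = "{{" + nbsp + "name }}"

# Simplified stand-ins for a variable-tag pattern (assumptions, not Liquid's own).
ascii_only    = /\{\{\s*(\w+)\s*\}\}/
unicode_aware = /\{\{[[:space:]]*(\w+)[[:space:]]*\}\}/

ascii_match   = ascii_only.match(template)    # \s does not cover U+00A0
unicode_match = unicode_aware.match(template) # [[:space:]] does

puts ascii_match.inspect   # nil
puts unicode_match[1]      # name
```

Because the no-break space renders like an ordinary space in most editors, the failing template and a working one look identical on screen, which is what made the original bug so slow to track down.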
I replaced \s with [[:space:]] everywhere, but added only a single test case, which red-greens against the existing code (it fails before the change and passes after). Full coverage of every replaced expression seemed excessive, although I'm open to writing more thorough tests if they're needed before merging.
Co-authored-by: Chris Watkins [email protected]
I have signed the CLA!
It may also be smart to replace \w with [[:word:]] so that non-ASCII word characters work properly as well. However, I would imagine those are easier to spot visually and are less likely to be used by accident.
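For reference, here is a small sketch of how the two word classes differ in Ruby. The sample string is an arbitrary illustration and is not taken from Liquid's code or tests.

```ruby
# "café" with a precomposed e-acute (U+00E9), written as an escape for clarity.
name = "caf\u00E9"

ascii_word   = name[/\w+/]          # \w is [a-zA-Z0-9_], so it stops before U+00E9
unicode_word = name[/[[:word:]]+/]  # [[:word:]] also accepts Unicode letters

puts ascii_word    # caf
puts unicode_word  # café
```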