
Rendered output with auto linking enabled produces Unicode surrogates which cannot be encoded

Open mib112 opened this issue 3 years ago • 2 comments

Describe the bug

When using auto linking (linkify), the generated output may contain Unicode surrogates, which cannot be encoded in Python 3 without further action.

AFAIK, encode became stricter between Python 2 and Python 3, in accordance with the Unicode spec (see Programming with Unicode, for example).

Reproduce the bug

For the following input

>>> markdown = 'https://host.invalid/%F0%9F%91%A9'

I expected the rendered output to be

>>> from markdown_it import MarkdownIt
>>> html_renderer = MarkdownIt(config="gfm-like", options_update={'breaks': True, 'html': False})
>>> html = html_renderer.render(markdown)
>>> html
'<p><a href="https://host.invalid/%F0%9F%91%A9">https://host.invalid/👩</a></p>\n'

instead I got

>>> html
'<p><a href="https://host.invalid/%F0%9F%91%A9">https://host.invalid/\ud83d\udc69</a></p>\n'

where \ud83d\udc69 is a Unicode surrogate pair. Surrogates cannot be encoded in Python 3, so I got this error:

>>> html.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 68-69: surrogates not allowed

List your environment

Python 3.9
markdown-it-py 2.1.0

mib112, Aug 10 '22 07:08

As a workaround, one can use the surrogatepass error handler to convert each surrogate pair into a regular Unicode code point:

>>> html = html.encode("utf-16", errors="surrogatepass").decode("utf-16")
>>> html
'<p><a href="https://host.invalid/%F0%9F%91%A9">https://host.invalid/👩</a></p>\n'

But, in my opinion, the rendered output should not contain any surrogates in the first place.
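For anyone who needs the workaround in more than one place, here is a minimal sketch that wraps the round trip above in a helper and skips it when the string has no surrogates. The names `merge_surrogate_pairs` and `_SURROGATE_RE` are hypothetical, not part of markdown-it-py or mdurl.

```python
import re

# Matches any UTF-16 surrogate code point (U+D800 to U+DFFF).
_SURROGATE_RE = re.compile("[\ud800-\udfff]")


def merge_surrogate_pairs(text: str) -> str:
    """Combine UTF-16 surrogate pairs in `text` into single code points.

    Hypothetical helper: it applies the same encode/decode round trip as the
    workaround, but only when a surrogate is actually present.
    """
    if not _SURROGATE_RE.search(text):
        return text
    # surrogatepass lets the surrogates through the encoder; the UTF-16
    # decoder then reads each high/low pair as one code point.
    return text.encode("utf-16", errors="surrogatepass").decode("utf-16")
```

After this, `merge_surrogate_pairs(html).encode("utf-8")` should succeed for the output shown above. A string containing an unpaired surrogate would still raise at the decode step, so this only covers well-formed pairs.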

mib112, Aug 10 '22 07:08

I think I tracked it down to https://github.com/executablebooks/mdurl/blob/a0f259c699eb2f75b7df290ed1e731f9b27ee171/src/mdurl/_decode.py#L33, which is used here: https://github.com/executablebooks/markdown-it-py/blob/7e677c4e7b4573eaf406a13882f3fee4b19b97f4/markdown_it/common/normalize_url.py#L40

It decodes a percent-encoded string into Unicode with surrogates. That is correct for JavaScript (which, to my knowledge, still uses UTF-16 internally) but harmful for Python.
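To illustrate the difference, here is a rough sketch contrasting the two results. It is not the mdurl code: `percent_decode_utf8` is a hypothetical wrapper around the standard library's `urllib.parse.unquote`, and `utf16_surrogate_pair` is a hypothetical function that applies the standard UTF-16 surrogate arithmetic a JavaScript-style decoder would produce.

```python
from urllib.parse import unquote


def percent_decode_utf8(url: str) -> str:
    """Decode percent-escapes as UTF-8 bytes, giving real code points."""
    return unquote(url, encoding="utf-8")


def utf16_surrogate_pair(code_point: int) -> str:
    """Split a code point above U+FFFF into a UTF-16 high/low surrogate pair.

    Illustration only: this is the representation a JavaScript string uses,
    and it is what ends up unencodable in a Python str.
    """
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return chr(high) + chr(low)


decoded = percent_decode_utf8("https://host.invalid/%F0%9F%91%A9")
js_style = utf16_surrogate_pair(0x1F469)
```

Here `decoded` ends in the single code point U+1F469 and encodes to UTF-8 without trouble, while `js_style` is the two-element string `'\ud83d\udc69'` from the report.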

mib112, Aug 10 '22 07:08

Thanks for the issue, and for tracking it down to normalizeLinkText!

This should now be fixed with mdurl==0.1.2.

hukkin, Aug 14 '22 12:08

@hukkin Thank you for the fix!

mib112, Aug 29 '22 08:08