markdown-it-py: Rendered output with auto linking enabled produces Unicode surrogates which cannot be encoded

Describe the bug

When using auto linking (linkify), the generated output may contain Unicode surrogates, which cannot be encoded in Python 3 without further action.

AFAIK, encode became stricter from Python 2 to Python 3, which is in accordance with the Unicode spec (see Programming with Unicode, for example).

Reproduce the bug

For the following input

>>> markdown = 'https://host.invalid/%F0%9F%91%A9'

I expected the rendered output to be

>>> from markdown_it import MarkdownIt
>>> html_renderer = MarkdownIt(config="gfm-like", options_update={'breaks': True, 'html': False})
>>> html = html_renderer.render(markdown)
>>> html
'<p><a href="https://host.invalid/%F0%9F%91%A9">https://host.invalid/👩</a></p>\n'

instead I got

>>> html
'<p><a href="https://host.invalid/%F0%9F%91%A9">https://host.invalid/\ud83d\udc69</a></p>\n'

where \ud83d\udc69 is a Unicode surrogate pair. Python 3's UTF-8 codec refuses to encode surrogates by default, so I got this error:

>>> html.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 68-69: surrogates not allowed
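
To check a rendered string before encoding it, the surrogate code points can be located directly. This is only a minimal sketch: `find_surrogates` is a hypothetical helper name, not part of markdown-it-py or mdurl.

```python
def find_surrogates(text: str) -> list[int]:
    """Return the indices of all surrogate code points (U+D800..U+DFFF) in `text`."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


# The broken output from above, written out as a literal.
html = '<p><a href="https://host.invalid/%F0%9F%91%A9">https://host.invalid/\ud83d\udc69</a></p>\n'
print(find_surrogates(html))  # [68, 69], the positions named in the traceback
```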

List your environment

Python 3.9, markdown-it-py 2.1.0

Aug 10 '22 07:08 mib112

As a workaround, one can use the surrogatepass error handler to convert the surrogate pairs into regular Unicode code points:

>>> html = html.encode("utf-16", errors="surrogatepass").decode("utf-16")
>>> html
'<p><a href="https://host.invalid/%F0%9F%91%A9">https://host.invalid/👩</a></p>\n'

But, in my opinion, the rendered output should not contain any surrogates in the first place.
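
The workaround can be wrapped in a small function until a fixed release is installed. This is a sketch under the assumption that the surrogates always come in valid high/low pairs; `fix_surrogates` is a hypothetical name chosen for illustration.

```python
def fix_surrogates(text: str) -> str:
    """Merge UTF-16 surrogate pairs in `text` into the code points they encode."""
    # surrogatepass lets the UTF-16 encoder write each surrogate as a raw
    # 16-bit code unit; the strict decoder then reads every high/low pair
    # back as a single supplementary-plane character.
    return text.encode("utf-16", errors="surrogatepass").decode("utf-16")


broken = "https://host.invalid/\ud83d\udc69"
fixed = fix_surrogates(broken)
print(fixed.encode("utf-8"))  # b'https://host.invalid/\xf0\x9f\x91\xa9'
```

Strings without surrogates pass through unchanged. An unpaired surrogate still fails in the decode step, so this only repairs well-formed pairs.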

Aug 10 '22 07:08 mib112

I think I tracked it down to https://github.com/executablebooks/mdurl/blob/a0f259c699eb2f75b7df290ed1e731f9b27ee171/src/mdurl/_decode.py#L33, which is used here: https://github.com/executablebooks/markdown-it-py/blob/7e677c4e7b4573eaf406a13882f3fee4b19b97f4/markdown_it/common/normalize_url.py#L40

That code decodes a percent-encoded string into Unicode with surrogates, which is correct for JavaScript (which, to my knowledge, still uses UTF-16 internally), but harmful for Python.
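
The difference between the two decodings can be illustrated with the standard library alone. This sketch does not reproduce the mdurl code; `to_utf16_units` is a hypothetical helper that mimics how a UTF-16 string stores a code point above U+FFFF.

```python
from urllib.parse import unquote

encoded = "%F0%9F%91%A9"

# Python-friendly decoding: percent-decode to bytes, then decode as UTF-8.
# The four bytes F0 9F 91 A9 become the single code point U+1F469.
decoded = unquote(encoded)
assert decoded == "\U0001F469" and len(decoded) == 1


def to_utf16_units(code_point: int) -> str:
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return chr(high) + chr(low)


# JavaScript-style result: two code units, which Python treats as two
# separate (and unencodable) surrogate code points.
js_style = to_utf16_units(ord(decoded))
assert js_style == "\ud83d\udc69" and len(js_style) == 2
```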

Aug 10 '22 07:08 mib112

Thanks for the issue, and for tracking it down to normalizeLinkText!

This should now be fixed with mdurl==0.1.2.

Aug 14 '22 12:08 hukkin

@hukkin Thank you for the fix!

Aug 29 '22 08:08 mib112