Rendered output with auto linking enabled produces Unicode surrogates which cannot be encoded
Describe the bug
When using auto linking (linkify), the generated output may contain Unicode surrogates, which cannot be encoded in Python 3 without further action.
AFAIK, encode became stricter between Python 2 and Python 3, in accordance with the Unicode spec (see Programming with Unicode, for example).
Reproduce the bug
For the following input
>>> markdown = 'https://host.invalid/%F0%9F%91%A9'
I expected the rendered output to be
>>> from markdown_it import MarkdownIt
>>> html_renderer = MarkdownIt(config="gfm-like", options_update={'breaks': True, 'html': False})
>>> html = html_renderer.render(markdown)
>>> html
'<p><a href="https://host.invalid/%F0%9F%91%A9">https://host.invalid/👩</a></p>\n'
instead I got
>>> html
'<p><a href="https://host.invalid/%F0%9F%91%A9">https://host.invalid/\ud83d\udc69</a></p>\n'
where \ud83d\udc69 is a Unicode surrogate pair. Surrogates cannot be encoded in Python 3, so I got this error:
>>> html.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 68-69: surrogates not allowed
List your environment
Python 3.9, markdown-it-py 2.1.0
As a workaround, one can use the surrogatepass error handler to convert the surrogates into regular Unicode code points:
>>> html = html.encode("utf-16", errors="surrogatepass").decode("utf-16")
>>> html
'<p><a href="https://host.invalid/%F0%9F%91%A9">https://host.invalid/👩</a></p>\n'
But, in my opinion, the rendered output should not contain any surrogates in the first place.
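For anyone who needs the workaround in more than one place, here is a minimal sketch that wraps it in a helper. The name replace_surrogate_pairs is hypothetical and not part of markdown-it-py or mdurl, and the sketch only handles well-formed surrogate pairs; unpaired surrogates are not handled.

```python
def replace_surrogate_pairs(text: str) -> str:
    """Hypothetical helper: merge UTF-16 surrogate pairs into real code points.

    Round-trips the string through UTF-16. The "surrogatepass" handler lets the
    surrogates through on encode, and the decoder then reads each pair as one
    code point. Unpaired surrogates are not handled by this sketch.
    """
    return text.encode("utf-16", errors="surrogatepass").decode("utf-16")


# The rendered output from the report above, including the surrogate pair.
rendered = (
    '<p><a href="https://host.invalid/%F0%9F%91%A9">'
    "https://host.invalid/\ud83d\udc69</a></p>\n"
)

cleaned = replace_surrogate_pairs(rendered)
encoded = cleaned.encode("utf-8")  # no longer raises UnicodeEncodeError
```

Strings without surrogates pass through unchanged, so applying the helper to every rendered document should be safe, at the cost of an extra encode/decode round trip.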
I think I tracked it down to https://github.com/executablebooks/mdurl/blob/a0f259c699eb2f75b7df290ed1e731f9b27ee171/src/mdurl/_decode.py#L33, which is used here: https://github.com/executablebooks/markdown-it-py/blob/7e677c4e7b4573eaf406a13882f3fee4b19b97f4/markdown_it/common/normalize_url.py#L40
It decodes a percent-encoded string into Unicode with surrogates, which is correct for JavaScript (which, to my knowledge, still uses UTF-16 internally) but harmful for Python.
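To illustrate the difference, here is a small sketch using only the standard library. It is not mdurl's code; it just contrasts decoding the percent-encoded bytes as UTF-8 (which gives a single code point) with the UTF-16 style arithmetic that yields the surrogate pair seen above. All variable names are made up for the example.

```python
from urllib.parse import unquote

encoded = "%F0%9F%91%A9"

# Python-friendly decoding: the four percent-encoded bytes are interpreted
# as UTF-8 and become one code point, U+1F469.
decoded = unquote(encoded)

# JavaScript-style representation: a code point above U+FFFF is split into
# a high and a low UTF-16 surrogate.
offset = ord(decoded) - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)
surrogate_pair = chr(high) + chr(low)

# The single code point can be encoded as UTF-8; the surrogate pair cannot.
utf8_bytes = decoded.encode("utf-8")
try:
    surrogate_pair.encode("utf-8")
    pair_is_encodable = True
except UnicodeEncodeError:
    pair_is_encodable = False
```

In a Python str, the surrogate pair is two separate code points rather than one character, which is why the strict UTF-8 encoder rejects it.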
Thanks for the issue, and for tracking it down to normalizeLinkText!
This should now be fixed with mdurl==0.1.2.
@hukkin Thank you for the fix!