html2text
Improve "r_unescape" regular expression to skip invalid HTML entities
Some invalid HTML entities (e.g. &#a;) are still matched by the r_unescape regular expression, which results in a ValueError when the matched reference is passed to int().
Example scenario
import html2text

html = "<html><body><input name='opt in for&#a;todoist.com&#a;new site' /><p>hihi</p><body></html>"
plaintext = html2text.html2text(html)
Error traceback:
  File "todoist/scripts/test.py", line 16, in <module>
    plaintext = html2text.html2text(html)
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 812, in html2text
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 252, in handle
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 249, in feed
  File "/usr/lib/python2.7/HTMLParser.py", line 117, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 161, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 308, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 715, in unescape
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 710, in replaceEntities
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 685, in charref
ValueError: invalid literal for int() with base 10: 'a'
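
One possible direction for a fix, sketched under assumptions rather than taken from the html2text source: tighten the pattern so that a decimal reference must contain only digits and a hex reference must carry an x/X prefix, and leave anything that cannot be decoded untouched. The names R_UNESCAPE_STRICT, _replace_entity and unescape_strict are hypothetical and are not part of html2text. The sketch uses Python 3 standard library names; on Python 2.7, as in the traceback above, the equivalents would be htmlentitydefs.name2codepoint and unichr.

```python
import re
from html.entities import name2codepoint

# Hypothetical stricter pattern (not the library's own r_unescape):
#   &#123;   decimal reference, digits only
#   &#x1F;   hex reference, requires the x/X prefix
#   &name;   named entity, a letter followed by 1-7 letters or digits
R_UNESCAPE_STRICT = re.compile(
    r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]{1,7});"
)


def _replace_entity(match):
    """Decode one matched reference, or return it unchanged if it is unknown."""
    body = match.group(1)
    try:
        if body[0] == "#":
            if body[1] in "xX":
                return chr(int(body[2:], 16))
            return chr(int(body[1:], 10))
        return chr(name2codepoint[body])
    except (KeyError, ValueError, OverflowError):
        # Unknown entity name or out-of-range code point: keep the original text.
        return match.group(0)


def unescape_strict(text):
    """Replace well-formed HTML entities and skip malformed ones such as '&#a;'."""
    return R_UNESCAPE_STRICT.sub(_replace_entity, text)
```

With this pattern the attribute value from the example scenario, 'opt in for&#a;todoist.com&#a;new site', contains no match at all, so int() is never called on 'a'. The try/except is a second line of defence for references that match the pattern but still cannot be decoded.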