Add charset info to the clean html
Thank you for keeping up the project!
I use readability to extract the article and then save it as HTML. Today I ran into a problem where Chrome didn't display some Unicode characters correctly (the .html file was saved as UTF-8). It turned out that adding the following line to the cleaned HTML solved it:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Maybe adding this info could be considered as default behavior for get_clean_html().
Thank you very much!
Hi rsuhada,
How do you access that get_clean_html() method?
The Document class usually attempts to guess the document encoding... But it might have failed for you. Then you need to specify it manually, with subclassing or using the encoding field.
I'm not sure how to implement encoding guessing transparently for those who need it, and omit it for ones who know their document encoding... Any ideas?
Hi buriy,
Here is what I do:
import codecs
import requests
from readability import Document

res = requests.get(url)
article = Document(res.text)
article_clean_html = article.get_clean_html()
with codecs.open("test_clean.html", encoding="utf-8", mode="w") as f:
    f.write(article_clean_html)
When I open test_clean.html, I see garbled Unicode characters. Everything is fine if I include the charset info in the HTML:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The encoding can be taken from res.encoding on the response returned by requests.get().
Alternatively, is this actually the correct way to use the readability package and save its output?
Thank you!
Ok, got you now.
You can do:
with codecs.open("test_clean.html", encoding="utf-8", mode="w") as f:
    f.write('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />')
    f.write(article_clean_html)
But I agree that this could be an enhancement that the library could make, if it adds a tag to the HTML.
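If you need this in more than one place, the workaround could be wrapped in a small helper. This is only a sketch: add_charset_meta and save_clean_html are hypothetical names, not part of readability, and it assumes the cleaned HTML is already a str (for example the result of article.get_clean_html()).

```python
import codecs


def add_charset_meta(clean_html, encoding="utf-8"):
    """Prepend a Content-Type meta tag declaring `encoding`.

    Hypothetical helper, not part of the readability package.
    """
    meta = '<meta http-equiv="Content-Type" content="text/html; charset={}" />'.format(encoding)
    return meta + "\n" + clean_html


def save_clean_html(path, clean_html, encoding="utf-8"):
    """Write the HTML to `path`, declaring the same encoding the file is saved in."""
    with codecs.open(path, encoding=encoding, mode="w") as f:
        f.write(add_charset_meta(clean_html, encoding))
```

Usage would be save_clean_html("test_clean.html", article_clean_html). The declared charset and the file encoding come from the same argument, so they cannot drift apart.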