Incorrect handling of UTF-8 encoding during preprocessing
Example:
cppreference-doc-20250209\reference\en.cppreference.com\w\cpp\header\bit.html:4 which is the page title
- the raw dump has it correctly:
<title>Standard library header <bit> (C++20) - cppreference.com</title>
it's just UTF-8 encoded
- the html in the zip (also in the .tar.xz) has it twice encoded:
html-book-20250209.zip\reference\en\cpp\header\bit.html
<title>Standard library header <bit>Â (C++20) - cppreference.com</title>
adding \u00C2 cruft
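A minimal sketch of how the stray \u00C2 could arise. It assumes the character after <bit> in the title is a non-breaking space (U+00A0) and that the UTF-8 bytes were read as Latin-1 somewhere along the way. Neither assumption is confirmed by the dump itself, and `undo_double_encoding` is a hypothetical helper, not something from the repository:

```python
# Assumed title text: a non-breaking space (U+00A0) sits between "<bit>" and "(C++20)".
original = "Standard library header <bit>\u00a0(C++20)"

# UTF-8 encodes U+00A0 as the two bytes 0xC2 0xA0.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as Latin-1 turns each byte into its own character,
# so 0xC2 shows up as "Â" (U+00C2) in front of the non-breaking space.
mojibake = utf8_bytes.decode("latin-1")


def undo_double_encoding(text: str) -> str:
    """Reverse one round of UTF-8-read-as-Latin-1; return text unchanged if that fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repr(mojibake))
print(repr(undo_double_encoding(mojibake)))
```

The round trip is only a diagnostic here. The real fix belongs in preprocessing, so the extra encoding step never happens.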
I'd be happy to look into it.
Yep, looks like a bug. I'm assuming we need to explicitly set the source encoding somewhere during preprocessing.
https://github.com/PeterFeicht/cppreference-doc/blob/be3ce3c82f280fc5bfc09d29ceffc2a236bd90e6/commands/preprocess.py#L389
HTMLParser(encoding='utf-8')
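To illustrate why the explicit encoding matters, here is a rough sketch that uses Python's standard-library `html.parser` as a stand-in, not the parser used at the linked line in preprocess.py. `TitleExtractor`, `read_title` and the simplified sample page are all made up for this example:

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text inside <title> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def read_title(raw: bytes, encoding: str) -> str:
    """Decode the raw bytes with the given encoding, then extract the title."""
    parser = TitleExtractor()
    parser.feed(raw.decode(encoding))
    parser.close()
    return parser.title


# Simplified sample page, stored as UTF-8 bytes like the raw dump.
page = "<html><head><title>header bit\u00a0(C++20)</title></head></html>"
raw = page.encode("utf-8")

print(repr(read_title(raw, "utf-8")))    # decoded with the correct encoding
print(repr(read_title(raw, "latin-1")))  # wrong guess: an extra U+00C2 appears
```

The same bytes give a clean title or a mangled one depending only on the encoding passed in, which is why stating it explicitly to the parser should help.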
I'll post a PR shortly.
I got sucked into a slightly bigger change using BeautifulSoup, which makes things simpler and more robust. Strangely, MediaWiki renders wonky HTML with the end tags in the wrong order, e.g. <div><ul><li></li></div></ul>. But that should be a different PR and a whole new conversation.