cppreference-doc Incorrect handling of UTF-8 encoding during preprocessing

Example:

cppreference-doc-20250209\reference\en.cppreference.com\w\cpp\header\bit.html:4 which is the page title

the raw dump has it correctly: <title>Standard library header <bit> (C++20) - cppreference.com</title> it's just UTF-8 encoded
the HTML in the zip (and in the .tar.xz) has it double-encoded: html-book-20250209.zip\reference\en\cpp\header\bit.html <title>Standard library header <bit>Â (C++20) - cppreference.com</title> which adds stray \u00C2 cruft
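
For anyone who wants to reproduce the symptom without the archive, here is a minimal sketch of how a stray \u00C2 can appear. It assumes the character before "(C++20)" is a non-breaking space (U+00A0) and that the UTF-8 bytes are misread as Latin-1 somewhere along the way. Both are assumptions for illustration, not confirmed from the preprocessing code.

```python
# Assumption: the title contains a non-breaking space (U+00A0) before "(C++20)".
title = "Standard library header <bit>\u00a0(C++20) - cppreference.com"

# Correct UTF-8 encoding: the NBSP becomes the two bytes C2 A0.
utf8_bytes = title.encode("utf-8")

# Misreading those bytes as Latin-1 maps the lead byte 0xC2 to "Â" (U+00C2)
# and leaves 0xA0 as a non-breaking space.
misread = utf8_bytes.decode("latin-1")

# Writing the misread text back out as UTF-8 is the second encoding step.
double_encoded = misread.encode("utf-8")

print(repr(misread))

# The damage is reversible as long as no bytes were dropped on the way.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == title)
```

The round trip at the end only shows that the output is the input encoded twice; the real fix belongs in the preprocessing step, not in a repair pass afterwards.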

I'd be happy to look into it.

Apr 20 '25 14:04 refack

Yep, looks like a bug. I'm assuming we need to explicitly set the source encoding somewhere during preprocessing.

Apr 25 '25 06:04 PeterFeicht

https://github.com/PeterFeicht/cppreference-doc/blob/be3ce3c82f280fc5bfc09d29ceffc2a236bd90e6/commands/preprocess.py#L389

HTMLParser(encoding='utf-8')
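
A rough sketch of what that change does, assuming the parser in question is lxml's etree.HTMLParser. The sample bytes and variable names below are made up for illustration; the only point is that an explicit encoding tells the parser to read the input as UTF-8 even when the document declares no charset.

```python
from lxml import etree  # third-party; assumed to be the parser library in use

# Hypothetical sample input: UTF-8 bytes with a non-breaking space (C2 A0)
# before "(C++20)" and no <meta charset> declaration.
raw = (b"<html><head><title>Standard library header &lt;bit&gt;\xc2\xa0(C++20)"
       b" - cppreference.com</title></head><body></body></html>")

# The explicit encoding overrides whatever the parser would otherwise guess.
parser = etree.HTMLParser(encoding="utf-8")
root = etree.fromstring(raw, parser)

title = root.findtext(".//title")
print(repr(title))
```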

I'll post a PR shortly. I got sucked into a slightly bigger change using BeautifulSoup, which makes things simpler and more robust. Strangely, MediaWiki renders wonky HTML that mixes up the order of end tags, e.g. <div><ul><li></li></div></ul>. But that should be a different PR and a whole new conversation.

Apr 25 '25 14:04 refack