
Incorrect handling of UTF-8 encoding during preprocessing

Open refack opened this issue 9 months ago • 2 comments

Example:

cppreference-doc-20250209\reference\en.cppreference.com\w\cpp\header\bit.html:4, which is the page title.

  • The raw dump has it correct: <title>Standard library header &lt;bit> (C++20) - cppreference.com</title>. It's just UTF-8 encoded.
  • The HTML in the zip (also in the .tar.xz) has it encoded twice: html-book-20250209.zip\reference\en\cpp\header\bit.html has <title>Standard library header &lt;bit&gt; (C++20) - cppreference.com</title>, which adds \u00C2 cruft.
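
For anyone trying to reproduce this, the double encoding can be sketched in a few lines of Python. This is only a guess at the mechanism, not taken from preprocess.py: it assumes the affected character is a non-breaking space (U+00A0) and that the UTF-8 bytes get mis-decoded as Latin-1 somewhere along the way. The sample string is made up.

```python
# Sketch of the suspected mechanism; the sample title is illustrative only.
NBSP = "\u00a0"  # assumption: the affected character is a non-breaking space
title = f"header{NBSP}<bit>"

# Correct on-disk form: U+00A0 becomes the two bytes C2 A0 in UTF-8.
utf8_bytes = title.encode("utf-8")

# Wrong step: treating those bytes as Latin-1 turns each byte into its own
# character, so C2 A0 becomes U+00C2 followed by U+00A0.
misread = utf8_bytes.decode("latin-1")

# Writing the misread text back out as UTF-8 gives the twice-encoded form.
double_encoded = misread.encode("utf-8")

print(repr(misread))
print(double_encoded)

# Decoding with the right source encoding avoids the problem entirely.
assert utf8_bytes.decode("utf-8") == title

# Already damaged text can be repaired by reversing the wrong step.
repaired = misread.encode("latin-1").decode("utf-8")
assert repaired == title
```

If the assumption holds, the stray \u00C2 is simply the first byte of the UTF-8 sequence showing up as its own character.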

I'd be happy to look into it.

refack, Apr 20 '25 14:04

Yep, looks like a bug. I'm assuming we need to explicitly set the source encoding somewhere during preprocessing.

PeterFeicht, Apr 25 '25 06:04

https://github.com/PeterFeicht/cppreference-doc/blob/be3ce3c82f280fc5bfc09d29ceffc2a236bd90e6/commands/preprocess.py#L389

HTMLParser(encoding='utf-8')

I'll post a PR shortly. I got sucked into a slightly bigger change using BeautifulSoup, which makes things simpler and more robust. Strangely, MediaWiki renders wonky HTML with end tags in the wrong order, e.g. <div><ul><li></li></div></ul>. But that should be a different PR and a whole new conversation.
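
To illustrate the end-tag ordering problem, here is a minimal sketch that flags end tags which do not close the innermost open element. It is not the BeautifulSoup change mentioned above, just a standard-library stand-in using `html.parser`. The names `EndTagChecker` and `find_mismatched_end_tags` are hypothetical, and the void-element list is abbreviated.

```python
from html.parser import HTMLParser

# Elements that never take an end tag (abbreviated list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class EndTagChecker(HTMLParser):
    """Hypothetical helper: records end tags that arrive out of order."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.mismatches = []  # (end tag seen, tag that was expected or None)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> open and close nothing.
        pass

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            return
        expected = self.open_tags[-1] if self.open_tags else None
        self.mismatches.append((tag, expected))
        # Recover by unwinding to the nearest matching open tag, if any.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def find_mismatched_end_tags(html):
    checker = EndTagChecker()
    checker.feed(html)
    checker.close()
    return checker.mismatches


print(find_mismatched_end_tags("<div><ul><li></li></div></ul>"))
```

On the snippet from the comment, the checker reports `</div>` arriving while `<ul>` is still open, and then `</ul>` with nothing left to close.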

refack, Apr 25 '25 14:04