Control HTML document Unicode decoding
Requested feature
Sometimes upstream tools, e.g., OCR libraries, output incorrectly encoded HTML. Because of visual similarity, an undesired and incorrect character such as https://www.compart.com/en/unicode/U+E157 (a Private Use Area code point) is encoded instead of https://www.compart.com/en/unicode/U+2630. Currently, when Docling parses an HTML document containing such a character, it (or rather, BeautifulSoup) escapes these characters. For example, this heading item:
<h2 id="contents">Contents<a class="headerlink" href="#contents" title="Permanent link"></a></h2>
ends up with the .text value:
'Contents\ue157'
I have not found a straightforward way to control this behavior from within Docling or BeautifulSoup.
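For reference, a minimal reproduction of the behavior with BeautifulSoup alone, assuming the icon character sits inside the anchor element (it does not render visibly above) and assuming the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Heading from the report; the U+E157 icon character is placed inside the anchor.
html = (
    '<h2 id="contents">Contents'
    '<a class="headerlink" href="#contents" title="Permanent link">\ue157</a></h2>'
)

soup = BeautifulSoup(html, "html.parser")
heading = soup.find("h2")

# The Private Use Area code point passes straight through into the extracted text.
print(repr(heading.get_text()))  # 'Contents\ue157'
```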
Alternatives
I have not found a robust and direct method to process these escapes from within Python. String substitution tricks are possible but at a performance cost.
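To illustrate the kind of substitution trick meant here (a post-processing sketch, not Docling functionality; the mapping is just the example pair from this issue):

```python
# Hypothetical remapping table: replace known-bad code points with the intended ones.
REPLACEMENTS = str.maketrans({"\ue157": "\u2630"})

def fix_text(text: str) -> str:
    # str.translate scans every character, so this runs on all extracted text.
    return text.translate(REPLACEMENTS)

print(fix_text("Contents\ue157"))  # 'Contents☰'
```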
@sanmai-NL I am not entirely sure what your request is. The escaped Unicode in the string representation will actually print as a symbol, such as in:
> s = 'Contents\ue157'
> print(s)
Contents
How it prints depends on the interpreter.
It's a character we don't want. It's a data quality issue.
We have a custom cleanup function now that filters based on Unicode General Category. This character makes no sense in document text. To reiterate, what we request is a way to control which characters end up in Docling text nodes.
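A minimal sketch of such a cleanup function, assuming the goal is to drop characters in the "other" General Categories (the exact category set to filter is an assumption):

```python
import unicodedata

# Characters whose Unicode General Category is private-use (Co), surrogate (Cs)
# or unassigned (Cn) rarely belong in extracted document text.
DISALLOWED_CATEGORIES = {"Co", "Cs", "Cn"}

def clean(text: str) -> str:
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) not in DISALLOWED_CATEGORIES
    )

print(repr(clean("Contents\ue157")))  # 'Contents'
```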
I'm a bit confused by the conversation in this issue. I see two possible interpretations:
1. The character read by the HTML backend is wrong, i.e., "https://www.compart.com/en/unicode/U+E157 is encoded, instead of https://www.compart.com/en/unicode/U+2630".
2. There seems to be a wish to skip those "anchor / permalink / icon" HTML components.
If we are talking about 1, I think it could be an encoding bug (to be verified). If we are talking about 2, it would require a design for custom logic in parsing HTML pages, which, unfortunately, seems very specific to the actual page.
It's about 1.
I have a similar issue where the exported Markdown includes encoded symbols such as & and /ff instead of & and ff. Is there a way to toggle the decoding of HTML-like symbols?
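One possible workaround sketch, assuming the goal is simply to decode leftover HTML character references in the exported text, is a post-processing pass with Python's standard library (the example strings are hypothetical):

```python
import html

exported = "Fish &amp; chips, e&#xFB00;ort"  # hypothetical exported Markdown fragment

# html.unescape decodes named and numeric character references back to plain characters.
print(html.unescape(exported))  # 'Fish & chips, eﬀort'
```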