Fix: handle non-UTF-8 input in util.py
Chemistry papers such as DOIs 10.26434/chemrxiv-2025-1x058-v2 and 10.26434/chemrxiv-2025-3lwn2 contain files that sometimes throw errors due to non-UTF-8 characters during parsing. This change updates the MD5 hashing line to replace problematic characters during UTF-8 encoding, making the parsing process more robust and preventing interruptions caused by encoding errors.
Closes https://github.com/Future-House/paper-qa/issues/1068
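For context, the change amounts to encoding with `errors="replace"` before hashing. A minimal sketch (the `hexdigest` name follows `paperqa.utils`, but the body here is illustrative, not the exact source):

```python
from hashlib import md5


def hexdigest(data: str) -> str:
    # errors="replace" substitutes un-encodable characters (e.g. unpaired
    # surrogates) with "?" instead of raising UnicodeEncodeError
    return md5(data.encode("utf-8", errors="replace")).hexdigest()


# A string containing an unpaired low surrogate no longer raises:
hexdigest("abc\udcff")
```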
@dmcgrath19 - why would it be necessary to replace the characters? Shouldn't we just ignore? This code only computes the hash - it has no effect on the actual usage of the parsed paper.
@whitead My reasoning for replacing rather than ignoring invalid characters is that the hash serves as a unique identifier. If invalid characters are ignored and dropped silently, different inputs with subtle encoding differences could produce identical hashes, causing issues with deduplication and caching. By replacing invalid characters, we ensure the hash reflects all differences in the input data, preserving uniqueness and making encoding problems visible.
TLDR: it's more robust but could be seen as a bit messier.
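To illustrate the difference (a stdlib-only sketch, not paper-qa code): with `errors="ignore"` an invalid character is silently dropped, so a string containing one hashes identically to the same string without it, whereas `errors="replace"` keeps the two inputs distinct:

```python
from hashlib import md5

clean = "abc"
dirty = "abc\udcff"  # same text plus an unpaired low surrogate


def digest(text: str, errors: str) -> str:
    return md5(text.encode("utf-8", errors=errors)).hexdigest()


# "ignore" drops the surrogate entirely, so the two inputs collide:
assert digest(clean, "ignore") == digest(dirty, "ignore")

# "replace" maps it to "?", so the presence of bad data stays visible:
assert digest(clean, "replace") != digest(dirty, "replace")
```

(One caveat: every un-encodable character maps to the same `?`, so two inputs differing only in *which* invalid character they contain would still collide; `replace` distinguishes presence from absence, not identity.)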
Hi @dmcgrath19, I was trying to repro this issue so I can make a unit test. The below is not hitting the UTF-8 errors, what am I missing?
```python
import pytest

from paperqa import Docs
from paperqa.agents import SearchIndex


@pytest.mark.asyncio
async def test_pdf() -> None:
    docs = Docs()
    docname = await docs.aadd_url(
        # URL from the download PDF button from:
        # - 10.26434/chemrxiv-2025-3lwn2: https://chemrxiv.org/engage/chemrxiv/article-details/687f4677728bf9025ea3067a
        # - 10.26434/chemrxiv-2025-1x058-v2: https://chemrxiv.org/engage/chemrxiv/article-details/6835abf53ba0887c33410d6d
        "..."
    )
    (doc_details,) = (d for d in docs.docs.values() if d.docname == docname)
    index_doc = {
        "title": doc_details.title,
        "file_location": "all-atom",
        "body": "".join(t.text for t in docs.texts if t.doc == doc_details),
    }
    search_index = SearchIndex(fields=[*SearchIndex.REQUIRED_FIELDS, "title"])
    await search_index.add_document(index_doc, document=docs)
```
Thanks for digging into this.
In my case, I batch-downloaded some PDFs from ChemRxiv into a local folder and pointed paperqa at that directory. Not all of the files triggered the error. When I inspected the local copies, they didn’t appear corrupted or mis-encoded, so my guess is that the issue could be related to how some of the PDFs are encoded by ChemRxiv during batch-downloading.
Ah that makes sense, thanks for clarifying that.
Did you actually test this change out yet? I made the below unit test for a bad PDF in test_agents.py:
```python
import shutil
import tempfile
from pathlib import Path

import pytest

from paperqa import DocDetails, Docs
from paperqa.agents import SearchIndex


@pytest.mark.asyncio
async def test_search_corrupt_pdf(stub_data_dir: Path) -> None:
    """Test that a slightly flawed PDF can still work."""
    with tempfile.TemporaryDirectory() as td:
        tmp_pdf = Path(td) / "paper.pdf"
        shutil.copy(stub_data_dir / "paper.pdf", tmp_pdf)

        # Create standard Docs object with the PDF we will shortly make corrupt
        docs = Docs()
        docname = await docs.aadd(tmp_pdf)
        (doc_details,) = (d for d in docs.docs.values() if d.docname == docname)
        assert isinstance(doc_details, DocDetails)
        assert doc_details.title, "This test's index doc requires a populated title"
        assert doc_details.year, "This test's index doc requires a populated year"

        # Now, let's make a corrupt PDF body
        body = "".join(t.text for t in docs.texts if t.doc == doc_details)
        # body += "\udcff"  # Inject unpaired low surrogate
        body = "\udcff"  # Keep exception short

        # Confirm we can both index and query this document
        search_index = SearchIndex(
            fields=[*SearchIndex.REQUIRED_FIELDS, "title", "year"]
        )
        index_doc = {
            "title": doc_details.title,
            "year": doc_details.year,
            "file_location": str(tmp_pdf.absolute()),
            "body": body,
        }
        await search_index.add_document(index_doc, document=docs)
        results = await search_index.query("XAI", keep_filenames=True)
        assert {(r[0].id, Path(r[1]).name) for r in results} == {(docs.id, "paper.pdf")}
```
Running that with this change, I am seeing:
- `paperqa.utils.hexdigest(body)` now succeeds (which makes sense due to `errors="replace"`)
- Immediately after, `tantivy.Document.from_dict(index_doc)` fails with `ValueError: Value unsupported '\udcff'`
So this PR is an improvement (we get further), but I think indexing would still blow up. Can you confirm that with this change you would still be unable to index your ChemRxiv folder, now due to tantivy.Document.from_dict?
And if the index build actually succeeds, can you improve this unit test and add it to this PR? Also, please get pre-commit passing and rebase atop the main branch.
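If tantivy does reject the raw string, one possible follow-up (an untested suggestion, not part of this PR) would be to sanitize the body text itself before building `index_doc`, e.g. by round-tripping it through UTF-8 with the same `errors="replace"` policy the hash now uses:

```python
def sanitize(text: str) -> str:
    # Round-trip through UTF-8: un-encodable characters (such as unpaired
    # surrogates) become "?", yielding a string that encodes cleanly.
    return text.encode("utf-8", errors="replace").decode("utf-8")


# The surrogate that trips tantivy.Document.from_dict is replaced:
assert sanitize("abc\udcff") == "abc?"
```

That would keep the hash and the indexed body consistent with each other, at the cost of mutating the stored text.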