
Fix: handle non-UTF-8 input in util.py

Open dmcgrath19 opened this issue 6 months ago • 5 comments

Chemistry papers such as DOIs 10.26434/chemrxiv-2025-1x058-v2 and 10.26434/chemrxiv-2025-3lwn2 sometimes throw errors during parsing due to non-UTF-8 characters. This change updates the MD5 hashing line to replace problematic characters during UTF-8 encoding, making parsing more robust and preventing interruptions caused by encoding errors.

Closes https://github.com/Future-House/paper-qa/issues/1068

dmcgrath19 avatar Aug 20 '25 16:08 dmcgrath19

@dmcgrath19 - why would it be necessary to replace the characters? Shouldn't we just ignore? This code only computes the hash - it has no effect on the actual usage of the parsed paper.

whitead avatar Aug 20 '25 20:08 whitead

@whitead My reasoning for replacing rather than ignoring invalid characters is that the hash serves as a unique identifier. If invalid characters are ignored and dropped silently, different inputs with subtle encoding differences could produce identical hashes, causing issues with deduplication and caching. By replacing invalid characters, we ensure the hash reflects all differences in the input data, preserving uniqueness and making encoding problems visible.

TLDR: it's more robust but could be seen as a bit messier.
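The difference can be sketched with a hypothetical `hexdigest` helper (the helper and its `errors` parameter are illustrative, not paper-qa's actual signature):

```python
import hashlib

def hexdigest(text: str, errors: str) -> str:
    """Hypothetical mirror of paperqa.utils.hexdigest, with the
    str.encode error handler exposed for illustration."""
    return hashlib.md5(text.encode("utf-8", errors=errors)).hexdigest()

clean = "ab"
bad = "a\udcffb"  # same visible text, but with an unpaired low surrogate

# errors="ignore" silently drops the surrogate, so the corrupt input
# hashes identically to the clean one
assert hexdigest(bad, "ignore") == hexdigest(clean, "ignore")

# errors="replace" substitutes "?" on encode, so the hash still
# records that the input differed from the clean text
assert hexdigest(bad, "replace") != hexdigest(clean, "replace")
```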

dmcgrath19 avatar Aug 20 '25 21:08 dmcgrath19

Hi @dmcgrath19, I was trying to repro this issue so I can write a unit test. The below is not hitting the UTF-8 errors; what am I missing?

import pytest

from paperqa import Docs
from paperqa.agents import SearchIndex


@pytest.mark.asyncio
async def test_pdf() -> None:
    docs = Docs()
    docname = await docs.aadd_url(
        # URL from the download PDF button from:
        # - 10.26434/chemrxiv-2025-3lwn2: https://chemrxiv.org/engage/chemrxiv/article-details/687f4677728bf9025ea3067a
        # - 10.26434/chemrxiv-2025-1x058-v2: https://chemrxiv.org/engage/chemrxiv/article-details/6835abf53ba0887c33410d6d
        "..."
    )
    (doc_details,) = (d for d in docs.docs.values() if d.docname == docname)
    index_doc = {
        "title": doc_details.title,
        "file_location": "all-atom",
        "body": "".join(t.text for t in docs.texts if t.doc == doc_details),
    }

    search_index = SearchIndex(fields=[*SearchIndex.REQUIRED_FIELDS, "title"])
    await search_index.add_document(index_doc, document=docs)

jamesbraza avatar Aug 21 '25 22:08 jamesbraza

Thanks for digging into this.

In my case, I batch-downloaded some PDFs from ChemRxiv into a local folder and pointed paperqa at that directory. Not all of the files triggered the error. When I inspected the local copies, they didn’t appear corrupted or mis-encoded, so my guess is that the issue could be related to how some of the PDFs are encoded by ChemRxiv during batch-downloading.

dmcgrath19 avatar Aug 25 '25 12:08 dmcgrath19

> In my case, I batch-downloaded some PDFs from ChemRxiv into a local folder and pointed paperqa at that directory. Not all of the files triggered the error. When I inspected the local copies, they didn’t appear corrupted or mis-encoded, so my guess is that the issue could be related to how some of the PDFs are encoded by ChemRxiv during batch-downloading.

Ah that makes sense, thanks for clarifying that.


Did you actually test this change out yet? I made the below unit test for a bad PDF in test_agents.py:

import shutil
import tempfile
from pathlib import Path

import pytest

from paperqa import DocDetails, Docs
from paperqa.agents import SearchIndex


@pytest.mark.asyncio
async def test_search_corrupt_pdf(stub_data_dir: Path) -> None:
    """Test that a slightly flawed PDF can still work."""
    with tempfile.TemporaryDirectory() as td:
        tmp_pdf = Path(td) / "paper.pdf"
        shutil.copy(stub_data_dir / "paper.pdf", tmp_pdf)

        # Create standard Docs object with the PDF we will shortly make corrupt
        docs = Docs()
        docname = await docs.aadd(tmp_pdf)
        (doc_details,) = (d for d in docs.docs.values() if d.docname == docname)
        assert isinstance(doc_details, DocDetails)
        assert doc_details.title, "This test's index doc requires a populated title"
        assert doc_details.year, "This test's index doc requires a populated year"
        # Now, let's make a corrupt PDF body
        body = "".join(t.text for t in docs.texts if t.doc == doc_details)
        # body += "\udcff"  # Inject unpaired low surrogate
        body = "\udcff"  # Keep exception short

        # Confirm we can both index and query this document
        search_index = SearchIndex(
            fields=[*SearchIndex.REQUIRED_FIELDS, "title", "year"]
        )
        index_doc = {
            "title": doc_details.title,
            "year": doc_details.year,
            "file_location": str(tmp_pdf.absolute()),
            "body": body,
        }
        await search_index.add_document(index_doc, document=docs)
        results = await search_index.query("XAI", keep_filenames=True)
        assert {(r[0].id, Path(r[1]).name) for r in results} == {(docs.id, "paper.pdf")}

Running that with this change, I am seeing:

  1. paperqa.utils.hexdigest(body) now succeeds (which makes sense due to errors="replace")
  2. Immediately after, tantivy.Document.from_dict(index_doc) fails with ValueError: Value unsupported '\udcff'

So this PR is an improvement (we get further), but I think one would still get blown up. Can you confirm with this change that you would still be unable to index your ChemRxiv folder, now due to tantivy.Document.from_dict?
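To illustrate point 2: `errors="replace"` only sanitizes the temporary bytes passed to MD5, not the string itself, so a later strict UTF-8 encode (which tantivy's document serialization appears to require) still raises. A minimal sketch:

```python
import hashlib

body = "text\udcff"  # body containing an unpaired low surrogate

# Hashing with errors="replace" succeeds: the surrogate encodes as "?"
digest = hashlib.md5(body.encode("utf-8", errors="replace")).hexdigest()
assert len(digest) == 32

# But `body` itself still holds the surrogate, so any downstream strict
# UTF-8 encode (e.g. during document serialization) still fails
try:
    body.encode("utf-8")
    raise AssertionError("expected UnicodeEncodeError")
except UnicodeEncodeError:
    pass
```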

And if the index build actually succeeds, can you try to improve this unit test and add it to this PR? Also, please get `pre-commit` passing and rebase atop the `main` branch.

jamesbraza avatar Aug 25 '25 20:08 jamesbraza