No nodes are extracted from some PDFs

Open faileon opened this issue 1 year ago • 0 comments

Initial Checks

[X] I confirm that I'm on the latest version

Description

I've noticed that when I split my PDF in Firefox to get a smaller PDF (e.g. the first 10 pages), openparse won't extract any nodes from the split file. The original PDF is extracted fine.

When I specify table_args, the parser does return some nodes, but all of them are identified as tables.

I am attaching the PDF; perhaps someone could have a look at what's wrong: concept-vp4360-cz.pdf
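
A minimal sketch of how the behaviour described above could be reproduced. The helper names (`parse_nodes`, `count_variants`, `compare`) are hypothetical, and the `table_args` value is an assumption, since the exact arguments used were not recorded. It also assumes each open-parse node exposes a `variant` attribute holding a set of labels such as "text" or "table".

```python
from collections import Counter


def parse_nodes(pdf_path, table_args=None):
    """Parse a PDF with open-parse and return the extracted nodes."""
    # Imported lazily so count_variants() can be used without open-parse installed.
    import openparse

    parser = openparse.DocumentParser(table_args=table_args)
    return parser.parse(pdf_path).nodes


def count_variants(nodes):
    """Count nodes per variant label."""
    counts = Counter()
    for node in nodes:
        # Assumption: `variant` is a set of labels; fall back to "unknown" if absent.
        labels = getattr(node, "variant", None) or {"unknown"}
        for label in sorted(labels):
            counts[label] += 1
    return counts


def compare(pdf_path):
    """Print node counts with and without table parsing enabled."""
    plain = parse_nodes(pdf_path)
    # Assumption: PyMuPDF-based table parsing; the issue does not say which was used.
    with_tables = parse_nodes(pdf_path, table_args={"parsing_algorithm": "pymupdf"})
    print("default:   ", len(plain), dict(count_variants(plain)))
    print("table_args:", len(with_tables), dict(count_variants(with_tables)))
```

Calling `compare(...)` once on the original PDF and once on the Firefox-split copy should show whether the default run yields zero nodes and whether the `table_args` run yields only table nodes.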

Example Code

No response

Python, open-parse & OS Version

python_version: 3.12.7
operating_system: Linux
os_version: 6.11.8-arch1-2
open-parse version: 0.7.0
python version: 3.12.7 (main, Oct  1 2024, 11:15:50) [GCC 14.2.1 20240910]
platform: Linux-6.11.8-arch1-2-x86_64-with-glibc2.40
related packages: torchvision-0.20.1 tokenizers-0.20.3 torch-2.5.1 pydantic-2.9.2 PyMuPDF-1.24.13 transformers-4.46.2

Nov 16 '24 21:11 faileon