open-parse issues

Missing parts of documents

3

I tried to parse: https://www.pzu.pl/_fileserver/item/1540593

```
import openparse
from pprint import pprint

doc_path = 'data/OWU_szpit.pdf'
parser = openparse.DocumentParser()
parsed_doc = parser.parse(doc_path)
pprint(parsed_doc.model_dump())
```

The saved output does not contain for...

zby

Some PDF documents cannot be parsed

3

### Initial Checks

- [X] I confirm that I'm on the latest version

### Description

[example1.pdf](https://github.com/user-attachments/files/16424947/example1.pdf)
[example2.pdf](https://github.com/user-attachments/files/16424951/example2.pdf)

### Example Code

```Python
import openparse
from openparse import DocumentParser
from IPython.display import...
```

tiamjiakun

bug

Table Extraction Tool

1

### Description

Another tool for PDF table extraction was released recently. Could it be embedded as an option? https://github.com/ai8hyf/TF-ID

xyzdeclan

add langchain document support

3

### Description

Love the project. We need to add a langchain Document interface, which I am more than happy to do, but I have a few questions:

- each node...

priamai

scientific formula capturing

### Description

Fine-tune or train the model on scientific formulas so that it can recognize scientific notation and parse it accurately.

zabih1

Method to convert `ParsedDocument` object to LlamaIndex `Document` object

1

### Description

It would be great to have, in addition to the `to_llama_index_nodes` method, a `to_llama_index_document` method on the `openparse.schemas.ParsedDocument` class that returns a valid `llama_index.core.schema.Document` object.
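To make the request concrete, here is a minimal sketch of the kind of conversion being asked for. It does not use the real open-parse or LlamaIndex classes: `StubNode`, `StubDocument`, and `nodes_to_document` are hypothetical stand-ins invented for illustration, and the assumption that a document is simply the non-empty node texts joined by blank lines may not match what the maintainers would choose.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StubNode:
    """Hypothetical stand-in for a parsed node; it only carries text."""
    text: str


@dataclass
class StubDocument:
    """Hypothetical stand-in for the target document type."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def nodes_to_document(nodes: List[StubNode], filename: str) -> StubDocument:
    """Join node texts into a single document, skipping empty nodes."""
    parts = [n.text.strip() for n in nodes if n.text.strip()]
    return StubDocument(
        text="\n\n".join(parts),
        metadata={"file_name": filename},
    )
```

A real implementation would presumably build the library's own `Document` type and carry over more metadata; the sketch only shows the flattening step.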

mjspeck

Better table detection

jain-prach

flipping coordinates was removed for pymupdf

Coordinate flipping was removed because a top-left origin system is already returned, and flipping is handled in both sorting and `draw_bboxes`.
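For readers unfamiliar with the coordinate issue, the following sketch shows what flipping a bounding box between a bottom-left and a top-left origin involves. The `flip_y` helper and the `(x0, y0, x1, y1)` tuple layout are assumptions made for illustration, not code from the project.

```python
from typing import Tuple

# Assumed layout: (x0, y0, x1, y1), with y0 <= y1 in the box's own system.
BBox = Tuple[float, float, float, float]


def flip_y(bbox: BBox, page_height: float) -> BBox:
    """Convert a bbox between bottom-left and top-left origin systems.

    The x coordinates are unchanged. Each y coordinate is mirrored about
    the page's horizontal midline, and the two are swapped so that
    y0 <= y1 still holds in the new system.
    """
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)
```

The operation is its own inverse, so applying it twice returns the original box. That is also why flipping a box that is already in the target system produces wrong positions.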

jain-prach

PIL.UnidentifiedImageError

6

### Initial Checks

- [X] I confirm that I'm on the latest version

### Description

I've run into issues parsing some PDFs from the US House. For example: https://aderholt.house.gov/sites/evo-subsites/aderholt.house.gov/files/evo-media-document/aderholt-challenger-center-disclosure-ltr-updated.pdf

With...

thoppe

bug
