open-parse

Improved file parsing for LLMs

Results 33 open-parse issues

I tried to parse: https://www.pzu.pl/_fileserver/item/1540593
```
import openparse
from pprint import pprint

doc_path = 'data/OWU_szpit.pdf'
parser = openparse.DocumentParser()
parsed_doc = parser.parse(doc_path)
pprint(parsed_doc.model_dump())
```
The saved output does not contain for...

### Initial Checks
- [X] I confirm that I'm on the latest version
### Description
[example1.pdf](https://github.com/user-attachments/files/16424947/example1.pdf) [example2.pdf](https://github.com/user-attachments/files/16424951/example2.pdf)
### Example Code
```Python
import openparse
from openparse import DocumentParser
from IPython.display import...
```

bug

### Description
Another tool for PDF table extraction was released recently. Could it be embedded as an option? https://github.com/ai8hyf/TF-ID

### Description
Love the project. We need to add a LangChain Document interface, which I am more than happy to do, but I have a few questions:
- each node...

### Description
Fine-tune or train the model on scientific formulas, so that it understands scientific symbols and parses them accurately.

### Description
In addition to the `to_llama_index_nodes` method, it would be great to have a `to_llama_index_document` method on the `openparse.schemas.ParsedDocument` class that returns a valid `llama_index.core.schema.Document` object.
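A rough sketch of what the requested method could look like. Everything here is an assumption for illustration: `Node`, `Document`, and `ParsedDocument` are simplified stand-ins written for this example, not the real open-parse or LlamaIndex classes, and the `separator` parameter and metadata keys are hypothetical. A real implementation would import `Document` from `llama_index.core.schema` and use the actual node fields.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed node; only `text` is modeled."""
    text: str


@dataclass
class Document:
    """Hypothetical stand-in for llama_index.core.schema.Document."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ParsedDocument:
    """Hypothetical stand-in for openparse.schemas.ParsedDocument."""
    filename: str
    nodes: List[Node]

    def to_llama_index_document(self, separator: str = "\n\n") -> Document:
        # Join the node texts, in their existing order, into one body.
        text = separator.join(node.text for node in self.nodes)
        # Carry a little provenance along as metadata (keys are assumptions).
        metadata = {"filename": self.filename, "num_nodes": len(self.nodes)}
        return Document(text=text, metadata=metadata)


doc = ParsedDocument(filename="example.pdf", nodes=[Node("Title"), Node("Body text")])
combined = doc.to_llama_index_document()
print(combined.text)
```

Joining with a blank line keeps node boundaries visible in the combined text; whether per-node metadata such as bounding boxes should also be merged is left open here.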

The top-left origin coordinate system is returned, and coordinate flipping is handled in both sorting and `draw_bboxes`.
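For readers unfamiliar with the coordinate flip mentioned above, here is a minimal sketch. It assumes the source boxes use a bottom-left origin (the default for PDF user space) and are given as `(x0, y0, x1, y1)` tuples. The helper names `flip_bbox_to_top_left` and `reading_order` are hypothetical and are not taken from the open-parse code.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def flip_bbox_to_top_left(bbox: BBox, page_height: float) -> BBox:
    """Convert a bottom-left-origin box to a top-left-origin box."""
    x0, y0, x1, y1 = bbox
    # Mirror the y values about the page height. The old upper edge (y1)
    # becomes the new upper edge, which now has the smaller y value.
    return (x0, page_height - y1, x1, page_height - y0)


def reading_order(boxes: List[BBox], page_height: float) -> List[BBox]:
    """Flip boxes, then sort top-to-bottom and left-to-right."""
    flipped = [flip_bbox_to_top_left(b, page_height) for b in boxes]
    # In a top-left system, a smaller y means higher on the page.
    return sorted(flipped, key=lambda b: (b[1], b[0]))


# Two boxes on a 792-point-high page, given bottom-left style.
boxes = [(50.0, 100.0, 200.0, 120.0), (50.0, 700.0, 200.0, 720.0)]
ordered = reading_order(boxes, page_height=792.0)
print(ordered)
```

Doing the flip once, before sorting and drawing, avoids each consumer having to know which origin the boxes came from.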

### Initial Checks
- [X] I confirm that I'm on the latest version
### Description
I've run into issues parsing some PDFs from the US House. For example: https://aderholt.house.gov/sites/evo-subsites/aderholt.house.gov/files/evo-media-document/aderholt-challenger-center-disclosure-ltr-updated.pdf
With...

bug