Enhance PDF parsing capabilities

Open ic-xu opened this issue 1 year ago • 0 comments

Description

This change parses PDF documents with an enhanced openparse library. It preserves the table layout of the PDF and extracts the images embedded in it, while keeping the content in the order a reader would follow.

Fixes # (issue)

Type of Change

Please delete options that are not relevant.

[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

To test the PDF parsing function on its own, use the following code:

from core.rag.extractor.pdf.openparse import DocumentParser, processing
from core.rag.extractor.pdf.openparse.schemas import ImageElement

if __name__ == '__main__':
    # Parse a single PDF and print the text of every non-image element.
    file_path = "pdf file path here"
    parser = DocumentParser(
        processing_pipeline=processing.BasicIngestionPipeline(),
        table_args={
            "parsing_algorithm": "pymupdf",
            "table_output_format": "markdown"
        }
    )
    parsed_basic_doc = parser.parse(file_path)
    for node in parsed_basic_doc.nodes:
        for element in node.elements:
            if isinstance(element, ImageElement):
                # Image elements carry raw bytes in element.image; they are skipped here.
                pass
            else:
                print(element.text)

There are three element types: ImageElement, TableElement and TextElement. When parsing images, the relevant type is ImageElement, which exposes the following attributes:

block: dict
text: str
image: bytes
ext: str
bbox: Bbox
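
As an illustration of how the `image` and `ext` attributes might be used, here is a minimal sketch that writes extracted images to disk. It does not import the real `ImageElement`; `FakeImageElement` and `save_images` are hypothetical names introduced only for this example, and the sketch assumes `ext` is a plain extension such as "png" (a leading dot is tolerated).

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List


@dataclass
class FakeImageElement:
    """Hypothetical stand-in exposing only the attributes used below."""
    image: bytes  # raw image bytes
    ext: str      # file extension, assumed to look like "png"


def save_images(elements: Iterable[object], out_dir: str) -> List[Path]:
    """Write every element that has non-empty `image` and `ext` to out_dir."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved: List[Path] = []
    for index, element in enumerate(elements):
        image = getattr(element, "image", None)
        ext = getattr(element, "ext", None)
        if not image or not ext:
            continue  # not an image element, or nothing to write
        path = target / f"image_{index}.{ext.lstrip('.')}"
        path.write_bytes(image)
        saved.append(path)
    return saved
```

In the test script above, the same function could be called with `node.elements` in place of the stand-in objects, since it only relies on the `image` and `ext` attributes.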

To test the overall effect, please follow the complete PDF upload and parsing process.

Suggested Checklist:

[ ] Normal PDF
[ ] PDF document containing tables
[ ] PDF document containing images (if possible)

Apr 24 '24 09:04 ic-xu