Enhance PDF parsing capabilities
Description
This change parses PDF documents with an enhanced version of the openparse library. Parsing preserves the table formatting of the PDF and extracts its embedded images, while keeping the content in the original reading order.
Fixes # (issue)
Type of Change
Please delete options that are not relevant.
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration
To test the PDF parsing function on its own, use the following code:
    from core.rag.extractor.pdf.openparse import DocumentParser, processing
    from core.rag.extractor.pdf.openparse.schemas import ImageElement

    if __name__ == '__main__':
        file_path = "pdf file path here"

        # Parse tables with the pymupdf algorithm and emit them as Markdown.
        parser = DocumentParser(
            processing_pipeline=processing.BasicIngestionPipeline(),
            table_args={
                "parsing_algorithm": "pymupdf",
                "table_output_format": "markdown"
            }
        )
        parsed_basic_doc = parser.parse(file_path)

        # Walk the nodes in reading order and print every non-image element.
        for node in parsed_basic_doc.nodes:
            for element in node.elements:
                if isinstance(element, ImageElement):
                    # Image element: the raw bytes are in element.image.
                    pass
                else:
                    print(element.text)
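The loop above only prints text as it goes. As a rough sketch of how the same traversal could gather text into one record per node instead, the example below uses small stand-in classes so it runs without the project installed. `TextElementStub`, `ImageElementStub`, `NodeStub`, `collect_text`, and the `node_index` metadata key are all hypothetical names for illustration, not part of this PR.

```python
from dataclasses import dataclass, field


@dataclass
class TextElementStub:
    """Hypothetical stand-in for a text or table element."""
    text: str


@dataclass
class ImageElementStub:
    """Hypothetical stand-in for ImageElement."""
    image: bytes
    ext: str
    text: str = ""


@dataclass
class NodeStub:
    """Hypothetical stand-in for a parsed node holding elements."""
    elements: list = field(default_factory=list)


def collect_text(nodes, source):
    """Return one record per node with its non-image text joined by newlines."""
    records = []
    for index, node in enumerate(nodes):
        parts = [
            element.text
            for element in node.elements
            if not isinstance(element, ImageElementStub) and element.text
        ]
        records.append({
            "content": "\n".join(parts),
            # node_index is the node's position, which may differ from the page number.
            "metadata": {"source": source, "node_index": index},
        })
    return records


if __name__ == "__main__":
    nodes = [
        NodeStub([TextElementStub("Title"), ImageElementStub(b"\x00", "png")]),
        NodeStub([TextElementStub("| a | b |")]),
    ]
    for record in collect_text(nodes, "example.pdf"):
        print(record)
```

With the real parser, `parsed_basic_doc.nodes` and `ImageElement` would presumably take the place of the stubs.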
Each node contains elements of three types: `ImageElement`, `TableElement`, and `TextElement`. When parsing images, the element to focus on is `ImageElement`, which exposes the following attributes:
- `block: dict`
- `text: str`
- `image: bytes`
- `ext: str`
- `bbox: Bbox`
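As an illustration of how the `image` and `ext` attributes might be used, the sketch below writes each image's bytes to disk. It relies on a hypothetical `ImageElementStub` that mirrors only the attributes listed above, and on a hypothetical helper `save_images`. It also assumes `ext` is a file extension such as `png`, with or without a leading dot, which this description does not confirm.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass
class ImageElementStub:
    """Hypothetical stand-in mirroring the ImageElement attributes used here."""
    image: bytes
    ext: str
    text: str = ""


def save_images(elements, out_dir):
    """Write each image element's bytes to out_dir and return the written paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for i, element in enumerate(elements):
        # Assumption: ext may or may not carry a leading dot, so normalise it.
        suffix = element.ext.lstrip(".") or "bin"
        target = out_path / f"image_{i}.{suffix}"
        target.write_bytes(element.image)
        written.append(target)
    return written


if __name__ == "__main__":
    sample = [ImageElementStub(image=b"\x89PNG placeholder bytes", ext="png")]
    with tempfile.TemporaryDirectory() as tmp:
        for path in save_images(sample, tmp):
            print(path.name, path.stat().st_size)
```

In the test script, the `pass` branch for `ImageElement` is where a call like this could go, collecting the image elements first and saving them after the loop.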
To test the overall effect, please follow the complete PDF upload and parsing process.
Suggested Checklist:
- [ ] normal PDF
- [ ] PDF document containing tables
- [ ] PDF document containing images (if one is available)