Not able to fetch all text data & Not able to extract text, table data in proper format

Open reema93jain opened this issue 2 years ago • 1 comments

Hi Team,

I am using Layout Parser with Detectron2 to detect every element in a PDF (text, tables, titles, and lists) except figures. I first convert the PDF pages to images using pdf2image. I then want to extract the detected text, titles, tables, and lists to a .txt file.

Issues:

1) The model does not seem to recognize all of the text data properly.
2) When extracting the data to a .txt file:
   a) I am not able to print the text in the order in which it appears in the PDF.
   b) I am not able to extract table data in tabular format.

Can you please suggest how I can resolve the above issues? Thank you!

Code:

Install necessary libraries

#install detectron2:
!pip install 'git+https://github.com/facebookresearch/[email protected]#egg=detectron2'
#install layoutparser
!pip install layoutparser
!pip install layoutparser[ocr]
##install opencv, numpy, matplotlib
!pip install opencv-python numpy matplotlib
!pip3 install pdf2image
!sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev
!apt-get install poppler-utils
!pip install --upgrade google-cloud-vision
!pip uninstall google-cloud-vision
!pip install google-cloud-vision
!apt install tesseract-ocr
!apt install libtesseract-dev
!pip install pytesseract

import os
import shutil
import cv2
import numpy as np
import layoutparser as lp
from pdf2image import convert_from_path

Define Pdf_path

pdf_file='7050X_Q_A.pdf'

Define your output file name here

output_file = 'output.txt'

# Load the layout model and the OCR agent once, not once per page
model3 = lp.models.Detectron2LayoutModel(
    'lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
)
ocr_agent = lp.TesseractAgent(languages='eng')

with open(output_file, 'w', encoding='utf-8') as f:
    for i, page_img in enumerate(convert_from_path(pdf_file)):
        img = np.asarray(page_img)

        layout_result3 = model3.detect(img)

        # Keep every detected block except figures
        text_blocks = lp.Layout([b for b in layout_result3 if b.type != "Figure"])

        h, w = img.shape[:2]

        # Blocks whose center falls in the left half (plus 5% slack) form the left column
        left_interval = lp.Interval(0, w / 2 * 1.05, axis='x').put_on_canvas(img)

        left_blocks = text_blocks.filter_by(left_interval, center=True)
        # Layout.sort returns a sorted copy unless inplace=True is passed
        left_blocks.sort(key=lambda b: b.coordinates[1], inplace=True)

        right_blocks = lp.Layout([b for b in text_blocks if b not in left_blocks])
        right_blocks.sort(key=lambda b: b.coordinates[1], inplace=True)

        # Number the blocks in reading order: left column first, then right column
        text_blocks = lp.Layout([b.set(id=idx) for idx, b in enumerate(left_blocks + right_blocks)])
        viz = lp.draw_box(img, text_blocks, box_width=10, show_element_id=True)
        display(viz)

        # OCR each block, padded slightly so characters at the edges are not cut off
        for block in text_blocks:
            segment_image = (block
                             .pad(left=5, right=5, top=5, bottom=5)
                             .crop_image(img))

            text = ocr_agent.detect(segment_image)
            block.set(text=text, inplace=True)

        # Write text to the output file
        for txt in text_blocks.get_texts():
            #print(txt, end='\n---\n')
            f.write(txt + '\n---\n')

print("Text extraction completed. Check the output file:", output_file)
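
For reference, the column-ordering step in the code above can be reproduced without the model or OCR. The following is a minimal sketch, assuming plain (x1, y1, x2, y2) tuples instead of layoutparser blocks; the function name two_column_reading_order and the sample boxes are hypothetical and only illustrate the intended reading order (left column top to bottom, then right column top to bottom).

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def two_column_reading_order(boxes: List[Box], page_width: float,
                             slack: float = 1.05) -> List[Box]:
    """Order boxes left column first, then right column, each top to bottom."""
    # Same split as in the code above: half the page width plus 5% slack
    split_x = page_width / 2 * slack
    # A box belongs to the left column if its horizontal center is left of the split
    left = [b for b in boxes if (b[0] + b[2]) / 2 <= split_x]
    right = [b for b in boxes if (b[0] + b[2]) / 2 > split_x]
    # Sorting by the top edge (y1) approximates top-to-bottom reading order
    left.sort(key=lambda b: b[1])
    right.sort(key=lambda b: b[1])
    return left + right


# Hypothetical boxes on a 1000-pixel-wide page, deliberately out of order
boxes = [
    (520, 100, 900, 200),  # right column, top
    (50, 400, 450, 500),   # left column, bottom
    (50, 100, 450, 200),   # left column, top
    (520, 300, 900, 380),  # right column, bottom
]
ordered = two_column_reading_order(boxes, page_width=1000)
for box in ordered:
    print(box)
```

This heuristic assumes a two-column page. A full-width block, such as a page title, has its center left of the split and is therefore sorted together with the left column.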

Environment

Windows
Layout Parser & layoutparser[ocr] version 0.3.4
PyTorch version: 2.1.0+cu121
!pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
google-cloud-vision-3.5.0
google-api-core version: 2.11.1
Python 3.10.6

Thanks,
Reema Jain

Jan 31 '24 06:01 reema93jain

Hi Team,

Can someone please help with resolving the above issue?

Thank you for the help!
Reema Jain

Feb 01 '24 06:02 reema93jain