Improper results on scanned PDFs
I have been analyzing different types of documents with Layout Parser. I get the expected results on true PDFs but not on scanned PDFs: the contents of a scanned page are either detected as a Figure or otherwise not detected as expected.
I am facing this issue only with scanned PDFs.
Checklist
- I have searched related issues but cannot get the expected help.
- The bug has not been fixed in the latest version, see the Layout Parser Releases
To Reproduce
import cv2
import layoutparser as lp
# Load the page image; OpenCV reads BGR, so reverse the channels to get RGB
image = cv2.imread("test.png")
image = image[..., ::-1]
# PubLayNet Faster R-CNN model, keeping detections with a score of at least 0.8
model = lp.models.Detectron2LayoutModel('lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config', extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8], label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"})
color_map = {'Text': 'red', 'Title': 'blue', 'List': 'green', 'Table': 'purple', 'Figure': 'pink'}
# Detect layout blocks and draw them on the page
layout = model.detect(image)
lp.draw_box(image, layout, box_width=3, color_map=color_map)
Environment
- I am using Windows
- Latest Layout Parser version
Contains 2 images:
1: Scanned pdf image result
2: Proper pdf image result
Have you tried correcting the scanned images so that the background is plain white? Here's a robust-looking example using OpenCV:
https://www.freedomvc.com/index.php/2022/01/17/basic-background-remover-with-opencv/