bug/Some images were not loaded. Check that poppler is installed and in your $PATH.
Describe the bug When I try to partition the PDF file using partition_pdf, I get the two error messages below:
- Some images were not loaded. Check that poppler is installed and in your $PATH.
- Some images were not loaded. Number of extracted images (487) does not match number of extracted page layouts (511)
To Reproduce The error occurs with the PDF attached to this issue. Pass the PDF to the function below:
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename=path,
    # Unstructured first finds embedded image blocks
    extract_images_in_pdf=False,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # Attempt to create a new chunk after 3800 chars
    # Attempt to keep chunks > 2000 chars
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    # image_output_dir_path=path,
)
Expected behavior It should return all extracted elements from the PDF.
Environment Info Using Ubuntu 20.04
Additional context I have already installed the required dependencies mentioned in the README file:
apt-get install poppler-utils
apt-get install tesseract-ocr
pip install unstructured[all-docs] unstructured-inference
pip install langchain pydantic lxml # if needed
But I am still getting the error.
Uploading Jyoti-CNC-Automation-Limited-RHP.pdf…
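Since the message asks whether poppler is installed and in $PATH, one quick check is whether the same Python environment that calls partition_pdf can resolve poppler's command-line tools. The snippet below is only a minimal sketch and is not part of unstructured: the helper names are hypothetical, and it assumes that pdfinfo and pdftoppm (both shipped in poppler-utils) are the binaries that need to be found.

```python
import shutil

# Command-line tools shipped with poppler-utils (assumed to be the ones needed).
POPPLER_BINARIES = ("pdfinfo", "pdftoppm")


def find_poppler_binaries(names=POPPLER_BINARIES):
    """Map each binary name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_poppler_binaries(names=POPPLER_BINARIES):
    """Return the binary names that cannot be resolved on the current PATH."""
    return [name for name, path in find_poppler_binaries(names).items() if path is None]


if __name__ == "__main__":
    # Run this with the same interpreter/environment that runs partition_pdf.
    for name, path in find_poppler_binaries().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If either tool is reported as not found, the process is running with a PATH that does not include the poppler install (for example a different container, virtual environment, or service account than the shell where apt-get was run).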
Hi @eci-aashish
I tried to reproduce this error, but I was unable to download the attached PDF. Can you please share it again?
I am getting the same error, along with this warning message:
WARNING: This function will be deprecated in a future release and unstructured will simply use the DEFAULT_MODEL from unstructured_inference.model.base to set the default model name
PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
I have attached the test file, Test.pdf.
The API is hosted on Azure.
Below is the code:
import json, os

from IPython.display import JSON
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import dict_to_elements, elements_to_json

s = UnstructuredClient(
    server_url="https://xx.xx.xx",
    api_key_auth="",
)

# Raw string so the backslashes in the Windows path are not treated as escapes
filename = r"..\sources\pdf\docs\Test.pdf"
elements = partition_pdf(
    filename=filename,
    infer_table_structure=True,
    strategy="hi_res",
)
tables = [el for el in elements if el.category == "Table"]
print(tables[0].text)
print(tables[0].metadata.text_as_html)
@mindful-time please open a separate issue for your issue and I'll be happy to help you with it.
Closing as inactive.