unstructured bug/Some images were not loaded. Check that poppler is installed and in your $PATH.

Describe the bug When I try to partition a PDF file using partition_pdf, I get the two error messages below:

Some images were not loaded. Check that poppler is installed and in your $PATH.
Some images were not loaded. Number of extracted images (487) does not match number of extracted page layouts (511)

To Reproduce The error occurs with the PDF I have attached to this issue. Just pass the PDF to the function below:

from unstructured.partition.pdf import partition_pdf

# `path` is the path to the attached PDF
raw_pdf_elements = partition_pdf(
    filename=path,
    # Do not extract embedded image blocks
    extract_images_in_pdf=False,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post-processing to aggregate text once we have the titles
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # Hard limit of 4000 chars per chunk
    # Attempt to start a new chunk after 3800 chars
    # Attempt to combine text blocks under 2000 chars
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    # image_output_dir_path=path,
)

Expected behavior It should return all of the elements extracted from the PDF.

Environment Info Using Ubuntu 20.04

Additional context I have already installed the required dependencies mentioned in the README file:

apt-get install poppler-utils
apt-get install tesseract-ocr
pip install unstructured[all-docs] unstructured-inference
pip install langchain pydantic lxml  # if needed

But I am still getting the error. Uploading Jyoti-CNC-Automation-Limited-RHP.pdf…
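A quick way to check whether the Python process running partition_pdf can actually see poppler is to look up its command-line tools on PATH. This is a minimal sketch, assuming the warning is triggered because poppler's `pdfinfo` and `pdftoppm` binaries (both shipped in poppler-utils) cannot be found; `find_tools` is a hypothetical helper name, not part of unstructured.

```python
import shutil

# Binaries shipped by poppler-utils; assumed to be what the PDF-to-image step needs.
POPPLER_TOOLS = ("pdfinfo", "pdftoppm")


def find_tools(names=POPPLER_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, location in find_tools().items():
        print(f"{tool}: {location or 'NOT FOUND on PATH'}")
```

Run it with the same interpreter and environment (virtualenv, container, notebook kernel) that calls partition_pdf, since PATH can differ between a login shell and the process that runs the code.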

Mar 14 '24 09:03 eci-aashish

Hi @eci-aashish

I tried to reproduce this error, but I was unable to download the attached PDF. Can you please share it again?

Apr 10 '24 18:04 christinestraub

I am getting the same error, along with this warning message: WARNING: This function will be deprecated in a future release and unstructured will simply use the DEFAULT_MODEL from unstructured_inference.model.base to set the default model name

PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

I have attached the test file, Test.pdf.
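If poppler is installed but lives in a folder that is not on PATH (common on Windows, where it is unpacked manually), one possible workaround is to prepend that folder to PATH inside the process before calling partition_pdf. This is a sketch under that assumption; `prepend_to_path` is a hypothetical helper and `POPPLER_BIN` is a placeholder location, not a real default.

```python
import os


def prepend_to_path(directory):
    """Put `directory` at the front of PATH for this process and its children."""
    current = os.environ.get("PATH", "")
    # Drop empty entries and any earlier copy of the directory to avoid duplicates.
    parts = [p for p in current.split(os.pathsep) if p and p != directory]
    os.environ["PATH"] = os.pathsep.join([directory, *parts])
    return os.environ["PATH"]


# Hypothetical location; replace it with the folder that contains pdfinfo.
POPPLER_BIN = os.path.join(os.path.expanduser("~"), "poppler", "bin")

if __name__ == "__main__":
    prepend_to_path(POPPLER_BIN)
    print(os.environ["PATH"].split(os.pathsep)[0])
```

The change only affects the current process, so it has to run before the first call that shells out to poppler.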

The API is hosted on Azure.

Below is the code:

from IPython.display import JSON

import json, os

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import dict_to_elements, elements_to_json

s = UnstructuredClient(server_url="https://xx.xx.xx", api_key_auth="")

# Raw string so the backslashes are not treated as escape sequences
filename = r"..\sources\pdf\docs\Test.pdf"

elements = partition_pdf(filename=filename, infer_table_structure=True, strategy="hi_res")

tables = [el for el in elements if el.category == "Table"]

print(tables[0].text)
print(tables[0].metadata.text_as_html)
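Note that `tables[0]` raises an IndexError whenever no Table element is returned, which can hide the real problem once partitioning itself succeeds. A minimal sketch of a safer lookup follows; `first_table` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the example runs without unstructured installed. It relies only on the `category`, `text`, and `metadata.text_as_html` attributes already used above.

```python
from types import SimpleNamespace


def first_table(elements):
    """Return the first element whose category is "Table", or None."""
    return next(
        (el for el in elements if getattr(el, "category", None) == "Table"),
        None,
    )


# Stand-in elements; real ones would come from partition_pdf.
fake_elements = [
    SimpleNamespace(category="Title", text="Report"),
    SimpleNamespace(
        category="Table",
        text="a b",
        metadata=SimpleNamespace(
            text_as_html="<table><tr><td>a</td><td>b</td></tr></table>"
        ),
    ),
]

table = first_table(fake_elements)
if table is None:
    print("No Table elements were returned")
else:
    print(table.text)
    print(table.metadata.text_as_html)
```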

May 06 '24 21:05 mindful-time

@mindful-time please open a separate issue for your problem and I'll be happy to help you with it.

Closing as inactive.

May 07 '24 16:05 scanny