Documentation: File content supersedes file extension parameter on open

Open futuremojo opened this issue 9 months ago • 5 comments

Description of the bug

I'm not sure whether this is documented behavior; I couldn't find it in the docs.

To recreate:

  1. Take a PNG file and rename it with a PDF extension.
  2. Set the file name on line 9 in the following code.
import pymupdf
import io
import logging

# Configure basic logging to see potential warnings/errors from pymupdf
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

image_filename = ""  # <--- CHANGE THIS to your actual image file name (it should have a PDF extension)

# --- Method 1: Read all bytes first ---
print(f"--- Testing with raw bytes from {image_filename} ---")
try:
    with open(image_filename, "rb") as f:
        image_data = f.read()
        print(f"Read {len(image_data)} bytes from image file.")

    doc_from_bytes = None
    try:
        # Simulate passing the raw byte stream
        doc_from_bytes = pymupdf.open(stream=image_data, filetype="pdf")
        print(f"SUCCESS: Opened stream from raw bytes.")
        print(f"Page count: {doc_from_bytes.page_count}")
        # Try getting text (might return empty or error)
        try:
            text = [page.get_text() for page in doc_from_bytes]
            print(f"Extracted text (list of strings per page): {text}")
            print(f"any(text): {any(text)}")
        except Exception as text_e:
            print(f"Error getting text: {text_e}")
    except Exception as e:
        print(f"FAILED to open stream from raw bytes: {e}")
    finally:
        if doc_from_bytes:
            doc_from_bytes.close()
except FileNotFoundError:
    print(f"ERROR: File '{image_filename}' not found. Please create it.")
except Exception as file_e:
    print(f"ERROR reading file: {file_e}")

print("\n")


# --- Method 2: Use io.BytesIO to create a file-like stream object ---
print(f"--- Testing with io.BytesIO stream from {image_filename} ---")
try:
    with open(image_filename, "rb") as f:
        image_data_for_io = f.read()
        print(f"Read {len(image_data_for_io)} bytes for io.BytesIO.")

    image_stream = io.BytesIO(image_data_for_io)
    doc_from_io = None
    try:
        # Pass the BytesIO stream object
        doc_from_io = pymupdf.open(stream=image_stream, filetype="pdf")
        print(f"SUCCESS: Opened stream from io.BytesIO.")
        print(f"Page count: {doc_from_io.page_count}")
        # Try getting text
        try:
            text_io = [page.get_text() for page in doc_from_io]
            print(f"Extracted text (list of strings per page): {text_io}")
            print(f"any(text_io): {any(text_io)}")
        except Exception as text_io_e:
            print(f"Error getting text: {text_io_e}")
    except Exception as e:
        print(f"FAILED to open stream from io.BytesIO: {e}")
    finally:
        if doc_from_io:
            doc_from_io.close()
        # No need to close BytesIO explicitly here usually
except FileNotFoundError:
    print(f"ERROR: File '{image_filename}' not found. Please create it.")
except Exception as file_io_e:
    print(f"ERROR reading file: {file_io_e}")

print("\n")

# --- Method 3: Control test - Open via file path (should fail) ---
print(f"--- Testing with file path {image_filename} ---")
doc_from_path = None
try:
    doc_from_path = pymupdf.open(image_filename, filetype="pdf")
    print(f"UNEXPECTED SUCCESS: Opened file path.")
    print(f"Page count: {doc_from_path.page_count}")
except Exception as e:
    print(f"EXPECTED FAILURE opening file path: {e}")  # Expecting FileDataError here
finally:
    if doc_from_path:
        doc_from_path.close()

print("\nTest complete.")

Expectation: methods 1 and 2 should raise an exception, just as method 3 does.

How to reproduce the bug

Output from code snippet ("fib_tool_1.pdf" is actually a PNG):

--- Testing with raw bytes from fib_tool_1.pdf ---
Read 181663 bytes from image file.
SUCCESS: Opened stream from raw bytes.
Page count: 1
Extracted text (list of strings per page): ['']
any(text): False


--- Testing with io.BytesIO stream from fib_tool_1.pdf ---
Read 181663 bytes for io.BytesIO.
SUCCESS: Opened stream from io.BytesIO.
Page count: 1
Extracted text (list of strings per page): ['']
any(text_io): False


--- Testing with file path fib_tool_1.pdf ---
EXPECTED FAILURE opening file path: Failed to open file 'fib_tool_1.pdf' as type 'pdf'.

Test complete.

PyMuPDF version

1.25.5

Operating system

macOS

Python version

3.12

futuremojo, Apr 14 '25 15:04

This is not a bug but a feature: our base library has improved its document handling and introduced a filetype "sniffer" that inspects the actual file content. Based on what it finds, it automatically opens the file with the correct document handler.

So, taking your example, the sniffer detects that an image file is being presented and creates the matching Document object ... irrespective of what you have passed as file extension or mime-type:

[Image]

This also works the other way round:

[Image]

We will adjust our documentation accordingly. We are still investigating whether we should also issue warnings in cases like the above.
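To make "looking at the actual file content" concrete, here is a minimal sketch of signature-based sniffing in plain Python. It is not MuPDF's actual algorithm: the function name `sniff_format` and the small `SIGNATURES` table are assumptions chosen for illustration, and a real sniffer covers far more formats and edge cases.

```python
# Illustrative only: a toy content sniffer, NOT MuPDF's real implementation.
# Keys are well-known leading byte signatures; values are hypothetical labels.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}


def sniff_format(data: bytes, declared: str = "") -> str:
    """Guess a format from the leading bytes; fall back to the declared type."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    # Nothing recognized: trust the caller's hint, if there is one.
    return declared or "unknown"


# A PNG signature followed by padding, declared (wrongly) as a PDF.
png_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
print(sniff_format(png_bytes, declared="pdf"))  # content wins over the declared type
```

In this sketch the declared type only matters when the content is not recognized, which mirrors the behavior described above: what the bytes say takes precedence over the extension or `filetype` hint.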

JorjMcKie, Apr 14 '25 23:04

@JorjMcKie I see. So in cases like this, where I want to make sure the user passed in a PDF, you would advise checking the doc.is_pdf flag, correct?

futuremojo, Apr 15 '25 00:04

> So in examples like this where I want to make sure the user passed in a PDF, you would advise checking the doc.is_pdf flag, correct??

Correct! Every document type reveals itself through the value of the "format" key in doc.metadata, for example doc.metadata["format"] == "XPS". Snippet:

import pymupdf
import pathlib
from pprint import pp

data = pathlib.Path("Acronis.xps").read_bytes()
doc = pymupdf.open(filetype="png", stream=data)
pp(doc.metadata)

Gives this:

{'format': 'XPS',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': '',
 'creationDate': '',
 'modDate': '',
 'trapped': '',
 'encryption': ''}

So arguably, the "filetype" parameter is now obsolete.
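For callers who need to reject non-PDF input, the check discussed above could be wrapped roughly as follows. This is a sketch, not an official PyMuPDF recipe: the helper names `require_pdf` and `open_pdf_strict` are hypothetical, and only `pymupdf.open`, `doc.is_pdf`, `doc.metadata` and `doc.close` are taken from the library. PyMuPDF is imported lazily so the validation logic can be exercised with a stand-in object.

```python
from types import SimpleNamespace


def require_pdf(doc):
    """Return doc unchanged if it is a PDF, otherwise raise ValueError.

    Accepts any object exposing `is_pdf` and `metadata`,
    as a pymupdf.Document does.
    """
    if not doc.is_pdf:
        detected = (doc.metadata or {}).get("format", "unknown")
        raise ValueError(f"expected a PDF, but the content was detected as {detected!r}")
    return doc


def open_pdf_strict(data: bytes):
    """Open bytes with PyMuPDF and reject anything that is not a PDF."""
    import pymupdf  # lazy import: require_pdf stays usable without PyMuPDF

    # The filetype hint does not stop the sniffer from picking another handler,
    # so the result still has to be checked.
    doc = pymupdf.open(stream=data, filetype="pdf")
    try:
        return require_pdf(doc)
    except ValueError:
        doc.close()  # do not leak the wrongly typed document
        raise


# Stand-in for a document whose content was sniffed as XPS (no PyMuPDF needed).
fake_xps = SimpleNamespace(is_pdf=False, metadata={"format": "XPS"})
try:
    require_pdf(fake_xps)
except ValueError as exc:
    print(exc)
```

Raising early keeps the rest of the pipeline from silently processing an image or XPS file as if it were a PDF, which is the situation the original report ran into.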

JorjMcKie, Apr 15 '25 08:04

With your permission, I would like to reformulate the issue title to something like "Documentation: File content supersedes file extension parameter on open".

JorjMcKie, Apr 15 '25 08:04

Yes, please go ahead. And thank you for the quick and helpful reply. I will leave the issue for you to do as you wish.

futuremojo, Apr 15 '25 10:04

The documentation at https://pymupdf.readthedocs.io/en/latest/ was updated a while ago to match the new behaviour, so closing this issue now.