bug/RE_MULTISPACE_INCLUDING_NEWLINES is incorrectly applied to Table and TableChunk elements

Open JIAQIA opened this issue 10 months ago • 0 comments

🐞 Describe the bug

After a PDF file is partitioned, RE_MULTISPACE_INCLUDING_NEWLINES is applied to every element that is an instance of Text. The relevant code is shown below:

out_elements = []
for el in elements:
    if isinstance(el, PageBreak) and not include_page_breaks:
        continue

    if isinstance(el, Image):
        out_elements.append(cast(Element, el))
    # NOTE(crag): this is probably always a Text object, but check for the sake of typing
    elif isinstance(el, Text):
        el.text = re.sub(
            RE_MULTISPACE_INCLUDING_NEWLINES,
            " ",
            el.text or "",
        ).strip()
        if el.text or isinstance(el, PageBreak):
            out_elements.append(cast(Element, el))

File path: unstructured/partition/pdf.py

However, Table and TableChunk elements also pass through the Text branch. For these elements the newline character "\n" carries the table's structure and should not be removed.
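
The effect can be illustrated without the library. The snippet below is a minimal sketch: the pattern is a stand-in that I assume behaves like RE_MULTISPACE_INCLUDING_NEWLINES (any run of whitespace, newlines included), and the table text is made up for illustration.

```python
import re

# Assumption: stand-in for the library's RE_MULTISPACE_INCLUDING_NEWLINES,
# taken here to match any run of whitespace, newlines included.
RE_MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")

# Hypothetical OCR text for a small table, one row per line.
table_text = "Name  Qty\nApple  3\nPear   5"

# Same substitution as in the snippet above: every whitespace run becomes one space.
flattened = re.sub(RE_MULTISPACE_INCLUDING_NEWLINES, " ", table_text).strip()

print(repr(flattened))  # 'Name Qty Apple 3 Pear 5'
```

After the substitution the row boundaries are gone, so the rows can no longer be told apart from the text alone.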


🔁 To Reproduce

  1. Use partition_pdf on an image-based PDF that includes a table.
  2. Observe that the newline characters within table content are removed by the above code.

Expected behavior

Newlines should not be removed from Table or TableChunk elements.


🧰 Environment Info

  • OS version: macOS 13.6.7 (arm64)
  • Python version: 3.10.14
  • unstructured version: None
  • unstructured-inference version: 0.7.36
  • pytesseract version: 0.3.10
  • Torch version: 2.3.0
  • Detectron2: Not installed
  • PaddleOCR: Not installed
  • Libmagic version: libmagic 5.46 (bottled)
  • LibreOffice version: 25.2.1

⚠️ Note: There were warnings about pip version checks failing.


📌 Additional context

The issue likely arises from applying the regex substitution to all Text elements indiscriminately, including those derived from tables, where "\n" conveys meaningful structure.
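
One possible direction, sketched under assumptions rather than taken from the library: keep the current behavior for ordinary Text elements, but for Table and TableChunk collapse only horizontal whitespace within each line. The classes below are simplified stand-ins for the real element types, and normalize_text and RE_HORIZONTAL_SPACE are hypothetical names introduced for this example.

```python
import re
from dataclasses import dataclass

# Assumption: stand-in for the library's pattern (any whitespace run).
RE_MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")
# Hypothetical helper pattern: whitespace runs that contain no newline.
RE_HORIZONTAL_SPACE = re.compile(r"[^\S\n]+")


@dataclass
class Text:
    """Simplified stand-in for the library's Text element."""
    text: str


class Table(Text):
    """Stand-in: a table element that is also a Text element."""


class TableChunk(Table):
    """Stand-in: a chunk of a table."""


def normalize_text(el: Text) -> str:
    """Collapse whitespace, keeping line breaks for table elements."""
    raw = el.text or ""
    if isinstance(el, (Table, TableChunk)):
        # Clean each line separately so "\n" between rows survives.
        lines = (RE_HORIZONTAL_SPACE.sub(" ", ln).strip() for ln in raw.split("\n"))
        return "\n".join(ln for ln in lines if ln)
    # Non-table text keeps the existing behavior.
    return RE_MULTISPACE_INCLUDING_NEWLINES.sub(" ", raw).strip()


print(repr(normalize_text(Table("Name  Qty\n Apple   3\n"))))  # 'Name Qty\nApple 3'
print(repr(normalize_text(Text("some \n wrapped   text"))))     # 'some wrapped text'
```

In the loop from pdf.py this would amount to replacing the re.sub call with a type-aware helper. Whether empty lines inside a table should be dropped, as done here, is a design choice that maintainers may want to decide differently.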

JIAQIA, Apr 09 '25 11:04