Feature/remove unnecessary re for table ele in pdf

Open JIAQIA opened this issue 10 months ago • 2 comments

RE_MULTISPACE_INCLUDING_NEWLINES is applied to all elements of the Text category after partitioning PDF files. The relevant code is shown below:

out_elements = []
for el in elements:
    if isinstance(el, PageBreak) and not include_page_breaks:
        continue

    if isinstance(el, Image):
        out_elements.append(cast(Element, el))
    # NOTE(crag): this is probably always a Text object, but check for the sake of typing
    elif isinstance(el, Text):
        el.text = re.sub(
            RE_MULTISPACE_INCLUDING_NEWLINES,
            " ",
            el.text or "",
        ).strip()
        if el.text or isinstance(el, PageBreak):
            out_elements.append(cast(Element, el))

With this change, newlines are no longer removed from Table or TableChunk elements.

JIAQIA avatar Apr 09 '25 11:04 JIAQIA

This issue has a significant impact on our datasets because many of our image-based PDF files contain tabular data. Our OCR system extracts this information in Markdown format, where newline characters ("\n") are essential for preserving the table structure.
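To illustrate the effect described here, a minimal sketch follows. It assumes the pattern behaves like `\s+`; the name `MULTISPACE` and the sample table are hypothetical stand-ins, not the library's actual definitions.

```python
import re

# Assumption: a stand-in for RE_MULTISPACE_INCLUDING_NEWLINES that matches
# any run of whitespace, newlines included.
MULTISPACE = re.compile(r"\s+")

# A small Markdown table as an OCR step might emit it (made-up sample data).
markdown_table = "| a | b |\n|---|---|\n| 1 | 2 |"

# The same substitution the post-processing loop applies to Text elements.
flattened = MULTISPACE.sub(" ", markdown_table).strip()

print(repr(flattened))
```

Once the row separators are replaced by spaces, the three rows sit on a single line and a Markdown renderer no longer sees a table.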

JIAQIA avatar Apr 09 '25 11:04 JIAQIA

As written, when an element is an instance of Table or TableChunk it won't be added to out_elements at all, which doesn't seem like desired behavior.

qued avatar Jul 01 '25 14:07 qued
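One way to address both points (keep table newlines, and still emit the table elements) might look like the sketch below. The element classes are simplified stand-ins defined here so the example runs on its own, `clean_elements` is a hypothetical name, and the pattern is assumed to behave like `\s+`; none of this is taken from the actual patch.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumption: stand-in for RE_MULTISPACE_INCLUDING_NEWLINES.
MULTISPACE = re.compile(r"\s+")


@dataclass
class Element:
    text: str = ""


# Simplified stand-ins for the library's element classes (hypothetical hierarchy).
class Text(Element): pass
class Image(Text): pass
class PageBreak(Text): pass
class Table(Text): pass
class TableChunk(Table): pass


def clean_elements(elements: List[Element], include_page_breaks: bool = False) -> List[Element]:
    """Collapse whitespace in plain Text elements, but pass tables through untouched."""
    out: List[Element] = []
    for el in elements:
        if isinstance(el, PageBreak):
            # Page breaks are kept only on request.
            if include_page_breaks:
                out.append(el)
            continue
        if isinstance(el, (Image, Table)):
            # Table covers TableChunk too; keep the element and its newlines as-is.
            out.append(el)
            continue
        if isinstance(el, Text):
            el.text = MULTISPACE.sub(" ", el.text or "").strip()
            if el.text:  # drop elements that are empty after normalization
                out.append(el)
    return out


elements = [Text("a \n b"), Table("| x |\n| y |"), PageBreak(""), Text("   ")]
result = clean_elements(elements)
```

The explicit `Table` branch appends the element before the generic `Text` branch is reached, so tables are neither normalized nor silently dropped.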