unstructured Feature/remove unnecessary re for table ele in pdf

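After a PDF file is partitioned, RE_MULTISPACE_INCLUDING_NEWLINES is applied to every element that is an instance of Text, which collapses each run of whitespace (newlines included) into a single space. Table and TableChunk elements go through the same branch, so their newlines are lost as well. The relevant code is shown below: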

out_elements = []
for el in elements:
    if isinstance(el, PageBreak) and not include_page_breaks:
        continue

    if isinstance(el, Image):
        out_elements.append(cast(Element, el))
    # NOTE(crag): this is probably always a Text object, but check for the sake of typing
    elif isinstance(el, Text):
        el.text = re.sub(
            RE_MULTISPACE_INCLUDING_NEWLINES,
            " ",
            el.text or "",
        ).strip()
        if el.text or isinstance(el, PageBreak):
            out_elements.append(cast(Element, el))

With this change, newlines are no longer removed from Table or TableChunk elements.

Apr 09 '25 11:04 JIAQIA

This issue has a significant impact on our datasets because many of our image-based PDF files contain tabular data. Our OCR system extracts this information in Markdown format, where newline characters ("\n") are essential for preserving the table structure.

Apr 09 '25 11:04 JIAQIA
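
To illustrate the impact described above, here is a minimal sketch of what whitespace collapsing does to a Markdown table. The pattern below is a stand-in and assumes the library's RE_MULTISPACE_INCLUDING_NEWLINES matches any run of whitespace; the sample table is made up for illustration.

```python
import re

# Assumption: stand-in for the library's pattern, treated here as
# "any run of whitespace, newlines included".
RE_MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+", flags=re.DOTALL)

# Hypothetical OCR output: a small Markdown table, one row per line.
markdown_table = "| a | b |\n|---|---|\n| 1 | 2 |"

# Same substitution as in the loop quoted in the description.
flattened = re.sub(RE_MULTISPACE_INCLUDING_NEWLINES, " ", markdown_table).strip()

print(flattened)
```

After the substitution the rows sit on a single line, so a Markdown parser can no longer tell where one row ends and the next begins.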

As written, when an element is an instance of Table or TableChunk it won't be added to out_elements at all, which doesn't seem like desired behavior.

Jul 01 '25 14:07 qued
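
One possible way to address the concern in the last comment is to give tables their own branch that appends them unchanged, instead of excluding them from the Text branch entirely. The sketch below is not the code from this PR: the element classes are simplified stand-ins (assuming Table is a Text subclass and TableChunk is a Table subclass), the pattern is assumed to match any run of whitespace, and normalize_text_elements is a hypothetical helper name.

```python
import re
from typing import List

# Assumption: stand-in for the library's pattern.
RE_MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")


# Simplified stand-ins for the library's element classes (hypothetical).
class Element:
    def __init__(self, text: str = "") -> None:
        self.text = text


class Text(Element):
    pass


class PageBreak(Text):
    pass


class Image(Text):
    pass


class Table(Text):
    pass


class TableChunk(Table):
    pass


def normalize_text_elements(
    elements: List[Element], include_page_breaks: bool = False
) -> List[Element]:
    """Collapse whitespace in plain text elements but leave table text as-is."""
    out_elements: List[Element] = []
    for el in elements:
        if isinstance(el, PageBreak) and not include_page_breaks:
            continue

        if isinstance(el, Image):
            out_elements.append(el)
        elif isinstance(el, Table):
            # Covers TableChunk too; keep newlines, but still emit the element.
            if el.text:
                out_elements.append(el)
        elif isinstance(el, Text):
            el.text = RE_MULTISPACE_INCLUDING_NEWLINES.sub(" ", el.text or "").strip()
            if el.text or isinstance(el, PageBreak):
                out_elements.append(el)
    return out_elements


sample = [Text("a \n b"), Table("| a |\n| b |"), TableChunk("| c |\n| d |"), PageBreak("")]
result = normalize_text_elements(sample)
```

The table check has to come before the Text check because, under the stated assumption, every Table is also a Text; placing it later would send tables through the regex again.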