unstructured
Feature/remove unnecessary re for table ele in pdf
RE_MULTISPACE_INCLUDING_NEWLINES is applied to all elements of the Text category after partitioning PDF files. The relevant code is shown below:
    out_elements = []
    for el in elements:
        if isinstance(el, PageBreak) and not include_page_breaks:
            continue
        if isinstance(el, Image):
            out_elements.append(cast(Element, el))
        # NOTE(crag): this is probably always a Text object, but check for the sake of typing
        elif isinstance(el, Text):
            el.text = re.sub(
                RE_MULTISPACE_INCLUDING_NEWLINES,
                " ",
                el.text or "",
            ).strip()
            if el.text or isinstance(el, PageBreak):
                out_elements.append(cast(Element, el))
With this change, newlines are no longer removed from Table or TableChunk elements.
This issue has a significant impact on our datasets because many of our image-based PDF files contain tabular data. Our OCR system extracts this information in Markdown format, where newline characters ("\n") are essential for preserving the table structure.
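To illustrate the problem, here is a small standalone sketch of what the substitution does to a Markdown table. The pattern below is a stand-in that assumes RE_MULTISPACE_INCLUDING_NEWLINES matches runs of whitespace including newlines; the table content is made up for the example.

```python
import re

# Stand-in for the library constant (assumption: it matches any run of
# whitespace, newlines included).
RE_MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")

# Made-up OCR output in Markdown table form.
markdown_table = "| name | qty |\n| --- | --- |\n| bolt | 4 |"

# Same substitution as in the loop above: every whitespace run becomes one space.
flattened = re.sub(RE_MULTISPACE_INCLUDING_NEWLINES, " ", markdown_table).strip()

print(flattened)
# | name | qty | | --- | --- | | bolt | 4 |
```

Once the rows are joined onto a single line, a Markdown renderer can no longer tell where one row ends and the next begins.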
As written, when an element is an instance of Table or TableChunk it won't be added to out_elements at all, which doesn't seem like desired behavior.
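One possible shape for the loop that addresses both points (tables keep their newlines and are still appended) is sketched below. This is not the code in this PR: the element classes are minimal stand-ins, the regex is an assumed equivalent of the library constant, and clean_elements is a hypothetical name used only for this example.

```python
import re
from typing import List, Optional

# Assumed equivalent of the library constant.
RE_MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")


# Minimal stand-ins for the library's element classes (illustration only).
class Element:
    pass


class Text(Element):
    def __init__(self, text: Optional[str] = None) -> None:
        self.text = text


class Image(Text):
    pass


class PageBreak(Text):
    pass


class Table(Text):
    pass


class TableChunk(Table):
    pass


def clean_elements(elements: List[Element], include_page_breaks: bool = False) -> List[Element]:
    """Hypothetical helper: collapse whitespace in text, but leave tables untouched."""
    out_elements: List[Element] = []
    for el in elements:
        if isinstance(el, PageBreak) and not include_page_breaks:
            continue
        if isinstance(el, (Image, Table)):
            # TableChunk subclasses Table in these stand-ins, so one check covers
            # both. The element is appended as-is, newlines included.
            out_elements.append(el)
        elif isinstance(el, Text):
            el.text = re.sub(RE_MULTISPACE_INCLUDING_NEWLINES, " ", el.text or "").strip()
            if el.text or isinstance(el, PageBreak):
                out_elements.append(el)
    return out_elements
```

Note that in this sketch tables are appended even when their text is empty, mirroring how Image is handled; whether empty tables should be dropped instead is a separate decision.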