Add indentation level as a metadata field

Open MthwRobinson opened this issue 2 years ago • 3 comments

The goal of this issue is to add indentation level as a metadata field. For documents such as PDFs, we can detect this based on the position of the bounding boxes. For MS Office documents, this may be available as an attribute within the document XML.
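For the MS Office case, here is a rough sketch of what reading indentation from the document XML could look like. It is only an illustration, not how the library does or will do it. It assumes WordprocessingML (.docx) markup, where a list paragraph carries its nesting depth in `w:ilvl` and an explicit left indent in `w:ind` (in twentieths of a point). The function name `paragraph_indent`, the output keys, and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by .docx document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def paragraph_indent(p: ET.Element) -> dict:
    """Return the list level and left indent of one <w:p>, if present."""
    ilvl = p.find("w:pPr/w:numPr/w:ilvl", NS)
    ind = p.find("w:pPr/w:ind", NS)
    level = ilvl.get(f"{{{W_NS}}}val") if ilvl is not None else None
    left = ind.get(f"{{{W_NS}}}left") if ind is not None else None
    return {
        "list_level": int(level) if level is not None else None,
        "left_twips": int(left) if left is not None else None,
    }


# Hypothetical sample: one nested list paragraph, one plain paragraph.
SAMPLE = f"""<w:body xmlns:w="{W_NS}">
  <w:p>
    <w:pPr>
      <w:numPr><w:ilvl w:val="1"/><w:numId w:val="3"/></w:numPr>
      <w:ind w:left="1440"/>
    </w:pPr>
  </w:p>
  <w:p/>
</w:body>"""

body = ET.fromstring(SAMPLE)
indents = [paragraph_indent(p) for p in body.findall("w:p", NS)]
```

Paragraphs without either element come back with `None` values, so a caller would still need a fallback such as style-based indentation.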

MthwRobinson avatar Jun 09 '23 19:06 MthwRobinson

This would be a huge help in terms of inferring the structure of documents and matching the Unstructured abstractions/ontology to a more domain-specific ontology that might be needed.

It would be great to expose a threshold setting that tells the partitioning to group x coordinates more or less aggressively. We're currently trying to accomplish this by rounding the top-left x coordinates to the nearest integer, or up to the nearest multiple of 5 or 10. This tends to work well for docs with a simple structure, but having an estimated hierarchy level as part of the partition result would be particularly helpful for more complex documents.
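To make the threshold idea concrete, here is a minimal sketch of grouping left-edge x coordinates into indentation levels. It is not the rounding approach described above and not anything the library provides: it swaps in a simple gap-based grouping, where a sorted x value starts a new level when it is more than `tolerance` away from the previous distinct value. The function name `indent_levels` and the `tolerance` parameter are hypothetical.

```python
from typing import Dict, List


def indent_levels(x_coords: List[float], tolerance: float = 5.0) -> List[int]:
    """Map each top-left x coordinate to a 0-based indentation level.

    Distinct x values are sorted; a new level begins whenever the gap to
    the previous distinct value exceeds `tolerance`. A larger tolerance
    groups more aggressively (fewer levels).
    """
    if not x_coords:
        return []
    ordered = sorted(set(x_coords))
    level_of: Dict[float, int] = {}
    level = 0
    prev = ordered[0]
    for x in ordered:
        if x - prev > tolerance:
            level += 1
        level_of[x] = level
        prev = x
    # Preserve the original element order in the result.
    return [level_of[x] for x in x_coords]


xs = [72.0, 72.4, 90.1, 108.3, 71.8, 90.0]
fine = indent_levels(xs, tolerance=5.0)     # [0, 0, 1, 2, 0, 1]
coarse = indent_levels(xs, tolerance=20.0)  # [0, 0, 0, 0, 0, 0]
```

Because each value is compared with its immediate neighbor, a long run of slowly drifting x positions would all land in one level, which may or may not be what a given document needs.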

bcartier avatar Jun 15 '23 12:06 bcartier

This is in progress with #1433

newelh avatar Sep 25 '23 13:09 newelh

Older than 180 days, but keeping active due to a recent reference in #2428

orlandounstructured avatar Feb 08 '24 22:02 orlandounstructured