Add indentation level as a metadata field
The goal of this issue is to add indentation level as a metadata field. For documents such as PDFs, we can detect this based on the position of the bounding boxes. For MS Office documents, this may be available as an attribute within the document XML.
This would be a huge help for inferring the structure of documents and for mapping the Unstructured abstractions/ontology to a more domain-specific ontology where one is needed.
It would be great to expose a threshold setting that tells the partitioning step how aggressively to group x coordinates. We're currently trying to accomplish this by rounding the top-left x coordinates to the nearest integer, or to a multiple of 5 or 10. This tends to work well for docs with a simple structure, but having an estimated hierarchy level as part of the partition result would be particularly helpful for more complex documents.
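For reference, here is a minimal sketch of the rounding workaround described above, assuming we already have the top-left x coordinate of each element. The function name `estimate_indent_levels`, the `base` parameter, and the sample coordinates are hypothetical and not part of the Unstructured API. This variant snaps to the nearest multiple of `base`.

```python
from typing import List, Sequence


def estimate_indent_levels(x_coords: Sequence[float], base: int = 5) -> List[int]:
    """Map top-left x coordinates to 0-based indentation levels.

    Hypothetical helper: `base` acts as the grouping threshold. A larger
    base groups x coordinates more aggressively.
    """
    # Snap each x coordinate to the nearest multiple of `base`.
    snapped = [base * round(x / base) for x in x_coords]
    # Rank the distinct snapped positions from left to right.
    rank = {x: level for level, x in enumerate(sorted(set(snapped)))}
    return [rank[x] for x in snapped]


# Made-up coordinates for illustration only.
xs = [72.0, 73.4, 90.2, 108.1, 89.7, 72.3]
print(estimate_indent_levels(xs, base=10))  # [0, 0, 1, 2, 1, 0]
print(estimate_indent_levels(xs, base=5))   # [0, 1, 2, 3, 2, 0]
```

With `base=5`, the values 72.0 and 73.4 fall on opposite sides of a rounding boundary and end up at different levels, which is the kind of fragility that makes a built-in, tunable hierarchy estimate attractive for more complex documents.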
This is in progress with #1433
Older than 180 days, but keeping this active due to a recent reference in #2428