Developer Experience: Element metadata side effects
Is your feature request related to a problem? Please describe.
There is a possibly undesired side-effect in how element metadata are processed. In unstructured/partition/common.py the _add_element_metadata method is invoked as part of add_metadata_with_filetype decorator for filetype partitioning. This method is intended to add additional information to the metadata generated with the element including filename and filetype, however the existing metadata is merged into a newly created metadata object rather than the other way around. Because of the way it's structured, new metadata fields can easily be forgotten and pose debugging challenges to developers.
Describe the solution you'd like All metadata fields created on the element should be carried over by default, while partition-specific metadata (filetype, filename, etc.) should still override.
Describe alternatives you've considered
Metadata can be carried over by explicitly adding it to the new ElementMetadata that's created in _add_element_metadata though for developers this can be difficult to identify as it's not explicitly documented.
Additional context Surfaced in #1268 when adding parent_id and category_depth fields to element metadata
@scanny - Do you have any thoughts on this?
I do. In my opinion we should incrementally reduce usage of this "helper" function replacing it with direct action on the metadata, then eliminate it if we don't find a compelling use-case for it.
I suspect the original use-case was something to do with converting LayoutElement objects (from inference) to Element objects. It's possible it still has a use for that. But it's complicated and most of what it does does not apply to Element objects, which for example have no ".links" attribute.
So I'd be inclined to isolate it as much as possible, find the place(s) where its complexity is still relevant, if any, then put it closer to where that action is happening.
I think doing so is part of a broader direction to encapsulate inference and LayoutElements so that work is not coupled to the Element and ElementMetadata core. In the end inference is just another way to produce Elements having ElementMetadata and engineers shouldn't need to understand or reason about how it works to work effectively with Elements and metadata on partitioners that don't involve inference.
Got it, thanks makes sense. Would any of that happen as part of the "reduce complexity" refactors you have in mind, or were you keeping that scoped to the parameters? Either way seems like this is good to keep on the backlog.
It could. These refactorings are bit like a game of Pick-up Sticks in my experience, when it's done well. You want to keep PRs small, such that they're quick to review, and also keep them motivated by some sort of actual tangible improvement whenever possible. Each of these is a "stick" so to speak :)
It definitely makes sense to start with ones that have no dependencies, and next move to "sticks" whose dependencies have been removed by prior steps. I'm inclined to think this add_element_metadata() bit is in the middle somewhere. Roughly like:
- some parameters are used by post-partitioning decorators
- at least some consolidation of those decorators is advisable, like combining
@process_metadata()and@add_metadata_with_filetype()such that the sequence of steps each performs can be interleaved to make it easier to reason about post-processing. - at least one post-partitioning decorator calls
add_element_metadata() - there's probably an opportunity to reduce complexity there by avoiding that call.
So bite-sized forward progress, interleaved with bug-fixes that are eased by those localized refactorings, and keeping at it over time is the general approach I'd say. Definitely not some sort of big-bang approach.
@scanny - Did you recent refactor address this at all? I remember you mentioning you some of the helper function couplings.
@MthwRobinson it did in as much as partition_html() used to use add_element_metadata() in the main partitioning loop and doesn't anymore.
It is still being used indirectly via the @add_metadata_with_filetype() decorator for the filename and url metadata items (one or the other). That decorator is used by a lot of partitioners so will need some specific attention in another PR to remove that use.
https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/file_utils/filetype.py#L601