Graph model for domain metadata

Open homedirectory opened this issue 1 year ago • 1 comments

Description

In TG there is a subsystem responsible for generation of domain metadata. It was created specifically for EQL.

EQL3 metadata has the following structure:

Unordered set of entity metadata instances, where each instance:
Has a set of property metadata instances corresponding to its declared properties.
Composite/Union property metadata includes a set of subproperty metadata instances.

eql3-metadata-full

This structure resembles a graph, although it is more akin to a forest.

There are several places where it is replicated:

EqlPropertyMetadata, EqlEntityMetadata, EqlEntityMetadataHolder.
QuerySourceInfoProvider, QuerySourceInfo, AbstractQuerySourceItem.
AbstractRetrievalModel (less resemblance, but could benefit by reusing a similar abstraction)

Each implementation of the structure is different and, as a result, supports only specialised operations. One consequence is that more effort is required to understand, use and maintain them.

It is proposed to introduce a single abstraction for modelling entities and their properties -- a graph. Graphs are well-established data structures, and there are Java libraries that provide them:

The previous example could be modelled with a graph as follows:

graph-metadata-full

This structure has the following differences:

Properties are represented as graph arcs instead of tree nodes.
The type of a property is given by the tail of its arc (i.e., near the arrow) rather than being a part of it.

Note that such graphs could easily accomodate loops and directed cycles.

The advantages of this abstraction is the ability to reduce all operations to graph traversal, which should lend itself nicely to reuse and provide clarity of expression.

Expected outcome

A graph model for domain metadata is evaluated and adopted if deemed to be fitting and:

Exhibits reasonable memory consumption (TBD quantitavely).
Graph traversal can be clearly expressed and arbitrary traversal rules can be applied.

May 09 '24 08:05 homedirectory

This is an attempt to answer the following questions:

How to structure top-level metadata?
- Graph
- Forest
How to structure property metadata?
- Should it include the enclosing type?
- Should it include the property type?

The following criteria should be used to compare alternatives:

Space complexity of metadata structures.
Time complexity of metadata operations.
Metadata interface -- simplicity is desirable.

Metadata Overview

In general, metadata describes:

Types
- Entities
  - Nature (persistent, synthetic, union)
  - Nature-specific metadata
    - Persistent: table name
    - Union: query model for each member property
    - Synthetic: query models
  - Information from annotations (title, desc)
- Composite
- Primitive
- Collectional
- Other (PropertyDescriptor, ...)
Properties
- Nature
- Nature-specific information
  - Persistent: column name
  - Calculated: expression
  - Collectional: link property
  - Crit-only: mnemonics
- Auxiliary information (requiredness, key member, ...)

Metadata Structure Options

Type metadata can be expressed with the following ADT:

TypeMetadata = EntityMetadata 
             | CompositeTypeMetadata 
             | PrimitiveTypeMetadata 
             | CollectionalTypeMetadata
             | OtherTypeMetadata

Let Tn be the total number of types and Pn the average number of properties in a type.

Options for TypeMetadata structure (referred to as TMn):

TM1 Stores a set of PropertyMetadata (applicable only to [Entity,CompositeType]Metadata).

This may not necessarily be a set but any data structure.

Let TM1_S = O(Tn * Pn) be the space required to store properties.

Let TM1_Pt be the time required to retrieve a property given this structure. Most likely Map String PropertyMetadata will be used, thus TM1_Pt = O(1).

Options for PropertyMetadata structure (referred to as PMn):

PM1 Stores enclosing EntityMetadata in a field.

Increases space complexity by PM1_S = Tn * Pn.
PM2 Stores property type TypeMetadata in a field.

Increases space complexity by PM2_S = Tn * Pn.

Metadata Operations

Assume that DomainMetadata is the entry point for accessing metadata.

Operations on metadata (referred to as OPn):

OP1 Retrieve type metadata by class.

DomainMetadata.typeMetadata :: Class -> TypeMetadata

OP2 Retrieve property metadata given its enclosing type and name.

If TM1, then:

EntityMetadata.propertyMetadata :: String -> PropertyMetadata
CompositeTypeMetadata.propertyMetadata :: String -> PropertyMetadata

Otherwise:

DomainMetadata.propertyMetadata :: EntityMetadata -> String -> PropertyMetadata
DomainMetadata.propertyMetadata :: CompositeTypeMetadata -> String -> PropertyMetadata

OP3 Access enclosing type of a property.

If PM1, then:

PropertyMetadata.enclosingType :: EntityMetadata | CompositeTypeMetadata

Otherwise:

DomainMetadata.enclosingType :: PropertyMetadata -> EntityMetadata | CompositeTypeMetadata

OP4 Access the type of a property.

If PM2, then:

PropertyMetadata.type :: TypeMetadata

Otherwise:

DomainMetadata.propertyType :: PropertyMetadata -> TypeMetadata

TM1, PM1 and PM2 share the following:

If present:
- Metadata interface: simple.
  
  OP2, OP3 and OP4 don't require access to DomainMetadata.
- Space complexity: potential overhead if the same relationships are stored in the top-level metadata structure.
- Time complexity: reduced.
  
  No need to access DomainMetadata which may perform the operaion inefficiently.
If absent:
- Metadata interface: increased complexity.
  
  OP2, OP3 and OP4 require access to DomainMetadata.
- More dynamic control over type-property relationships since they are not tied to PropertyMetadata. Although, this advantage seems to bring no benefit, as type-property relationships remain fixed after metadata generation.

Top-level metadata structure

In describing how each structure performs operations outlined above it is assumed that neither of PM1, PM2, TM1 are present, as their presence would remove the need for the top-level metadata structure to support these operations.

Graph

Internal representation:

TypeMetadata as nodes.
PropertyMetadata as arcs source -> target, where source is the enclosing type and target is the property type.

Operations:

OP1 Achievable by additionally introducing Map Class TypeMetadata.
OP2 Achievable by: a. Retrieve [Entity,CompositeType]Metadata through OP1 if not given it directly. b. Find PropertyMetadata with the given name among outgoing edges.
OP3 Find source of the arc.
OP4 Find target of the arc.

Criteria:

Space complexity: A * (Tn * Pn) + Tn + M.

A -- space required for a single arc:
- source :: TypeMetadata
- target :: TypeMetadata
- value :: PropertyMetadata
M = Tn * 2 -- space required for efficient OP1.

Lower bound: O(A * (Tn * Pn) + Tn). Upper bound: O(A * (Tn * Pn) + Tn + PM1_S + PM2_S + TM1_S).

Leaves choices of TM1, PM1, PM2 open.
Time complexity. Let Nt and Et be the time required to find a node and an arc in a graph respectively.

OP1 -- O(1)
OP2
- With TM1: O(1) + TM1_Pt
- Without TM1: O(1) + Nt + O(Pn / 2)
OP3
- With PM1: O(1)
- Without PM1: O(1) + Et + O(1)
OP4
- With PM2: O(1)
- Without PM2: O(1) + Et + O(1)

Forest

Internal representation: Set TypeMetadata. This representation requires TM1 to be able to access properties from types.

Operations:

OP1 Achievable by changing the internal representation to Map Class TypeMetadata.
OP2 Achievable by: a. Retrieve [Entity,CompositeType]Metadata through OP1 if not given it directly. b. Find PropertyMetadata with the given name in the set of properties.
OP3 Search through all TypeMetadata for a type that encloses the given property.

Inefficient without PM1.
OP4 Impossible without PM2.

Criteria:

Space complexity: D + K + L.

D -- space required for the top-level structure:
- Set TypeMetadata: Tn.
- Map Class TypeMetadata: Tn * 2.
K -- optional overhead that depends on the structure of property metadata. Given that we want both effiency of OP3 and possibility of OP4, K = PM1_S + PM2_S.

L -- space required by TM1 to make this data structure work at all. L = TM1_S

Lower bound: O(Tn + PM1_S + PM2_S + TM1_S). Upper bound: O(Tn + PM1_S + PM2_S + TM1_S).

Strictly requires TM1, PM2. Requires PM1 for efficiency of OP3.
Time complexity:

OP1 -- O(1).
OP2 -- O(1) + TM1_Pt, due to TM1.
OP3
- With PM1: O(1).
- Without PM1: O(Tn * Pn).
OP4 -- O(1), due to PM2.

Conclusion

Graph offers more freedom in choosing the structure for type and property metadata:
- Space complexity of type and property metadata can be reduced at the cost of increased time complexity for operations. Although, this cost will depend on the graph implementation (Nt and Et).
Forest imposes some structure options on type and property metadata: TM1, PM2. PM1 is almost as strictly imposed, since its absence results in highly inefficient OP3. Forest operations fair better in terms of time complexity thanks to the increased space complexity.

May 09 '24 19:05 homedirectory