
Object level metadata from JSON schema

Open birnbera opened this issue 2 years ago • 0 comments

This relates to #2963, but I wanted to create a separate issue because it is a very different method of updating metadata in packages. I'm posting it here as an interesting option for other users and as something to consider for inclusion as a Quilt feature in future releases.

When creating packages, it is usually straightforward to add package-level metadata. Adding metadata to the individual objects, however, can be challenging. In our case, we already store some metadata in the paths to our files, such as sample IDs and several other types of entity IDs depending on the use case. Since Quilt already includes logic to validate individual entries in a package manifest, I found a way to use that same schema to infer metadata for objects based on their path.

When Quilt performs entry validation in a workflow it generates a list of Python dictionaries, with the keys logical_key, size, and meta:

https://github.com/quiltdata/quilt/blob/7051b2b3dece618140f73bc17a390910df2acd58/api/python/quilt3/workflows/__init__.py#L264-L280

The meta key refers to the user_meta subkey of the object's metadata. If you create a JSON schema that matches a logical_key using a regex pattern, it is possible to include named capture groups, e.g.:

{
    "type": "object",
    "properties": {
        "logical_key": {
            "type": "string",
            "pattern": "^samtools/(?P<sampleId>[^/]+)/[^/]+\\.txt$"
        }
    }
}
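To see what the named group in this pattern captures, here is a minimal sketch that uses only Python's re module. The helper name captures_for and the sample logical keys are made up for illustration; they are not part of Quilt.

```python
import re

# Same pattern as in the schema above, after JSON unescaping ("\\." becomes "\.")
PATTERN = r"^samtools/(?P<sampleId>[^/]+)/[^/]+\.txt$"


def captures_for(logical_key):
    """Return the named captures for a logical key, or None if it does not match."""
    m = re.search(PATTERN, logical_key)
    return m.groupdict() if m else None


# Hypothetical logical keys, for illustration only
print(captures_for("samtools/S001/flagstat.txt"))  # {'sampleId': 'S001'}
print(captures_for("fastqc/S001/report.html"))     # None
```

The dictionary returned by groupdict() is what ends up being merged into an entry's meta.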

Normally, named captures have no effect during validation beyond serving as documentation. However, it is possible to extend a built-in jsonschema validator with additional logic. In our case, we have extended the properties validator so that it assigns metadata to the meta dictionary before proceeding with validation. This is the code used to do this:

import re

from jsonschema import Draft7Validator, validators


def extend_with_meta_assignment(validator_class):
    validate_properties = validator_class.VALIDATORS["properties"]

    def set_meta_from_pattern(validator, properties, instance, schema):
        if not validator.is_type(instance, "object"):
            return

        if "logical_key" in properties and "meta" in properties:
            lkey_subschema = properties["logical_key"]
            meta_subschema = properties["meta"]

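            # Only fill in metadata when the key matches but meta does not yet validate
            if validator.is_valid(instance.get("logical_key"), lkey_subschema):
                if not validator.is_valid(instance.get("meta"), meta_subschema):
                    meta = instance.setdefault("meta", {})
                    # Pattern has to match logical_key (skip subschemas without a pattern)
                    pattern = lkey_subschema.get("pattern")
                    m = re.search(pattern, instance["logical_key"]) if pattern else None
                    if m is not None:
                        # Copy each named capture group into the entry's metadata
                        meta.update(m.groupdict())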

        # Descend and process as normal
        yield from validate_properties(validator, properties, instance, schema)

    return validators.extend(
        validator_class,
        {"properties": set_meta_from_pattern},
    )


MetadataAssignmentValidator = extend_with_meta_assignment(Draft7Validator)

After validation with MetadataAssignmentValidator, the object that was passed in has updated meta fields based on the named captures in the pattern. This object can be used to update each PackageEntry before building/pushing the package.
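As a rough sketch of that last step, the helper below copies the inferred metadata back onto the package. It assumes pkg behaves like a quilt3.Package, meaning that indexing by logical key returns an entry with a set_meta(meta) method. The helper name apply_entry_meta is hypothetical, and the exact calls should be checked against the quilt3 version in use.

```python
def apply_entry_meta(pkg, entries):
    """Copy inferred metadata from validated entry dicts onto package entries.

    Assumption: `pkg[logical_key]` returns an object exposing set_meta(meta),
    as a quilt3 PackageEntry is expected to. `entries` is the list of dicts
    (logical_key, size, meta) that was passed through the validator.
    """
    updated = []
    for entry in entries:
        meta = entry.get("meta")
        if meta:  # skip entries for which nothing was inferred
            # Pass a copy so later changes to the entry dict do not leak into the package
            pkg[entry["logical_key"]].set_meta(dict(meta))
            updated.append(entry["logical_key"])
    return updated
```

The function returns the logical keys it touched, which makes it easy to log or review what changed before pushing.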

There are a couple of things to watch out for:

  1. You want to be careful about matching multiple subschemas. The oneOf keyword is useful here:
       "type": "array",
       "items": {
           "oneOf": [ {...} ]
       }
  2. Directly using the get_pkg_entries_for_validation function from the linked code above would be a mistake, because it saves memory by reusing a single empty dictionary whenever a package entry has no metadata. Every such entry's meta could then be a reference to the same object, so fields assigned to one entry would show up on all of them.
  3. This only works for Python-style regular expressions. JS named captures use a different syntax ((?<name>...) rather than (?P<name>...)), so if you want to maintain a single set of entry schemas for both validation and setting metadata, Quilt has to continue using a Python JSON schema implementation.
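The shared-dictionary pitfall from the second point can be reproduced without Quilt at all. The sketch below is only an illustration: both builder functions and the sample keys are hypothetical stand-ins, not Quilt code.

```python
def entries_sharing_meta(keys):
    """Build entries that all point at ONE empty dict (the memory-saving shortcut)."""
    empty = {}
    return [{"logical_key": k, "size": 0, "meta": empty} for k in keys]


def entries_with_own_meta(keys):
    """Build entries that each get their own fresh dict."""
    return [{"logical_key": k, "size": 0, "meta": {}} for k in keys]


keys = ["samtools/S001/a.txt", "samtools/S002/b.txt"]

shared = entries_sharing_meta(keys)
shared[0]["meta"]["sampleId"] = "S001"
print(shared[1]["meta"])  # {'sampleId': 'S001'}: leaked from the first entry

own = entries_with_own_meta(keys)
own[0]["meta"]["sampleId"] = "S001"
print(own[1]["meta"])  # {}: the entries are independent
```

Giving each entry its own dictionary costs a little memory but keeps the assigned metadata per object.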

birnbera avatar Oct 05 '23 00:10 birnbera