phenopacket-schema

Clarify how applications can tell which top-level element is intended

Open cmungall opened this issue 3 years ago • 5 comments

There are 3 top level elements (TLE) in phenopackets:

https://phenopacket-schema.readthedocs.io/en/latest/toplevel.html

All of the examples here:

https://phenopacket-schema.readthedocs.io/en/latest/examples.html

Use a phenopacket as a top level element

However, other repos have examples that use other top level elements; e.g. https://github.com/phenopackets/phenopacket-tools/blob/gh-pages/examples/families/family.yml

If an application is presented with a phenopacket document D, how should the application determine how to interpret it?

  1. Attempt to parse using each TLE schema until it finds one that passes. Note that in certain perverse cases, this could lead to ambiguity
  2. The application should attempt to sniff the right TLE from the filename. E.g. in the example above "family.yml" looks like family should be the TLE
  3. Behavior is undefined, and a phenopacket-conforming application must receive a tuple of two messages, both the document D plus an additional TLE type designator T
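
To make the ambiguity in strategy 1 concrete, here is a minimal sketch in Python. It assumes a toy notion of "parses": a document passes a TLE schema if all of its top-level keys are known to that schema. The field sets are abridged and are my assumption about the JSON field names, so check them against the schema before relying on them. `candidate_elements` and `resolve_element` are hypothetical helper names, not part of any phenopackets library.

```python
from typing import Any, Dict, List, Mapping, Set

# Abridged, ASSUMED top-level JSON field names; verify against the schema.
COMMON: Set[str] = {"id", "files", "metaData"}
KNOWN_FIELDS: Dict[str, Set[str]] = {
    "phenopacket": COMMON | {"subject", "phenotypicFeatures"},
    "family": COMMON | {"proband", "relatives", "pedigree"},
    "cohort": COMMON | {"description", "members"},
}


def candidate_elements(doc: Mapping[str, Any]) -> List[str]:
    """Return every TLE whose toy schema accepts all top-level keys of doc."""
    keys = set(doc)
    return [name for name, fields in KNOWN_FIELDS.items() if keys <= fields]


def resolve_element(doc: Mapping[str, Any]) -> str:
    """Strategy 1: accept the document only if exactly one TLE schema passes."""
    candidates = candidate_elements(doc)
    if len(candidates) != 1:
        raise ValueError(f"cannot determine top-level element: {candidates}")
    return candidates[0]
```

A document that only carries shared fields, such as `{"id": "x", "metaData": {}}`, passes all three toy schemas, which is the perverse case mentioned in strategy 1.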

None of these seem particularly satisfactory. Perhaps future versions of phenopackets could include a type designator field in each TLE so applications can clearly and unambiguously interpret a document

cmungall avatar Nov 17 '22 17:11 cmungall

It looks like phenopacket-tools is going with strategies 3 and 1:

  • https://github.com/phenopackets/phenopacket-tools/issues/111

i.e., the user should specify the element type (strategy 3), otherwise default to strategy 1.

I think this should be better documented in the main schema repo so all applications can implement analogous strategies

cmungall avatar Nov 17 '22 17:11 cmungall

A possible extension could be to use a wrapper object which will explicitly provide the type. This could also allow other structures, such as the Phenopacket GA4GH Pedigree implementation, to be added.

message PhenopacketWrapper {
    oneof message {
        Phenopacket phenopacket = 1;
        Family family = 2;
        Cohort cohort = 3;
        // Ga4ghPedigree pedigree = 4;  // possible additions might include this
    }
}

e.g.

# this is definitely a phenopacket, because it says so...
---
phenopacket:
    id: "12345"
    subject:
        id: "Bart"
    phenotypicFeatures:
        - type:
              id: "HP:0000952"
              label: "Jaundice"

rather than this

# this is probably a phenopacket, because it has a top-level 'subject' field 
---
id: "12345"
subject:
    id: "Bart"
phenotypicFeatures:
    - type:
          id: "HP:0000952"
          label: "Jaundice"
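
With the wrapped form, a reader could dispatch on the single wrapper key instead of guessing. A minimal sketch in Python, assuming the wrapper keys match the `oneof` field names proposed above and that the document has already been loaded into a mapping (JSON is used here only to avoid a YAML dependency). `unwrap` is a hypothetical helper name.

```python
import json
from typing import Any, Mapping, Tuple

# Keys mirror the fields of the proposed PhenopacketWrapper oneof.
WRAPPER_KEYS = ("phenopacket", "family", "cohort")


def unwrap(wrapper: Mapping[str, Any]) -> Tuple[str, Any]:
    """Return (element type, payload) for a document with exactly one wrapper key."""
    present = [key for key in WRAPPER_KEYS if key in wrapper]
    if len(present) != 1 or len(wrapper) != 1:
        raise ValueError(f"expected exactly one of {WRAPPER_KEYS}, got {list(wrapper)}")
    return present[0], wrapper[present[0]]


doc = json.loads('{"phenopacket": {"id": "12345", "subject": {"id": "Bart"}}}')
element, payload = unwrap(doc)
```

The unwrapped, second form gives `unwrap` nothing to dispatch on, so it raises instead of guessing.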

julesjacobsen avatar Jan 06 '23 13:01 julesjacobsen

It is possible to implement "sniffing", i.e. to determine the format (YAML, JSON, or Protobuf) and the top-level element, at least to some extent. A simple strategy can test whether the document looks like YAML or JSON using file signatures: JSON should start with {, while YAML has lines with comments (#), document separators (although I'd prefer not seeing them in our setting), or top-level fields. If this fails, the sniffer can look for magic bytes (e.g. gzip) or fall back to Protobuf (or throw).
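
A rough sketch of that format check in Python. This is not the phenopacket-tools implementation. `sniff_format` and the YAML key regex are my own illustration, and the heuristics are deliberately simple. The gzip magic bytes `1f 8b` are the only signature used.

```python
import re

GZIP_MAGIC = b"\x1f\x8b"
# A plain top-level "key:" line, e.g. `subject:` or `id: "12345"`.
_YAML_KEY = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*\s*:(\s|$)")


def sniff_format(payload: bytes) -> str:
    """Guess 'json', 'yaml', 'gzip' or 'protobuf' from the first bytes of payload."""
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        text = None  # not text, so it cannot be JSON or YAML

    if text is not None:
        # Only the first non-blank line is inspected.
        for line in text.splitlines():
            stripped = line.strip()
            if not stripped:
                continue
            if stripped.startswith("{"):
                return "json"
            if stripped.startswith("#") or stripped == "---" or _YAML_KEY.match(stripped):
                return "yaml"
            break

    if payload.startswith(GZIP_MAGIC):
        return "gzip"
    return "protobuf"  # last resort; a stricter sniffer could raise instead
```

Falling back to Protobuf is a guess rather than a detection, which is why throwing is a reasonable alternative for the last step.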

Sniffing the top-level element can also be done to some extent. For JSON and YAML, it is possible to search for discriminating top-level fields, i.e. fields that can be found only in specific elements (e.g. pedigree in Family). Unfortunately, I am not sure this simple algorithm will work with Protobuf, since the field names are not part of the payload.
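
For JSON and YAML, the field-based check could look roughly like the Python sketch below. Only `pedigree` (Family) and `subject` (Phenopacket) come from this thread. The other field names are my assumption about the schema and should be verified. `sniff_element` is a hypothetical name, and the input is assumed to be the already parsed top-level mapping.

```python
from typing import Any, Dict, Mapping, Optional, Set

# Fields ASSUMED to occur at the top level of only one element type.
DISCRIMINATING_FIELDS: Dict[str, Set[str]] = {
    "phenopacket": {"subject", "phenotypicFeatures"},
    "family": {"proband", "relatives", "pedigree"},
    "cohort": {"members"},
}


def sniff_element(doc: Mapping[str, Any]) -> Optional[str]:
    """Return the only element type whose discriminating fields occur, else None."""
    keys = set(doc)
    hits = [name for name, fields in DISCRIMINATING_FIELDS.items() if keys & fields]
    return hits[0] if len(hits) == 1 else None
```

Returning `None` for zero or several hits keeps the fallibility visible to the caller, who can then ask the user for an explicit element type.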

Sniffing can help, but it will always be fallible (I think) unless we add a wrapper element. The wrapper would, however, cause other pain.

I implemented the sniffing in phenopacket-tools.

ielis avatar Apr 05 '23 20:04 ielis

This is a great feature! It might be good to add a --strict flag (or to make this feature optional), because sometimes it is good for software to die if the input is not clear.

pnrobinson avatar Apr 05 '23 21:04 pnrobinson

@pnrobinson yeah, the sniffing is turned off if the user is explicit with the input using -f | --format or -e | --element CLI options.
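
The precedence can be sketched in a few lines of Python with `argparse`. This is only an illustration of "explicit option wins, sniffing is the fallback", not the phenopacket-tools code. The option names are the ones above, while the program name, the allowed choices, and `resolve` are hypothetical.

```python
import argparse
from typing import Callable, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sniff-demo")  # hypothetical program name
    parser.add_argument("-f", "--format", choices=["json", "yaml", "protobuf"])
    parser.add_argument("-e", "--element", choices=["phenopacket", "family", "cohort"])
    return parser


def resolve(explicit: Optional[str], sniff: Callable[[], str]) -> str:
    """Use the explicit CLI value if given; call the sniffer only otherwise."""
    return explicit if explicit is not None else sniff()


args = build_parser().parse_args(["-e", "family"])
element = resolve(args.element, lambda: "phenopacket")  # sniffer is not called
fmt = resolve(args.format, lambda: "json")              # no -f given, so sniff
```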

ielis avatar Apr 06 '23 14:04 ielis