Classify data package non-compliance issues by severity
From a recent scan of GitHub data packages, roughly 67% validate successfully against the current data-package.json schema.
Due to the pass-fail nature of the validation, it's not readily apparent whether a package is wholly out of compliance (e.g. missing name field) or failed for a minor reason (e.g. specified an invalid media type for one of its resources). I propose assigning a severity to each issue, depending on both the language in the spec and the nature of JSON.
My proposal is to rank issues by the language involved (MUST being highest-severity) as well as whether the issue occurs under an optional parent. As an example, a missing top-level name field would be critical severity, while violating a MUST under a SHOULD would be at most medium severity (e.g. an invalid hash on a particular resource).
Here's a quick table to illustrate what I'm describing:
| Severity | Examples |
|---|---|
| Critical | Invalid JSON, missing required field |
| High | Type error on required field, regex error on required field |
| Medium | Type error on top-level optional field that has children (e.g. resources) |
| Low | Type error on optional field, regex error on optional field |
It might also be useful to include a warning level, for when an explicit recommendation given by a SHOULD is disregarded, e.g. omitting a name field on a resource, although many SHOULD directives appear difficult to enforce in an automated manner.
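To make the table and the warning level concrete, here is a minimal sketch of how a validator could map an issue to a severity. The `Severity` enum, the `Issue` record and its `kind` strings, and the `classify` function are all hypothetical names invented for illustration; none of them come from the spec or from an existing validator.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Ordered so that severities can be compared or sorted.
    WARNING = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Issue:
    # Hypothetical issue kinds: "invalid_json", "missing_required",
    # "missing_recommended", "type", "regex".
    kind: str
    required: bool = False       # is the offending field required by the spec?
    has_children: bool = False   # top-level optional field with children, e.g. resources


def classify(issue: Issue) -> Severity:
    """Map an issue to a severity following the table above."""
    if issue.kind in ("invalid_json", "missing_required"):
        return Severity.CRITICAL
    if issue.kind == "missing_recommended":
        # A disregarded SHOULD, e.g. a resource without a name.
        return Severity.WARNING
    if issue.kind in ("type", "regex"):
        if issue.required:
            return Severity.HIGH
        if issue.kind == "type" and issue.has_children:
            return Severity.MEDIUM
        return Severity.LOW
    raise ValueError(f"unknown issue kind: {issue.kind!r}")
```

A report could then sort issues with `sorted(issues, key=classify, reverse=True)` so that the most severe problems appear first.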
@Deiz this makes good sense. Generally we should not fail on an optional or even recommended field but just issue warnings. This is almost a separate recommendation about how validators work against the schema.
@Deiz as someone implementing libraries to support Data Package and JSON Table Schema: I see the problem that you are trying to solve, but it seems to me incredibly difficult to solve in a reasonable way.
Two examples you gave:
- MUST under SHOULD as medium severity. To me, that seems wrong - it is fine that there is a SHOULD, but, IF the SHOULD is implemented, then the MUST becomes critical, and therefore we really do have a hard fail in this case.
- Type error on an optional field as low. Again, I'd say this is clearly critical. The fact the field is optional is one thing, but IF the field is present, code handling the data can fail splendidly if the type is incorrect, e.g.: should be a date field but it is an array, and is supposed to be loaded into a Postgres ARRAY field.
So, my point is: we might try to provide some hinting system for ranking errors, but is this something that we can truly abstract meaningfully?
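For comparison, the stricter position in the two points above can be sketched as a single pass/fail rule: once a field is present, any violation on it is a hard fail, and only an absent recommended field is downgraded. The function name `is_hard_fail` and the `kind` strings are hypothetical, chosen only for this illustration.

```python
def is_hard_fail(kind: str, field_present: bool) -> bool:
    """Return True if the issue should fail validation outright."""
    # Broken JSON or a missing required field always fails.
    if kind in ("invalid_json", "missing_required"):
        return True
    # An absent optional or recommended field leaves nothing to check,
    # so it is at most a warning.
    if not field_present:
        return False
    # The field is present: a type or regex violation is a hard fail,
    # regardless of whether the field was optional.
    return kind in ("type", "regex")
```

Under this rule the severity ladder collapses to two outcomes, which is the crux of the disagreement with the proposed table.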
@rgrp @danfowler any comments on this? My opinion is still the same as the above comment. I think if we are not acting on this, we can close the issue.
@pwalsh this seems to me like notes for implementors - making me think we do need a patterns or primer section for this kind of thing. We sort of have this here already: http://data.okfn.org/doc/publish-faq.
I am going to label this as such now and will raise an issue generally about the FAQ / Patterns / Best Practice stuff.
@rgrp +1