5 Levels of data reusability
Tim Berners-Lee's 5-star Open Data is a really cool mental model for thinking about open data quality. Check out the website if you haven't seen it: https://5stardata.info/en/
But I think it doesn't quite fit what most developers would consider usable data, so it might make sense to provide a different list that focuses on data reusability.
Mostly, it lacks typed data as a distinguishing criterion: whether the data has a machine-readable schema. Personally, I think this is one of the most important characteristics. It's probably one of the main reasons why SQL is so incredibly popular, and why pretty much all programming languages have things like structs or classes with (type-safe) properties. But not all data has this, so I think it deserves its own distinction - a separate level, if you will.
Also, we can introduce verifiability of data, powered by Atomic Commits (or any other technology that does something similar).
I'm not sure whether we should call it '5 levels'; it's definitely not as catchy as '5 stars'. I'm also not fully certain about 'reusability', but I think it describes what I mean pretty well.
Anyway, here's a work-in-progress draft. Feel free to share ideas / criticism / thoughts!
========
5 Levels of data reusability
Not all data are created equal. There are notable differences in how much you can do with data and how much effort that takes. The more reusable data is, the easier it is to use for a developer, researcher or any other type of data user. Reusability is about being able to transform, sort, query, serialize, modify, render and audit data without requiring too much work.
This list is inspired by Tim Berners-Lee's 5-star open data.
Level 0: proprietary data
If you don't give others the rights to read, use or modify your data, its reusability is zero.
That's why it's important to have a license that allows others to use your data. A good permissive option is the Open Database License. Creative Commons licenses are also a good way to clearly communicate whether, and under which conditions, your data may be re-used.
It's also important to use open formats (such as CSV, JSON or PNG), instead of proprietary formats (tied to specific vendors, such as PSD or RAR).
Level 1: unstructured data
Examples: images, videos, plain text
Unstructured data is the least reusable. Humans can read it, and AI / machine learning systems can draw more conclusions from it than ever, but it's hard to build an actual application or graphic from unstructured data alone.
Hi! I'm Joep, I'm born in 1991.
Level 2: structured data
Examples: CSV, XML, JSON, TOML, Excel
Structured data can be read by machines, and this allows us to do all sorts of useful things.
We can query, sort and filter.
But still, this type of data often requires human input when it needs to be processed.
And we don't have guarantees about which fields will be filled, or what their datatypes are.
One time, a birthYear can be a string, and the next time it can be a number.
Data can be structured, but still unpredictable.
{
"name": "Joep",
"birthYear": 1991
}
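For example (an illustrative record, not taken from any real dataset), the very next record in the same dataset could look like this, and nothing about the format would stop it:
{
  "name": "Joep",
  "birthYear": "nineteen ninety-one"
}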
If we want predictability, we need to make it type-safe.
Level 3: type-safe data
Examples: SQL + DB schema, JSON + JSON Schema, XML + XSD, RDF + SHACL, in-memory data in type-safe programming languages
Type-safe data means that every value of the data has an explicit datatype. It is strongly typed and has a clear schema that describes which properties you can expect in a Resource. This means that someone re-using type-safe data can know for certain that it conforms to a specification, a set of rules. The shape of the data is predictable. This predictability means that developers can safely re-use it in their system without worrying about missing fields or datatype errors.
Lots of software has internal type safety, especially if you use a type-safe programming language like TypeScript, Kotlin or Rust. However, when the data leaves the system, a lot of type-related information is lost. Even if the schema is described somewhere, it is often not machine-readable. The best way to have type-safe data is to describe the schema in a machine-readable format.
In SQL, we can use a DB schema. In JSON, we can add a JSON Schema file. For XML, we have XSD.
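For instance, a minimal JSON Schema for the person record above could look like this (a sketch covering just these two fields, not a complete schema):
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "birthYear": { "type": "integer" }
  },
  "required": ["name", "birthYear"]
}
A record that lacks a name, or where birthYear is suddenly a string, now fails validation instead of silently slipping through.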
In Atomic Data, the Properties themselves (the links in the keys in JSON-AD) describe the required datatypes, which helps developers who re-use the data understand what they can expect from a value.
{
"https://atomicdata.dev/properties/isA": ["https://atomicdata.dev/classes/Agent"],
"https://atomicdata.dev/properties/name": "Joep",
"https://atomicdata.dev/properties/birthYear": 1991,
"https://atomicdata.dev/properties/worksOn": "Atomic Data",
}
Level 4: browsable data
Examples: Atomic Data, properly hosted RDF
If your data is connected to other pieces of machine-readable data, it becomes browsable, similar to how websites link to each other. This effectively creates a web of data, and allows for a whole new way to think about the internet. This is what enables decentralized apps, true data ownership, and a whole new class of applications.
{
"https://atomicdata.dev/properties/isA": ["https://atomicdata.dev/classes/Agent"],
"https://atomicdata.dev/properties/name": "Joep",
"https://atomicdata.dev/properties/birthYear": 1991,
"https://atomicdata.dev/properties/worksOn": "https://atomicdata.dev",
}
Level 5: verifiable data
Examples: Atomic Data + Atomic Commits
When your data is verifiable, other people can verify who created and modified it. They can use cryptography to validate signatures, which proves that a specific person or machine created a piece of data.
{
"https://atomicdata.dev/properties/isA": ["https://atomicdata.dev/classes/Agent"],
"https://atomicdata.dev/properties/name": "Joep",
"https://atomicdata.dev/properties/birthYear": 1991,
"https://atomicdata.dev/properties/worksOn": "https://atomicdata.dev",
"https://atomicdata.dev/properties/previousCommit": "https://atomicdata.dev/commits/EF18751AE781",
}
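To sketch what powers this (a simplified illustration - the example.com URLs and values below are made up, and the property URLs are indicative; see the Atomic Commits specification for the normative structure), every change arrives as a Commit: a resource that describes which subject changed, which Agent signed it, and a cryptographic signature over the serialized change:
{
  "https://atomicdata.dev/properties/isA": ["https://atomicdata.dev/classes/Commit"],
  "https://atomicdata.dev/properties/subject": "https://example.com/agents/joep",
  "https://atomicdata.dev/properties/signer": "https://example.com/agents/joep",
  "https://atomicdata.dev/properties/createdAt": 1611489928000,
  "https://atomicdata.dev/properties/set": {
    "https://atomicdata.dev/properties/name": "Joep"
  },
  "https://atomicdata.dev/properties/signature": "base64-encoded-signature..."
}
Anyone who knows the signer's public key can re-serialize the commit and check that the signature matches, without having to trust the server that hosts the data.
========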
Unfortunately, I think there are some details that I disagree with.
Firstly, it lacks typed data [...] Secondly, I think it could be a bit shorter.
Ha ha, so first you argue that it should be longer, and then that it should be shorter.
I think everyone can agree with you that the 5-star model is a great simplification and that more qualities are relevant, with varying importance in different contexts.
That you find the freedom quality unimportant is to me simply a sign that you live in a spoiled age and environment where you can take that freedom for granted. I live in a similar environment as you, but others are less fortunate: e.g. access to the internet is not a given, and crippled alternatives are a real threat.
Copyleft licensing (e.g. GPL, as opposed not only to proprietary licensing schemes but also to liberal licensing like MIT and BSD) is more important than ever, and less popular than ever.
@jonassmedegaard I agree that open licenses are important (I spend a lot of time promoting open data), but I think it's not on the same scale / dimension / axis as usability. Also, the focus of Tim's list was on open data, whilst this one is on data usability. Lots of data is closed / private, yet should still be usable for colleagues, for example. But maybe I should add a level 0 for closed data, or data with a proprietary license, similar to Tim's model. At least I should clarify the difference in focus.
For me the most important change is adding the type safety / schema requirement. The lack of type safety is such a big issue with linked data nowadays, and a big disconnect from how most developers practically reason about data usability.
Can you give any examples of gaining the 4th star without defining a schema and therefore types for your data?
Examples of RDF without a schema? Well, that's pretty much all RDF, unfortunately. Only a small subset has SHACL / ShEx descriptions, and to make things worse, these descriptions are not trivial to discover by looking at the predicates / triples themselves.
Or do you mean something else?
ah, so by "types" you don't mean that a FOAF:birthday must be within rdfs:range of rdf-schema#Literal, but instead (or additionally) e.g. some rules that a FOAF:Agent can have only 0 or 1 birthdays.
If I understand that correctly, then I recommend that you describe your requirement as extended types or shapes, to avoid confusion with OWL-style RDF Schema types.
It is my understanding that all RDF is tied to classes and is therefore at least loosely typed.
Seems to me that you want to require strongly typed data.
I suspect that your push at promoting a qualifier list will fare better if you don't disguise it as an evolution of TimBL's list - your list fundamentally chooses to prioritize a different set of qualities, and using another set as its basis will confuse rather than help readers see where you are going with it.
ah, so by "types" you don't mean that a FOAF:birthday must be within rdfs:range of rdf-schema#Literal, but instead (or additionally) e.g. some rules that a FOAF:Agent can have only 0 or 1 birthdays.
If I understand that correctly, then I recommend that you describe your requirement as extended types or shapes, to avoid confusion with OWL-style RDF Schema types.
rdfs:range is not to be used for shape validation, but for inferencing new triples. That's part of what makes RDF schemas so confusing, and why I would not consider all RDF to be typed.
It is my understanding that all RDF is tied to classes and is therefore at least loosely typed.
Seems to me that you want to require strongly typed data.
I think we have some confusion about type-safe. You might be right that strongly typed is a more fitting term here. I mean that developers can safely make assumptions about data shapes. They know that Persons have a BirthDate, and that that BirthDate is an ISO datetime, for example. The first relates to a Class description, the second one to a property datatype.
Would renaming it to strongly typed remove your concern?
I suspect that your push at promoting a qualifier list will fare better if you don't disguise it as an evolution of TimBL's list - your list fundamentally chooses to prioritize a different set of qualities, and using another set as its basis will confuse rather than help readers see where you are going with it.
Yeah, perhaps you're right. I do want to give credits to the original list, at least, but perhaps make the difference in goals clearer.
rdfs:range is not to be used for shape validation
True - it is you, not I, who insists that types must involve shapes.
rdfs:range is to be used for triple validation.
You wrote in section 3 of your draft:
Type-safe data means that every value of the data has an explicit datatype, and that these datatypes can be constrained. This means that someone re-using this data can know for certain that it conforms to a certain specification, a set of rules. The shape of the data is predictable.
It would be lovely if there could be a promise that all data is predictable, but you are overselling above: tying a triple to an OWL class with rdfs:range defined is a way to have an explicit datatype and have it constrained. But it does not promise predictability.
Seems you want to mandate SHACL in your quality assessment. Which is fine - just don't call it a refinement of TimBL's quality assessment, because it really is something else, and trying to bootstrap yours from his will lead to confusion (at best).
Would renaming it to strongly typed remove your concern?
Renaming to SHACL and dropping the introduction tying it to TimBL's 5-star model would remove my concerns.
Possibly rewriting to strongly typed, or to only reference TimBL's 5-star model as inspiration, might each remove one of my concerns - depends on how exactly it is rephrased ;-)
SHACL constraints are about the structure which makes the data easily validatable in the constructs you designed the SHACL definitions for.
OWL constraints are about the meaning which makes the data agile - allows it to fit into constructs you didn't imagine ahead.
You can have a SHACL structure for a match-making construct that says "men attracts women" and "women attracts men", which works well for your business but makes your data unusable for another context of a gay match-making construct.
After a quick search and reading only the first few paragraphs so far, I encourage you to read this (I will now finish doing so myself): https://www.semanticarts.com/shacl-and-owl/
If your only goal is to gather data that fit into existing knowledge, then SHACL is adequate.
If you want to gather data potentially revealing new knowledge you didn't or couldn't foresee, then OWL is one way to (almost literally) "keep an open mind" when type-casting!
I.e. if you care only about developers (working towards a well-defined product), then SHACL is adequate for you. Just please don't assume it is adequate for all of us (which I dare say TimBL's 5-star model is - not as the only qualities, but as a sensible universal core set of qualities).
I don't think too much discussion on the merits of SHACL, OWL or RDFS is relevant to this issue - they indeed serve different needs. All I want is to make readers understand the value of typed-ness. That's why I've included a list of examples (XML + XSD, JSON + JSON Schema, RDF + SHACL). I think that adding RDF + OWL to this list would only add to the confusion about OWL's unsuitability for shape / type validation, which is already harming the semantic web community.
Possibly rewriting to strongly typed, or to only reference TimBL's 5-star model as inspiration, might each remove one of my concerns - depends on how exactly it is rephrased ;-)
Fair points. I've now explicitly mentioned strongly typed and changed the introduction of this issue.
I've also added a level 0 for proprietary data, and mentioned open licenses @jonassmedegaard what do you think?
🤷

I think maybe the underlying issue is that your "user" is a developer, where my user is either a developer or a non-developer.
This leads to your "usability" being quite different from mine.
Data that is verifiable might be described by a developer as more (re-)usable.
A non-developer would more likely describe that as more stable or more flexible.
I see TimBL's 5-star model as promoting usability where your model is about "developability".
Makes sense that your target audience is developers. But I urge you to consider using terms that won't cause confusion when a developer and a non-developer meet and exchange ideas - where one of them has been enlightened by the 5-star model and the other by your different model.
Perhaps the solution is to replace all occurrences of "usable" with "robust and flexible"? Not sure...
I don't want to target just developers, but anyone using data. It depends on what you count as using data. I think of things like displaying the data, querying it, sorting it, filtering it, converting it, serializing it, storing it... All of these things are actually very hard to do with RDF. You can call this developability, but I fail to see how this is different from usability.
So I think the 5* model does not promote usability at all. Maybe the first 3 stars do, but 4 and 5 do not. I've spent enough years going all-in on RDF stacks to know why this is - and this is exactly what has led me to start work on Atomic Data. The core problem is that it is easy to create RDF, but using it is difficult. It's hard to select a single value, or build any system that has some functional requirements on top of it. This page in the docs goes into more specifics, but I'm pretty sure you're already familiar with that.
But I agree that usability should at the very least be more clearly defined. Robust and flexible definitely matter, but it's maybe more about the freedom that a data user has. Can they query it? Can they convert and serialize it how they want? Can they sort / filter it reliably? Can they build an app with it? Can they create statistics from it? All of these things, for me, depend strongly on strict schemas / type safety.
I am well aware that Atomic Data grew out of frustration that existing RDF was, in your view, unusable for developers.
...and I am also aware that my framing of that frustration as done above is not how you would frame it - I sneaked in "existing" because your own JSON-LD is RDF too, and I sneaked in "for developers" because arguably only developers knowingly display, query, sort, filter, convert, serialize, or open it - non-developers "apply it" and see if it sticks or not.
So I think the 5* model does not promote usability at all. Maybe the first 3 stars do, but 4 and 5 do not.
Really? The example for 4-star data even explicitly mentions a use case for a consuming user:
use URIs to denote things, so that people can point at your stuff
...describes the use case of applying data to your own communication - e.g. linking to it from a blog post or a tweet or a toot or an sms, which will "stick" for 4-star data but fail for lesser-star data.
Those examples are really about requirements, though (not use cases) - the 4-star example only mentions a use case because the requirement of "denotability" is difficult to describe otherwise.
The costs and benefits section is also not about use cases, but it is more closely related to them.
With 4-star data, consuming users can...:
- link to it from any other place (on the Web or locally).
- bookmark it.
- reuse parts of it.
- maybe reuse existing tools and libraries, even if they only understand parts of the pattern the publisher used.
- combine it safely with other data.
...and producing users can...:
- control it with fine granularity and optimise access to it (load balancing, caching, etc.)
- enable other data publishers to link into it, promoting it to 5 star!
With 5-star data, consuming users can...:
- discover more (related) data while consuming it.
- directly learn about its schema.
...and producing users can...:
- make it discoverable.
- increase its value.
- gain the same benefits from the links as consuming users.
You may not agree that any of those use cases are relevant for your target audience, but that's very different from saying that the model "does not promote usability at all."
Can they $FOO with 5-star data? Depends on what exactly $FOO is, but 5-star data is a fine core set of requirements - take away ANY of the 5 stars and you have a horrible dataset, also for Atomic Data use.
Suggestion: Write an addendum to 5-star data, adding a 6th star for strongly typed data.
"Atomic Data" ensures 5-star data and adds an additional 6th bonus star"
...because as I see it, you introduce only one additional quality, stricter shape-based types, where one notable benefit is improved verifiability.
...because that's a main driver for defining graph shapes e.g. with SHACL: Simpler and more efficient validation than is practically possible with RDFS and OWL (reasoning is amazing in theory, but complex and heavy in many real-world scenarios!).
Suggestion: Write an addendum to 5-star data, adding a 6th star for strongly typed data.
I still think that it's great to adhere to all the stars in the original model, of course. The core problem is that there is no schema / type safety anywhere in the 5 stars, even though it's one of the most fundamental needs for pretty much any piece of data. You're suggesting I should add it as step 6, but for me that's far too high in the ladder - it should be at step 3, not step 6. And that's what developers all over the world have been doing for years, too: every SQL user has a strict schema, and knows why that's important. Having great typing support is such an important need that it should not be some level 6 thing on top of all the others. It's a far more fundamental need than using URLs. If there is anything the last 20 years of the semantic web have shown, it is that simply adding URLs (without fixing protocol and schema) only makes developers steer clear of the semantic web. We have so much RDF on the web that is practically unusable... Data becomes truly browsable only if machines can browse it.
...because as I see it, you introduce only one additional quality, stricter shape-based types, where one notable benefit is improved verifiability.
The verifiability has nothing to do with shapes or schemas - it's because of commits and cryptographic signatures. It's really a different quality and a different piece of spec. I don't think it's possible to semantically combine it with the schema / type-safe step.
for me that's far too high in the ladder - it should be at step 3, not step 6
Then add a 2½th or 3½th star.
verifiability has nothing to do with shapes or schemas
Then add a separate star for that.
My point is not the order, but to extend (not replace): I think it is wrong to ditch any of the original stars, and I think it is better to promote single add-on stars than a fully rewritten set of stars.
Adding a 3.5 star would be confusing, and then the 4 and 5 stars would no longer match what they represent in Tim's list (namely all RDF, instead of only properly typed and resolvable RDF).
And adding a 6th star just doesn't make intuitive sense; stars mostly go from 0 to 5.
Extending leads to a confusing list and an unclear message. I just want a one-page article that helps people get an intuition for data reusability, and I think the original 5-star list is an interesting inspiration for that.