Profile/Module - RO-Crate Convention to Include Schema and Metadata
This is now published under a permanent URL. Future versions will be published in the same repository.
Index:
- Version
- Definitions
- Goals
- Technologies and Usage
- Schema Representation
  - RDFS Class
  - RDFS Property
- Metadata Representation
  - RDF Metadata Entry
- Schema Representation
  - Reference Examples
- API
  - Schema Representation DTOs
  - Metadata Representation DTOs
  - Additional RO-Crate API Methods
  - API Reference Implementation in Java
  - API Reference Examples in Java
- Ongoing Work
- Possible Future Directions
- People
Version
0.1.0, initial version, compatible with RO-Crate 1.1
Definitions
We use the following definitions in our proposal.
- Schema: A logical design that defines the structure, organization and relationship between data.
- Metadata: data of a database adhering to the schema.
- Ontology: A set of concepts and the relationships between these concepts.
Goals
This proposal SHOULD allow the means to exchange a database schema and database contents in a standardized way.
As a consequence, integrations SHOULD NOT need to parse individual files in non-standardized formats anymore to obtain such information but MAY use the RO-Crate API for that purpose.
Since the goal is that multiple established systems can adhere to it, this poses the additional problem that there are multiple schemas in use for similar concepts. To address this, we propose a way to annotate our schemas with ontological information. The ontologies allow identification of shared concepts. Knowing which concepts are shared makes it easier to integrate different schemas.
Establishing such a format for interoperability would also benefit independent interoperability efforts, as they would be available for reuse in other interoperability projects.
This specification is made to be usable in RO-Crate 1.1. As such:
- It SHOULD NOT add new keywords.
- It SHOULD establish a convention that can be used by the RO-Crate API to read/write the information.
Technologies and Usage
- RDF: Resource Description Framework is a specification developed by the World Wide Web Consortium (W3C) to provide a framework for representing and exchanging data on the web in a structured way. RDF allows information to be described in terms of subject-predicate-object triples, which form a graph of interconnected data. RDF can be serialized in different formats, including JSON-LD as used by RO-Crate.
- RDFS: Resource Description Framework Schema is a specification developed by the World Wide Web Consortium (W3C) that extends RDF (Resource Description Framework). RDFS provides a way to define the structure and relationships of RDF data, allowing for the creation of vocabularies and the specification of classes, properties, and hierarchies in an RDF dataset.
- OWL: Web Ontology Language is a formal language used to define and represent ontologies on the web.
- XSD: XML Schema Definition is a language used to define the structure, content, and constraints of XML documents. It is used in this specification to express primitive types.
Schema Representation
Because the schema is graph-based, it can be easily integrated into the RO-Crate graph.
The schema could also be included in a separate file in a future version of this specification.
Ontologies are added using OWL's equivalentClass and equivalentProperty properties.
What are the advantages of this?
- the format is backward compatible
- this only uses features that RO-Crate already provides, no additional keywords are required
- Common format for export that prevents an n * (n - 1) integration situation
- Thorough description of metadata, better automated checking and read-in
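To make the n * (n - 1) argument concrete, here is a small sketch of the arithmetic. It assumes that, with a common format, each system needs exactly one exporter and one importer; the class and method names are hypothetical and only serve the illustration.

```java
// Sketch: compares the number of converters needed with and without a common exchange format.
class IntegrationCount {
    /** Point-to-point integration: every system needs a converter to every other system. */
    static long pairwiseConverters(long systems) {
        return systems * (systems - 1);
    }

    /** Assumption: with a common format, each system writes one exporter and one importer. */
    static long commonFormatConverters(long systems) {
        return 2 * systems;
    }

    public static void main(String[] args) {
        for (long n : new long[] {2, 5, 10}) {
            System.out.printf("n=%d pairwise=%d common=%d%n",
                    n, pairwiseConverters(n), commonFormatConverters(n));
        }
    }
}
```

Under this assumption the common format needs fewer converters once more than three systems take part.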
Formal description:
RO-Crate MUST include a graph description of the schema. This is expressed using 2 types:
- RDFS Class
- RDFS Property
RDFS Class
Based on RDFS classes, these can be used as objects and subjects of triples.
| Type/Property | Required? | Description |
|---|---|---|
| @id | MUST | ID of the entry |
| @type | MUST | Is rdfs:Class |
| owl:equivalentClass | MAY | Ontological annotation https://www.w3.org/TR/owl-ref/#equivalentClass-def |
| rdfs:subClassOf | MUST | Used to indicate inheritance. Each entry has to inherit from something, this can be a base type. https://www.w3.org/TR/rdf-schema/#ch_subclassof |
RDFS Property
RDFS Properties represent predicates in triples. They also specify which classes they can interact with.
| Type/Property | Required? | Description |
|---|---|---|
| @id | MUST | ID of the entry |
| @type | MUST | Is rdfs:Property |
| owl:equivalentProperty | MAY | Ontological annotation https://www.w3.org/TR/owl-ref/#equivalentProperty-def |
| schema:domainIncludes | MUST | Describes the possible types of the subject. This can be one or many. |
| schema:rangeIncludes | MUST | Describes the possible types of the object. This can be one or many. |
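As a rough illustration of the two tables above, the following sketch builds one class entry and one property entry as plain maps that mirror the listed keys. The IDs (Publication, Person, hasAuthor), the base type schema:Thing, the annotation IRI, and the choice to encode references as {"@id": ...} nodes are assumptions made for the example, not something this specification prescribes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SchemaGraphSketch {
    /** Wraps an ID as a reference node (assumed encoding of links in the graph). */
    static Map<String, Object> ref(String id) {
        Map<String, Object> node = new LinkedHashMap<>();
        node.put("@id", id);
        return node;
    }

    static List<Map<String, Object>> refs(List<String> ids) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (String id : ids) {
            result.add(ref(id));
        }
        return result;
    }

    /** Builds an RDFS Class entry; equivalentClassIri may be null because the annotation is optional (MAY). */
    static Map<String, Object> rdfsClass(String id, String parentId, String equivalentClassIri) {
        Map<String, Object> node = new LinkedHashMap<>();
        node.put("@id", id);
        node.put("@type", "rdfs:Class");
        node.put("rdfs:subClassOf", ref(parentId));
        if (equivalentClassIri != null) {
            node.put("owl:equivalentClass", ref(equivalentClassIri));
        }
        return node;
    }

    /** Builds an RDFS Property entry with the type name given in the table above. */
    static Map<String, Object> rdfsProperty(String id, List<String> domain, List<String> range) {
        Map<String, Object> node = new LinkedHashMap<>();
        node.put("@id", id);
        node.put("@type", "rdfs:Property");
        node.put("schema:domainIncludes", refs(domain));
        node.put("schema:rangeIncludes", refs(range));
        return node;
    }

    public static void main(String[] args) {
        // Hypothetical example data.
        System.out.println(rdfsClass("Publication", "schema:Thing", "https://schema.org/ScholarlyArticle"));
        System.out.println(rdfsProperty("hasAuthor", List.of("Publication"), List.of("Person")));
    }
}
```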
Metadata Representation
Formal description:
RO-Crate MUST include a graph description of the metadata entries. This is expressed using 1 type:
- Metadata Entry
RDF Metadata Entry
A metadata entry, described by an RDFS class.
| Type/Property | Required? | Description |
|---|---|---|
| @id | MUST | ID of the entry |
| @type | MUST | Type of the entry, MUST be an RDFS Class |
Further properties are included as specified in the RDFS description as fields.
Reference Examples for both Schema and Entries
We created a small example. It can be found under: ./examples/ro-crate-1.1/ro-crate-metadata/ro-crate-metadata.json.
This describes the export of ./examples/reference-openbis-export.
API
Formal description:
To be general, the API uses a lot of strings. This allows flexibility in the classes being used.
The interfaces are shown in Java since it is a statically typed language, but they can be implemented in most languages, including Python and JavaScript.
Schema Representation DTOs
import java.util.List;

/* Represents a class. In schema terms, it is closely related to the definition of a table or type. */
interface IType
{
    /* Returns the ID of this type */
    String getId();

    /* Returns the IDs of the types this type inherits from */
    List<String> getSubClassOf();

    /* Returns the ontological annotations of this type */
    List<String> getOntologicalAnnotations();
}

/* Represents a property in a graph. In schema terms, it is closely related to a table column or type property. */
interface IPropertyType
{
    /* Returns the ID of this property type */
    String getId();

    /* Returns the possible types for the subject of this property type */
    List<String> getDomain();

    /* Returns the possible types for the object of this property type */
    List<String> getRange();

    /* Returns the ontological annotations of this property type */
    List<String> getOntologicalAnnotations();
}
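The ontological annotations are what allow two systems to find shared concepts, as described in the Goals section. The sketch below shows one possible way to do that: it pairs up types from two schemas whenever they share at least one annotation IRI. TypeInfo is a hypothetical stand-in for IType reduced to the fields used here, and the type names and IRIs are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SharedConceptSketch {
    /** Hypothetical stand-in for IType, reduced to the ID and the ontological annotations. */
    static final class TypeInfo {
        final String id;
        final List<String> ontologicalAnnotations;

        TypeInfo(String id, List<String> ontologicalAnnotations) {
            this.id = id;
            this.ontologicalAnnotations = ontologicalAnnotations;
        }
    }

    /** Returns "idA <-> idB" for every pair of types that share at least one annotation IRI. */
    static List<String> sharedConcepts(List<TypeInfo> schemaA, List<TypeInfo> schemaB) {
        List<String> matches = new ArrayList<>();
        for (TypeInfo a : schemaA) {
            for (TypeInfo b : schemaB) {
                if (!Collections.disjoint(a.ontologicalAnnotations, b.ontologicalAnnotations)) {
                    matches.add(a.id + " <-> " + b.id);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Two systems name the same concept differently but annotate it with the same IRI.
        List<TypeInfo> systemA = List.of(
                new TypeInfo("Publication", List.of("https://schema.org/ScholarlyArticle")));
        List<TypeInfo> systemB = List.of(
                new TypeInfo("Paper", List.of("https://schema.org/ScholarlyArticle")),
                new TypeInfo("Sample", List.of()));
        System.out.println(sharedConcepts(systemA, systemB));
    }
}
```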
Metadata Representation DTOs
import java.io.Serializable;
import java.util.List;
import java.util.Map;

/* Represents a metadata entry. It is described by an RDFS class. */
interface IMetadataEntry
{
    /* Returns the ID of this entry */
    String getId();

    /* Returns the type ID of this entry */
    String getClassId();

    /* These are key-value pairs for serialization. These are single-valued.
     * Serializable classes are: String, Number and Boolean */
    Map<String, Serializable> getValues();

    /* These are references to other objects in the graph.
     * Each key may have one or more references */
    Map<String, List<String>> getReferences();
}
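The split between getValues() and getReferences() can be illustrated with a small sketch that sorts the fields of a graph node into the two maps. It assumes that references appear as {"@id": ...} nodes, either alone or in a list, and that every other field is a String, Number or Boolean; EntrySplitSketch and the sample field names are hypothetical.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EntrySplitSketch {
    final Map<String, Serializable> values = new LinkedHashMap<>();
    final Map<String, List<String>> references = new LinkedHashMap<>();

    /** Sorts the fields of a graph node into single-valued values and multi-valued references. */
    static EntrySplitSketch fromNode(Map<String, Object> node) {
        EntrySplitSketch entry = new EntrySplitSketch();
        for (Map.Entry<String, Object> field : node.entrySet()) {
            String key = field.getKey();
            Object value = field.getValue();
            if (key.startsWith("@")) {
                continue; // @id and @type would be exposed through getId() and getClassId()
            }
            if (value instanceof String || value instanceof Number || value instanceof Boolean) {
                entry.values.put(key, (Serializable) value);
            } else if (value instanceof Map) {
                entry.references.computeIfAbsent(key, k -> new ArrayList<>()).add(idOf(value));
            } else if (value instanceof List) {
                // Assumption: lists only contain reference nodes.
                for (Object item : (List<?>) value) {
                    entry.references.computeIfAbsent(key, k -> new ArrayList<>()).add(idOf(item));
                }
            }
        }
        return entry;
    }

    /** Extracts the target ID from a reference node of the form {"@id": ...}. */
    private static String idOf(Object reference) {
        return String.valueOf(((Map<?, ?>) reference).get("@id"));
    }
}
```

Keeping references as plain ID strings means an entry can be read without resolving the objects it points to; a caller can look those up separately by ID.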
Additional RO-Crate API Methods
import java.util.List;

/* The API to program against. It wraps around existing RO-Crate APIs. */
interface ISchemaFacade
{
    /* Adds a single class */
    void addType(IType rdfsClass);

    /* Gets all classes */
    List<IType> getTypes();

    /* Gets a single type by its ID */
    IType getType(String id);

    /* Adds a single property */
    void addPropertyType(IPropertyType property);

    /* Gets all properties */
    List<IPropertyType> getPropertyTypes();

    /* Gets a single property by its ID */
    IPropertyType getPropertyType(String id);

    /* Adds a single metadata entry */
    void addEntry(IMetadataEntry entry);

    /* Gets a single metadata entry by its ID */
    IMetadataEntry getEntry(String id);

    /* Gets all metadata entries of the given class */
    List<IMetadataEntry> getEntries(String rdfsClassId);
}
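To give an idea of how a facade could behave, here is a reduced in-memory sketch that covers only the type and entry methods. It is not the reference implementation: TypeSketch and EntrySketch are hypothetical stand-ins for the DTO interfaces, and rejecting entries whose class has not been added first is a design choice made for this example, following the rule that the @type of an entry MUST be an RDFS Class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the DTO interfaces, reduced to the getters used here.
interface TypeSketch { String getId(); }
interface EntrySketch { String getId(); String getClassId(); }

class InMemoryFacadeSketch {
    private final Map<String, TypeSketch> types = new LinkedHashMap<>();
    private final Map<String, EntrySketch> entries = new LinkedHashMap<>();

    void addType(TypeSketch type) { types.put(type.getId(), type); }

    TypeSketch getType(String id) { return types.get(id); }

    /** Adds an entry; rejects it if its class is unknown (a choice made for this sketch). */
    void addEntry(EntrySketch entry) {
        if (!types.containsKey(entry.getClassId())) {
            throw new IllegalArgumentException("Unknown class: " + entry.getClassId());
        }
        entries.put(entry.getId(), entry);
    }

    EntrySketch getEntry(String id) { return entries.get(id); }

    /** Returns all entries whose class ID equals rdfsClassId, in insertion order. */
    List<EntrySketch> getEntries(String rdfsClassId) {
        List<EntrySketch> result = new ArrayList<>();
        for (EntrySketch entry : entries.values()) {
            if (entry.getClassId().equals(rdfsClassId)) {
                result.add(entry);
            }
        }
        return result;
    }

    /** Convenience factory for the example. */
    static EntrySketch entry(String id, String classId) {
        return new EntrySketch() {
            public String getId() { return id; }
            public String getClassId() { return classId; }
        };
    }

    public static void main(String[] args) {
        InMemoryFacadeSketch facade = new InMemoryFacadeSketch();
        facade.addType(() -> "Publication");
        facade.addEntry(entry("PUBLICATION1", "Publication"));
        System.out.println(facade.getEntries("Publication").size());
    }
}
```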
API Reference Implementation in Java
A working implementation of the API for Java (source and compiled) can be found under: ./lib/src.
A compiled jar can be found under: ./lib/java/bin.
The dependencies are specified in the module's build.gradle file: ./lib/java/src/build.gradle.
API Reference Examples in Java
Working examples of the API in Java to read and write can be found at: ./, specifically the class files:
- ./lib/java/src/java/ch/eth/sis/rocrate/example/ReadExample.java
- ./lib/java/src/java/ch/eth/sis/rocrate/example/WriteExample.java
Ongoing Work
- Adding complex data types
- Using rdfs:label to indicate the original name of a property (this could also help in resolving properties with the same name)
- Validation of data types expressed in the schema, e.g. enforcing ISO 8601 for dates
- Bundling ontologies in the RO-Crate
- Find a way of specifying other data formats
Possible Future Directions
- We would like to store the schema and metadata information in separate files and indicate the format of the file in ro-crate-metadata.json
- Other serialization formats could be supported when using separate files
- Adding delete methods to the facade to cover all CRUD operations
People
- Andreas Meier
- Juan Fuentes
Thanks for this -- I don't think the name of this issue really captures what you are proposing, but to be honest I'm not clear on what the point of this proposal is. Could you phrase this in the form "As a $type-of-user I want to $do-something"?
I gather this is about specifying the schema for tabular data in a format-independent way -- if this is correct then it would be good to consider how this differs from previous work like CSV for the web and other approaches.
Consider also if this could be a profile of RO-Crate? Does RO-Crate need to change to do this?
Dear Peter, thank you for your comment.
You got the idea of specifying the schema and metadata correctly.
Both CSV for the web and RDF are currently used to exchange this type of information. Our choice is RDF. Ultimately, the format is not as important as the community deciding to support one, with the libraries abstracting the format anyway.
We are totally with you; we have overlooked the possibility of publishing it as a profile. We will do it.
Can you create an example ro-crate meta file with some research example? Some people work best with examples and it makes the suggestion concrete.
@simontaurus: what is your take on this idea?
I agree with @SteffenBrinckmann, this needs examples of how a database schema is to be represented in RO-Crate. Assuming the point of this is to describe the schema of some Files that have been included in a package, I would like to see an example which describes a schema and links to the file, e.g. with a property conformsTo on the File.
I did manage to find an example but it does not have any File entities in it and there are a few issues with the RO-Crate in that example. Amongst other problems it uses hasParts rather than hasPart and does not have the name and datePublished properties that are required on the Root Data Entity. There are also several other undefined properties like hasChildren and some odd-looking path-like @id values starting with : (:/JOHN/JOHN/ENTRY2) which I guess have some relationship to the tabular structure.
As I mentioned in my previous comment, I think if you want to represent schemas for tabular data you should look at CSV for the Web (CSVW), which has a standard way of representing columns in tabular data in RDF https://w3c.github.io/csvw/syntax/ and which, I think, has some of the conventions you need for representing columns in tables etc.
See also all the ideas in this long-running (unresolved) issue https://github.com/ResearchObject/ro-crate/issues/27
Using RDFS Class and RDFS Property to create a data schema is pretty close to an OWL Ontology. In this case I would suggest simply publishing the ontology/schema under its IRI so that https://my-ontoloy-schema.org#SomeClass resolves to SomeClass with HTTP content negotiation.
Within RO-Crate one can simply use additional values for @type to express the schema class for a node in the @graph, e.g. @type: ["Dataset", "https://my-ontoloy-schema.org#SomeClass"]
However, since many struggle with RDF schemas, my impression is that it works better to provide object-oriented linked data schemas as profiles that define nodes in the RO-Crate @graph and map them directly to (data) classes in object-oriented programming (including Java and Python) or to document schemas for a document store.
This can be done with something like LinkML or directly with the industry standards JSON-SCHEMA + JSON-LD (see https://github.com/OO-LD/schema).
Besides guiding the user better, this also enables data validation and UI generation based on the same declarative language.
We are working on some first examples, see https://github.com/TheELNConsortium/TheELNFileFormat/issues/67#issuecomment-2409001970
@SteffenBrinckmann We updated our example to use less abstract data. We now have a file describing three publications and their authors and publishers. https://sissource.ethz.ch/sispub/ro-crate/-/blob/main/interoperability/0.1.x/examples/ro-crate-1.1/ro-crate-metadata/ro-crate-metadata.json?ref_type=heads. We hope this gives a clearer picture of the format.
@AndreasMeier12 THANKS, interesting. Sorry to keep pushing:
- you have metadata inside but not yet any data file. Can you include a data file, like a two row csv file?
The file looks in many ways like the conventional metadata.json file; it only uses different keys (RDF, it seems). But other than the different names, what else was missing in the conventional metadata.json file?
Concerning the examples: Hi @SteffenBrinckmann,
I believe the example provided by @AndreasMeier12 primarily focuses on metadata exchange, without incorporating any actual data, datasets, or similar content.
Traditionally, ro crate metadata files do not include a mechanism for defining a type within the file itself. While one might argue that referencing a type or ontology via the namespace could serve this purpose, this approach is not consistently applied and would additionally require a dedicated HTTP endpoint to serve such files. Embedding this information directly within the RO-Crate metadata file offers the advantage of reducing dependency on external services and ensures clarity regarding the meaning of those fields at the time the RO-Crate was created, particularly since ontology links are often subject to link rot.
Regarding the serialization format: I am personally open to exploring alternative serialization formats, such as those suggested by @ptsefton and @simontaurus, provided they offer the necessary feature set and are practical to support and maintain.
We are open to including an additional field to indicate the serialization format, provided there is interest and commitment from someone willing to contribute to the implementation of alternative formats.
In my view, the choice of serialization format is of secondary importance, as long as interoperability is achieved through the API and the use of semantic annotations.
I am still confused about the intent of this profile. It seems to be about database schemas ("This proposal SHOULD allow the means to exchange a database schema and database contents in a standardized way."), which is something that is handled by the W3C work on tabular data formats, but the examples are about adding class and property definitions to RO-Crates, which is already covered in the spec. See this section in the draft 1.2 spec on adding extra definitions: https://www.researchobject.org/ro-crate/specification/1.2-DRAFT/appendix/jsonld#extending-ro-crate
We have been using this approach for some time with our work at the Language Data Commons of Australia. Here's an RO-Crate metadata document with lots of definitions in it: https://github.com/Language-Research-Technology/language-data-commons-vocabs/blob/master/ontology/ro-crate-metadata.json
We have a "Mode file" which implements an implicit profile to allow people to edit these kinds of Schema.org-style schemas, which is built in to Crate-O. Have a look at Crate-O using Chrome or MS Edge and you can see that it has built-in support for editing schemas: https://language-research-technology.github.io/crate-o/#/
Re the example here: https://sissource.ethz.ch/sispub/ro-crate/-/blob/main/interoperability/0.1.x/examples/ro-crate-1.1/ro-crate-metadata/ro-crate-metadata.json?ref_type=heads. This is still not a real example and I am not sure why you chose this domain, as Schema.org already does a good job of describing publications, with various types such as ScholarlyArticle - with appropriate properties - the whole point of RO-Crate is to encourage people to USE this rather than making up their own vocabularies.
RE the comments about an API -- RO-Crate is NOT an API specification and profiles are typically descriptions of the interchange format.
Dear @ptsefton, I wish you were there during our first talks last year with @elichad.
Background:
To give you a bit of background on our use case, we currently maintain a system being used by at least 500 research laboratories worldwide; we maintain around 100 of them.
These researchers may want to exchange metadata between systems (e.g., ELNs and data repositories).
Initial Analysis:
During our discussions, we identified that currently there is no good way to extend the context to include additional definitions of metadata dynamically.
The proposal for the 1.2 DRAFT and what you have been doing rely on the existence of links to external definitions.
Our Requirements:
Furthermore, URLs in the context are not necessarily machine-actionable. The definitions should be able to validate the data being exchanged, and should not rely on a common schema between the systems exchanging data but rather on the semantic annotations to extract meaning.
I hope this brings some light on how this specification came to be.
Mode file / Implicit profile
I'm not sure this can fulfill our use case, but we are more than happy to develop our specification further and use any kind of serialization that could fit better within the RO-Crate standard.
I feel your experience with RO-Crate would really help us. Would you be open to joining us during a call so we can discuss further?
Would the next RO-Crate drop-in on 1st May (Europe) work for you?
@SteffenBrinckmann We are currently focusing on the exchange of metadata, therefore we do not include datasets in the examples. The machine-readable schema was not yet part of the metadata JSON. Furthermore, metadata that would have been exchanged as files, e.g. CSV or RDF, could also be exchanged this way. Information exchanged this way can be automatically processed using the API.
I thought I replied to this -- yes I am able to come. I am in the European drop-in meeting now! @juan-fuentes-sis
> Initial Analysis: During our discussions, we identified that currently there is no good way to extend the context to include additional definitions of metadata dynamically. The proposal for the 1.2 DRAFT and what you have been doing rely on the existence of links to external definitions.
>
> Our Requirements: Furthermore, URLs on the context are not necessarily machine-actionable; The definitions should be able to validate the data being exchanged and not rely on a common schema between the systems exchanging data, but rather on the semantic annotations to extract meaning.

RDF is the common data schema of all applications that support RO-Crate, since RO-Crate is based on JSON-LD, which is a serialization of RDF. Also, JSON-LD already defines a syntax both for adding new inline terms to an existing base context (as the example of ptsefton shows) and for referring to additional external context documents (see https://www.w3.org/TR/json-ld/#the-context).
Sadly I saw this too late to join @ptsefton, as I was taking advantage of our holiday in Basel. The next Europe drop-in is on the 21st. We can wait for the 21st, or I can just provide an open Zoom link to discuss this earlier that anyone interested can join. @ptsefton Do you have any slot this or next week? What about this Wednesday or Friday?
@simontaurus taking the example of your link:
It indicates that for the sample document:
{
  "http://schema.org/name": "Manu Sporny",
  "http://schema.org/url": {
    "@id": "http://manu.sporny.org/"
    ↑ The '@id' keyword means 'This value is an identifier that is an IRI'
  },
  "http://schema.org/image": {
    "@id": "http://manu.sporny.org/images/manu.png"
  }
}
Specifies this context:
{
  "@context": {
    "name": "http://schema.org/name",
    ↑ This means that 'name' is shorthand for 'http://schema.org/name'
    "image": {
      "@id": "http://schema.org/image",
      ↑ This means that 'image' is shorthand for 'http://schema.org/image'
      "@type": "@id"
      ↑ This means that a string value associated with 'image'
        should be interpreted as an identifier that is an IRI
    },
    "homepage": {
      "@id": "http://schema.org/url",
      ↑ This means that 'homepage' is shorthand for 'http://schema.org/url'
      "@type": "@id"
      ↑ This means that a string value associated with 'homepage'
        should be interpreted as an identifier that is an IRI
    }
  }
}
The first drawback I see is that JSON-LD on its own does not define validation rules like "mandatory properties" or "data types" in a strict schema sense. Instead, it focuses on linking data semantically using @context. With this approach we don't know to what Type/Class those properties belong when reading the context.
In other words we end up with an incomplete schema definition.
Do we all acknowledge this? Can we discuss what features we are missing and how best to provide them?
@juan-fuentes-sis
> The first drawback I see is that JSON-LD on its own does not define validation rules like "mandatory properties" or "data types" in a strict schema sense.

That's correct, JSON-LD only provides a mapping / annotation. Regarding data types, you can use @type to indicate data types within or beyond the JSON primitives, e.g.
"my_number": {
  "@id": "ex:hasNumber",
  "@type": "xsd:int"
}
but there's no validation.
However, there's JSON-SCHEMA to exactly fill this gap. In order to bundle both schema and context in a single document we drafted a meta-standard, see OO-LD
@juan-fuentes-sis I can't do the 21st as that instance of the RO-Crate drop in is not practical for me. The first Thursday in June I can be at the drop in.
I am exploring the idea of extending the current RO-Crate approach of using Schema.org Schemas (rdf:Property and rdfs:Class definitions and schema:DefinedTerm) with additional properties and concepts from SHACL (eg sh:minCount 1 for properties) -- the goal is to create something that is powerful enough to specify RO-Crate profiles for both validation and generation (eg driving an editor like Crate-O) and still be able to be shipped as an RO-Crate in flattened JSON-LD.
I should have some progress to report on this in June.
@ptsefton we can delay our meeting to June then.
Your extensions overlap with what we are trying to achieve. We are currently reworking the specification so we can use other standards for writing schemas.
On the other hand, if we could add some properties to the current context we could achieve the same, but this seems not to be a valid JSON-LD context. Maybe the validation tool we used doesn't allow any unknown properties; we are not sure whether this would actually be valid.
Would it not make sense that you join our efforts and contribute to this module so it can be added eventually into the standard?
@juan-fuentes-sis I am still finding it hard to understand your profile document as it is a mixture of a profile as we would expect to see for RO-Crate but the text includes an API definition which would not be part of an RO-Crate profile and there are no small examples in the profile document to explain your approach.
A couple of things you might want to clear up:
- The link above goes to some kind of code management system online and the link in the spec goes to v1.1 - though I see there is a v.1.2
- You mention databases and database schemas at the top -- I initially assumed this spec was for tabular data but I still don't have any idea about what the data you are describing looks like
- The example seems to be describing a vocab that is very similar to stuff you can do in Schema.org - I think it would be good to explain why you need to do this rather than just using the standard RO-Crate approach
- You mention RDFS:Property; I'm not sure if that exists, but the current RO-Crate approach uses rdf:Property and rdfs:Class, see https://en.wikipedia.org/wiki/RDF_Schema
- You are using local IDs that do not start with a # like "@id" : "PUBLICATION_PUBLISHER"; this probably should be "#PUBLICATION_PUBLISHER"
Overall, if I am understanding things I think the approach you are taking has some interesting aspects but it is going to result in extremely verbose crates which will make it complex to write software for validation.
@ptsefton We are going to release version 0.3, and following your suggestion, we will:
- Strip the API out of the spec.
- Include a longer explanation of the motivations.
- Provide better examples.
Ultimately, we want to be able to serialize any kind of schema or data—not just those based on relational databases.
We are still exploring different ways to encode this. First, to include it in ro-crate-metadata.json and make it part of the spec. Second, we are also considering the possibility of using external files in other standard formats. You mentioned CSVW, which so far seems to be the most well-standardized format to me.
When looking at encoding a schema and its associated entities as RDF/JSON-LD (your currently mentioned approach), what we did in our 0.1–0.2 spec, OOLD, Protégé, Apache Jena, etc.—all of them approach this slightly differently. As a result, what you write using a tool made for one is not necessarily compatible with another.
So, we need to reach a consensus. Is there a way to do this within the community, or is it up to the steering board?
In any case, it’s important for us to reach agreement. Even as an optional module for 2.0.
If this becomes part of the main RO-Crate standard, we’re willing to extend both the Java and Python RO-Crate libraries to support it.
I would like to be clear here that at this stage of development this profile/proposal is not something that you should be considering as a pitch to become part of RO-Crate. I suggest that you develop the tools and profiles you need for your use case in a prototype and then present that so that the RO-Crate team can see what it is you are doing. I still don't understand the context of this proposal as the description at the top is still very vague and does not describe a concrete use case. If it is to do with describing database schemas then defining Classes and Properties is only part of the job - you need to map to table and column descriptions, all of this requires both small explanatory examples and complete RO-Crates with real data that make it clear WHY you are proposing this.
Meanwhile, as I said I'm working on a profile which extends the existing RO-Crate practice of using rdf:Property and rdfs:Class to the point it can be used for validation -- using these entities in an RO-Crate graph does not require a spec change anyway and it looks like what you're doing is on the same track. We can discuss in June.
A new version of the spec has been added to the issue body.
Dear @ptsefton, @simontaurus and @SteffenBrinckmann ,
Following some of your recommendations:
- The Motivation and Goals sections aim to clarify the use case, outline the requirements, and explain why the current specification is insufficient.
- This new draft is designed to be flexible, allowing for multiple formats to achieve the intended goals, which may help address use cases from other parts of the community.
- It also supports the use of external files to avoid inflating the ro-crate-metadata.json file.
- Examples are included directly in the proposal.
- Discussions about an API have been intentionally excluded to maintain focus.
Looking forward to hearing from you soon.
I'm closing this issue - this is not an appropriate use of the RO-Crate issues.
Interested parties should follow along at the repository where it is being maintained.
https://sissource.ethz.ch/sispub/ro-crate/-/tree/main/interoperability?ref_type=heads
I think it is important to have a place to communicate with the broader community, @ptsefton, where everyone can comment; sadly we don't have GitHub for this.
I would like this ticket to be reopened; closing it can send a negative message.
Ok @juan-fuentes-sis I have re-opened this. I think that if you need a venue to discuss it you could set up a repository on github - you have an account.
The scope of what you are trying to do is very large. It now appears you are trying to invent a general-purpose ontology to exchange information between arbitrary IT systems, and it is still not clear to me why this interchange is necessary or what the actual data looks like - and the example "Person" is something we would model in RO-Crate using schema:Person. If I am reading this correctly, this profile is well out of scope for inclusion in RO-Crate - hence why I don't think this is the place to discuss it.
I will see you at the drop in in June but I don't have time to keep monitoring this issue.
Here's the work I have been doing in this space: https://github.com/Language-Research-Technology/ro-crate-schema-tools/blob/sossplus/profiles/sossplus/sossplus-profile.md
Just a note for those who are considering attending a drop-in to discuss this thread: the next European drop-in will take place one week later than usual, on 12 June at 09:00 UTC (immediately after the RO-Crate community call). This is because both Stian and I are unavailable on the usual date.