Use wikidata to provide skos:definition to owl:Class'es

Open lewismc opened this issue 6 years ago • 71 comments

Building on from #20 this issue simply aims to provide rdfs:comment (and/or skos:definition or dct:description) text to all terms. Open tasks involve us collectively agreeing upon which vocabulary we wish to use e.g. rdfs:comment (and/or skos:definition or dct:description) and additionally whether we manually curate the comments or else automate this by fetching them from wikipedia/dbpedia/dictionary or elsewhere.

Any comments here?

lewismc avatar Jul 16 '19 18:07 lewismc

I have candidate term definitions for ~2K SWEET terms/classes, pulled from Earth science glossaries, that we can sort through, although I'm not sure of the best way to do that at present.

brandonnodnarb avatar Jul 16 '19 18:07 brandonnodnarb

Excellent :)

@brandonnodnarb where do they exist? Do you have them in electronic format somewhere?

At lunch, @dr-shorthair and I were discussing possibly just providing a dct:description (although that would introduce a brand new namespace into SWEET) which is essentially a link to an alternate, maintained description which exists elsewhere e.g. DBPedia, ENVO, .... The keyword here is maintained. I think it would be a bad decision right now for us to go ahead and implement a whole bunch of descriptions which exist solely within SWEET. On the other hand, if they do link to other, better defined, maintained descriptions then it would make sense to link to them.

Any comments @brandonnodnarb ?

lewismc avatar Jul 16 '19 19:07 lewismc

What about definitions for terms that are defined in other ontologies (notably ENVO)? There are many ENVO terms that now use the GCW terminology definitions (though that document isn’t published yet). It would be good to reference them directly, rather than reinvent the wheel.

Ruth

rduerr avatar Jul 16 '19 20:07 rduerr

I completely agree @rduerr

lewismc avatar Jul 16 '19 20:07 lewismc

@lewismc these are in a spreadsheet. I'll see if I can clean it up and post it somewhere for review.

Also, I didn't think of this until you mentioned it, but another option could be to push these things to wikipedia/dbpedia, or make sure they are included and cited (and maintained) there. Hmmm...let me think about this a bit.

brandonnodnarb avatar Jul 17 '19 08:07 brandonnodnarb

Yes, ideally we could even get to this on Thursday as well. I think pushing to DBPedia would be an excellent idea. It would be excellent for us to re-use and/or make as much of this as possible available to the wider audience. As this is a pretty large task, the best way may in fact be the easiest way, e.g. automating pulling comments from DBPedia. A very simple example SPARQL query follows:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
select distinct ?comment where {
  <http://dbpedia.org/resource/Gazetteer> rdfs:comment ?comment .
  FILTER (lang(?comment) = 'en')
}

Of course, we would merely substitute the subject IRI with whatever term we get from ESIP and then experiment with FILTER regex or other functions. Thoughts?
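The substitution described above could be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions: the function names `build_comment_query` and `build_request_url` are hypothetical, and the public endpoint `https://dbpedia.org/sparql` and its `format` parameter are assumed rather than taken from this thread. Nothing is sent over the network; the code only builds the query text and a request URL.

```python
from urllib.parse import urlencode

# Assumed public DBpedia SPARQL endpoint; swap in a mirror if needed.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_comment_query(resource_iri: str, lang: str = "en") -> str:
    """Return a SPARQL query for the rdfs:comment of one resource (hypothetical helper)."""
    # Reject characters that may not appear inside a SPARQL IRI reference.
    if any(ch in resource_iri for ch in '<>"{}|^`\\ '):
        raise ValueError(f"unsafe IRI: {resource_iri!r}")
    # Language tags are letters, digits and hyphens only.
    if not lang.replace("-", "").isalnum():
        raise ValueError(f"unsafe language tag: {lang!r}")
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?comment WHERE {\n"
        f"  <{resource_iri}> rdfs:comment ?comment .\n"
        f"  FILTER (lang(?comment) = '{lang}')\n"
        "}"
    )


def build_request_url(resource_iri: str) -> str:
    """Return a GET URL asking the endpoint for JSON results (assumed parameter names)."""
    params = {
        "query": build_comment_query(resource_iri),
        "format": "application/sparql-results+json",
    }
    return f"{DBPEDIA_ENDPOINT}?{urlencode(params)}"


query = build_comment_query("http://dbpedia.org/resource/Gazetteer")
print(query)
```

Calling `build_comment_query` once per SWEET term would give one query per candidate DBPedia resource; how SWEET terms are mapped to DBPedia resource IRIs is left open here, as it is in the discussion.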

lewismc avatar Jul 17 '19 15:07 lewismc

From a usability standpoint, an embedded description is much nicer (because it is there in front of you), and a little more confidence-inducing, because (a) it implies the author of the ontology (SWEET) vouches for it, (b) it is likely to be coherent with the purposes of the ontology, and (c) it is unlikely to drift without explicit reason. Add to that the opportunity for providing definitions that specifically disambiguate the term from its siblings, and fill the term space.

Some of these sources may achieve some of these goals. @lewismc, is your particular proposal to find the comments and 'bring them in', or simply to reference them in their original location? If the former, will the process be re-run every few years, or will we freeze this moment in our definitions?

Is it worth considering both a local copy of the definition and a reference to the source comment?

I'll be OK with whatever approach y'all think is reasonable and achievable. If there is decision-making involved, an ideal presentation would be a Google spreadsheet with the SWEET ontology name, term name, label, and external descriptions from whatever sources we are considering. That would make it easy to review the whole set at once as well as comment or vote on sources for particular terms, should it come to that.
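A review sheet along these lines could be produced as a CSV file and then imported into a spreadsheet. The following is a minimal sketch only: the column names, the helper `write_review_sheet`, and the sample row (including its shortened description text) are illustrative assumptions, not an agreed layout.

```python
import csv
import io

# Column layout loosely follows the suggestion above; the names are illustrative.
FIELDNAMES = ["ontology", "term", "label", "source", "external_description"]


def write_review_sheet(rows, out):
    """Write one CSV row per (term, candidate source) pair (hypothetical helper)."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Placeholder candidate; the description is shortened for illustration.
candidates = [
    {
        "ontology": "reprDataProduct",
        "term": "Dataset",
        "label": "dataset",
        "source": "http://dbpedia.org/resource/Data_set",
        "external_description": "A data set is a collection of data.",
    },
]

buffer = io.StringIO()
write_review_sheet(candidates, buffer)
print(buffer.getvalue())
```

Using one row per term and source, rather than one row per term, keeps several competing descriptions for the same term side by side, which seems to suit the commenting or voting described above.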

graybeal avatar Jul 17 '19 17:07 graybeal

Hi @graybeal excellent questions, thanks for jumping in. You make some good points which I appreciate.

is your particular proposal to find the comments and 'bring them in',

If we did this, we would be essentially duplicating the content (and it would be appropriate to use one of rdfs:comment, skos:definition or dct:description). This is not to say that the things being represented are equals but merely that the way the thing is described is identical at that point in time. As you state, the actual literal values (from where they were acquired and where they exist within SWEET) will most likely diverge over time. Is this OK? It may be... but it may not be. More below...

or simply to reference them in their original location?

We could also look into using rdfs:seeAlso as a mechanism for addressing the above issue (to clarify, this issue is the divergence of the content we bring in and encode as either rdfs:comment, skos:definition or dct:description AND the canonical source from which we obtained the information e.g. that same rdfs:comment over at DBPedia), where rdfs:seeAlso would reference the original source e.g. http://dbpedia.org/resource/Jet_Propulsion_Laboratory from which the rdfs:comment literal was extracted and rdfs:comment would be the actual literal content. If this explanation is not clear then please let me know. An example would be as follows

Consider the following https://github.com/ESIPFed/sweet/blob/eb8106314e4df9aae9042b792cfff749290bf8f3/src/reprDataProduct.ttl#L84-L87 Once the above work was done it would look as follows

###  http://sweetontology.net/reprDataProduct/Dataset
dprepr:Dataset rdf:type owl:Class ;
               rdfs:subClassOf dprepr:DataProduct ;
               rdfs:label "dataset"@en ;
               rdfs:comment "A  data set (or dataset, although this spelling is not present in many  contemporary dictionaries) is a collection of data. Most commonly a data  set corresponds to the contents of a single database table, or a single  statistical data matrix, where every column of the table represents a  particular variable, and each row corresponds to a given member of the  data set in question. The data set lists values for each of the  variables, such as height and weight of an object, for each member of  the data set. Each value is known as a datum. The data set may comprise  data for one or more members, corresponding to the number of rows."@en ;
               rdfs:seeAlso <http://dbpedia.org/resource/Data_set> .
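If such annotations were generated rather than typed by hand, the literal would need escaping and the source IRI would need angle brackets, which Turtle requires. A minimal sketch, assuming hypothetical helpers `turtle_string` and `comment_with_source` and a shortened placeholder comment:

```python
def turtle_string(text: str) -> str:
    """Quote text as a Turtle string literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def comment_with_source(term: str, comment: str, source_iri: str) -> str:
    """Render rdfs:comment plus rdfs:seeAlso for one term (hypothetical helper)."""
    return (
        f"{term} rdfs:comment {turtle_string(comment)}@en ;\n"
        f"    rdfs:seeAlso <{source_iri}> ."
    )


# Placeholder text; a real run would use the literal fetched from the source.
snippet = comment_with_source(
    "dprepr:Dataset",
    'A "data set" is a collection of data.',
    "http://dbpedia.org/resource/Data_set",
)
print(snippet)
```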

If the former, will the process be re-run every few years, or will we freeze this moment in our definitions?

I'm not sure about this. We need to think it through.

lewismc avatar Jul 17 '19 18:07 lewismc

A bit confused by this discussion as I am used to ontologies where the definitions are authored by the developers of that ontology (sometimes adapted from an external source, with attribution). Randomly bringing in dictionary definitions could lead to incoherence, and how do we know the definitions reflect the intended meaning?

Yes, I have my handy blog post OntoTip entry for text definitions as well: https://douroucouli.wordpress.com/2019/07/08/ontotip-write-simple-concise-clear-operational-textual-definitions/

Regardless of who writes them and what pipeline you use, it's super-important to track provenance of definitions, e.g. via axiom annotation
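For readers unfamiliar with axiom annotation, the pattern reifies one annotation assertion as an `owl:Axiom` node and hangs the provenance on that node. The sketch below renders such a node as Turtle. It is a sketch under assumptions: the helper name `axiom_annotation` is hypothetical, and `dct:source` is just one possible provenance property (the thread does not settle on one).

```python
def axiom_annotation(subject: str, prop: str, literal: str, source_iri: str) -> str:
    """Render an OWL 2 axiom annotation recording where a definition came from.

    `literal` must already be a quoted Turtle literal. `dct:source` is an
    assumed choice of provenance property, not one agreed in this thread.
    """
    return (
        "[] rdf:type owl:Axiom ;\n"
        f"   owl:annotatedSource {subject} ;\n"
        f"   owl:annotatedProperty {prop} ;\n"
        f"   owl:annotatedTarget {literal} ;\n"
        f"   dct:source <{source_iri}> ."
    )


# Placeholder definition text for illustration only.
annotation = axiom_annotation(
    "dprepr:Dataset",
    "skos:definition",
    '"A data set is a collection of data."@en',
    "http://dbpedia.org/resource/Data_set",
)
print(annotation)
```

The annotated triple itself (here `dprepr:Dataset skos:definition "..."@en`) still has to be asserted alongside this node; the `owl:Axiom` block only attaches provenance to it.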

cmungall avatar Jul 18 '19 21:07 cmungall

OK, so who is the developer of SWEET these days? (Presumably the people who are currently maintaining it?)

And how does that developer now create appropriate definitions, if not by referencing existing expertise?

graybeal avatar Jul 19 '19 00:07 graybeal

OK, so for the developer question, I do think ENVO's micro-citation is useful. In other words, if I make a change to a term (any change) I annotate the change with my ORCID. I also like using DBXREF's to cite the original source of the definitions. I haven't looked at DBPedia; but perhaps that should be where I dump all the GCW terms and definitions?

While I do like having embedded definitions, I really hate the idea of having to update the same definition in more than one place. Would having them in DBPedia help with this problem?

Also, I note that all the cryospheric terms and definitions and sources for those can be provided as a CSV file if that is helpful.

Thoughts?

rduerr avatar Aug 12 '19 22:08 rduerr

@rduerr

Would having them in DBPedia help with this problem?

Yes it would, we would then look at the comment over in the DBPedia resource and determine whether we want a hard mapping.

@cmungall thanks for chiming in. I agree with @graybeal here in that the response to your statement

...where the definitions are authored by the developers of that ontology

That is essentially us. Raskin et al. never added simple labels or verbose descriptions so it is down to us to annotate and contextualize whatever we feel is necessary.

IMHO DBPedia is the best resource I've come across where we can leverage existing knowledge. We can even do this one Class at a time with one pull request. Then every one of the proposed augmentations could be scrutinized.

Does this sound logical or is it way off?

lewismc avatar Aug 15 '19 05:08 lewismc

I'm wrestling with implications here, mostly because these external definitions are not versioned, are they? So please pardon my TLDR comments.

If we embed (copy) a definition we are then claiming it as our own, and ours won't track any changes to the original source (which may be the best thing); or if it does track changes, we'll have an ongoing monitoring task. In any case yes we'll need to evaluate each one.

If we link to definitions that live elsewhere, we still have the monitoring issue (what if that definition changes enough to make it wrong for SWEET?). And if we make the link a hard one (along the lines of sameAs or exactMatch) we are effectively claiming it as our own, and therefore still have to track any changes made to the original to see if we agree. So we're effectively back at the first option.

I don't think we can support either of these approaches, even if we could create a great first version. And SWEET is not an authoritative real-world model that can be used for detailed reasoning about the world, and we can't pretend we will be able to come up with all-knowing definitions for these terms. It makes more sense to me to give people pointers to helpful information, and maintain SWEET as a relatively minimalist description of these earth science concepts.

So I think it would be best to have the definitions be notional, not authoritative. The relationship would then be 'notionallyDescribedBy', or better words to that effect, and there could be several of them, even with some contradictions between them. This best reflects the real world of SWEET in my opinion.

With that approach they could be either embedded (with the definitions sourced in the provenance, and updated automatically from the original content); or referenced remotely (though that makes SWEET less handy to use).

I'd prefer the embedded option, where multiple embedded definitions have been pulled from other sources (with date, source citation, and process citation). That follows best practices as far as I'm concerned.

graybeal avatar Aug 15 '19 06:08 graybeal

@pbuttigieg @cmungall Your take on this?

rduerr avatar Aug 15 '19 17:08 rduerr

How about

###  http://sweetontology.net/reprDataProduct/Dataset
dprepr:Dataset rdf:type owl:Class ;
               rdfs:subClassOf dprepr:DataProduct ;
               rdfs:label "dataset"@en ;
               skos:definition  [ 
                   rdfs:comment  "A  data set (or dataset, although this spelling is not present in many  contemporary dictionaries) is a collection of data. Most commonly a data  set corresponds to the contents of a single database table, or a single  statistical data matrix, where every column of the table represents a  particular variable, and each row corresponds to a given member of the  data set in question. The data set lists values for each of the  variables, such as height and weight of an object, for each member of  the data set. Each value is known as a datum. The data set may comprise  data for one or more members, corresponding to the number of rows."@en ;
                   dct:source <http://dbpedia.org/resource/Data_set> ;
                   dct:created "2019-08-16T11:35:21.06Z"^^xsd:dateTimeStamp ;
             ] .

The range of skos:definition is rdfs:Resource. This gets you the text locally along with the citation and the date it was copied. Of course the downside is that it's now a property path skos:definition/rdfs:comment rather than just a simple property, but the complexity is not more than the problem being solved.
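To make the cost of the extra hop concrete, here is a small pure-Python sketch that follows a two-step path over an in-memory list of triples. The blank-node label `_:def1`, the shortened definition text, and the helpers `objects` and `follow_path` are all illustrative assumptions; a real consumer would more likely use a SPARQL property path or an RDF library.

```python
# Toy triples mirroring the structured definition; `_:def1` stands in for
# the blank node and the definition text is shortened for illustration.
triples = [
    ("dprepr:Dataset", "rdfs:label", "dataset"),
    ("dprepr:Dataset", "skos:definition", "_:def1"),
    ("_:def1", "rdfs:comment", "A data set is a collection of data."),
    ("_:def1", "dct:source", "http://dbpedia.org/resource/Data_set"),
    ("_:def1", "dct:created", "2019-08-16T11:35:21.06Z"),
]


def objects(graph, subject, predicate):
    """Return all objects of triples matching the subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def follow_path(graph, subject, *predicates):
    """Follow a sequence of predicates, like a SPARQL path p1/p2/..."""
    current = [subject]
    for predicate in predicates:
        current = [o for node in current for o in objects(graph, node, predicate)]
    return current


texts = follow_path(triples, "dprepr:Dataset", "skos:definition", "rdfs:comment")
sources = follow_path(triples, "dprepr:Dataset", "skos:definition", "dct:source")
print(texts, sources)
```

The same traversal works unchanged if a term carries several `skos:definition` nodes, in which case each list would hold one entry per definition.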

dr-shorthair avatar Aug 15 '19 22:08 dr-shorthair

This makes sense to me. I could easily code up something which opens a new pull request for every hit that we get from DBPedia. Let's see if we can get any more consensus...

lewismc avatar Aug 15 '19 22:08 lewismc

Ignoring the temptation to comment on the definition :-), I like this. Presumably there can be multiple definitions, which I think is helpful to prevent people from trying to "reason over the definitions" (or argue over the definitions, equally to the point). Good general-purpose definitions are very hard to build, so most aren't that good; the meaning is in the interplay of definitions.

In the interest of rigor, can the date be an ISO 8601 date+time+time zone? How does RDFS feel about (read: tolerate) that format?
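On the format question: as far as I know, the lexical form of `xsd:dateTime` follows the ISO 8601 style, and `xsd:dateTimeStamp` additionally requires an explicit time zone. A minimal Python sketch of producing such a value, assuming hypothetical helper names `to_xsd_datetimestamp` and `as_typed_literal`:

```python
from datetime import datetime, timezone


def to_xsd_datetimestamp(moment: datetime) -> str:
    """Format an aware datetime as date + time + zone offset."""
    # Naive datetimes carry no zone, which xsd:dateTimeStamp does not allow.
    if moment.tzinfo is None or moment.utcoffset() is None:
        raise ValueError("xsd:dateTimeStamp requires an explicit time zone")
    return moment.isoformat()


def as_typed_literal(moment: datetime) -> str:
    """Render the timestamp as a typed Turtle literal (hypothetical helper)."""
    return f'"{to_xsd_datetimestamp(moment)}"^^xsd:dateTimeStamp'


# Same instant as the example timestamp, expressed with an explicit UTC offset.
copied_at = datetime(2019, 8, 16, 11, 35, 21, 60000, tzinfo=timezone.utc)
literal = as_typed_literal(copied_at)
print(literal)
```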

@lewismc What about doing all the pull requests automatically in a branch, then push them all to a Google table (or similar) for each review/comment? (a) You don't want to give someone carpal tunnel approving pull requests, and (b) the likelihood should significantly favor acceptance, with a definition rejected only if there is an agreement it is clearly unacceptable or represents a different concept. (And in the former case, that it's just a poor definition, the disapproval could be represented by annotating the definition, rather than by not including it.) Some system to keep track of the issues and rejections for future updates would be very helpful to minimize future maintenance costs. But treating this as a "handy dandy reference" not as a rigorous definition means reviews could be pretty superficial, just: Is it the right concept or the wrong concept?

graybeal avatar Aug 15 '19 22:08 graybeal

Makes sense. I might go with wikidata rather than dbpedia. I also have code to do wikidata matching.

See https://github.com/EnvironmentOntology/envo/issues/833

We should request a SWEET ID property in wikidata see https://www.wikidata.org/wiki/Property:P3859

I feel like

cmungall avatar Aug 19 '19 14:08 cmungall

Hi folks, I'm going to start work on this issue.

@cmungall I took a look at your provisional mappings from ENVO to Wikidata and they look excellent. If we could produce the same for SWEET it would be great. I'm looking into https://github.com/cmungall/wikidata_ontomatcher right now.

lewismc avatar Nov 21 '19 23:11 lewismc

@cmungall I created the following https://www.wikidata.org/wiki/Q76025584 but have a feeling it is incorrect.

lewismc avatar Nov 22 '19 00:11 lewismc

If we want Wikidata to use SWEET internally, then it needs a CC-0 license …

dr-shorthair avatar Nov 22 '19 01:11 dr-shorthair

Well I think SWEET is now in the public domain - so would a public domain mark from the Creative Commons do?

rduerr avatar Nov 29 '19 19:11 rduerr

@dr-shorthair I'm not sure that's what we want... @cmungall can you please confirm?

@rduerr I don't think we need to be concerned with this.

lewismc avatar Dec 02 '19 16:12 lewismc

@lewismc Why not? If SWEET is going to be heavily used it should have one or the other (CC0 or PDM).

rduerr avatar Dec 02 '19 20:12 rduerr

@rduerr I think SWEET may already be heavily used... Additionally, SWEET is software or a software artifact. CC is a public copyright license, not a software license. I must admit, I've licensed content under CC-* but not software... I just wouldn't do it. If you want to propose this to the wider audience then by all means go ahead. If you have buy-in then please go ahead.

lewismc avatar Dec 02 '19 21:12 lewismc

In my experience, ontologies are licensed as content via some CC license or equivalent, not as software. Of course, I am strongly behind the idea that ontology engineering should be treated more like software engineering, and ontologies are computable artefacts. But nevertheless they are more like data than software. IMHO attempts to apply software licenses to ontologies have created confusion (worst is when people try and GPL them.. what does "linking" mean for an ontology?).

I suggest a new issue for this however, as we have strayed from the original topic...

cmungall avatar Dec 03 '19 03:12 cmungall

I took a look at your provisional mappings from ENVO to Wikidata and they look excellent. If we could produce the same for SWEET it would be great. I'm looking into https://github.com/cmungall/wikidata_ontomatcher right now.

Yes I will try and Dockerize with help from @wdduncan

I created the following https://www.wikidata.org/wiki/Q76025584 but have a feeling it is incorrect.

Looks OK. Will you make a wikipedia page as well? The sync between these is still not totally clear to me

If we want Wikidata to use SWEET internally, then it needs a CC-0 license

Correct. I support this but if there is reticence one option is to make a CC-0 subset for export. We do this for Mondo, where we don't export the text definitions.

cmungall avatar Dec 03 '19 04:12 cmungall

Can I suggest that SWEET onts use skos:definition for anything it defines - all its classes, properties etc. - and the more generic rdfs:comment for other things like notes on imported ontology elements, the ontology-level metadata etc. This is the trend for current W3C ontology work like DCAT 2 & the Profiles Vocabulary.

nicholascar avatar Dec 05 '19 02:12 nicholascar

Seems reasonable. We use IAO for definitions in OBO but we probably should have used skos.

IAO can be useful for more specific properties where rdfs:comment is too broad - curator note, usage note, etc

cmungall avatar Dec 05 '19 04:12 cmungall

And this is why having real "best practices" for ontology development is a good idea. If only some such agreement had been reached 8 months ago, I could have followed it!

rduerr avatar Dec 05 '19 05:12 rduerr