SemMedDB Novelty?
Should we ponder if/how we might leverage this "Novelty" idea in SemMedDB?
https://togithub.com/NCATSTranslator/minihackathons/issues/313 (non-autoreferencing link)
Thanks for bringing this to my attention. Do you know if they are talking about SUBJECT_NOVELTY and OBJECT_NOVELTY scores as defined on the SemMedDB Database Download page?
I do not know. The way the conversation went, I did not understand that there were two. I assumed (apparently incorrectly) that each assertion had one Novelty score. I don't know much about it.
Looks like they are talking about the SUBJECT_NOVELTY and OBJECT_NOVELTY scores. Andrew mentions the score of 1 in relation to the object cytokine, and if you look at the JSON he posted, you can see it under the attributes of that node.
Let's keep #1551 in mind when addressing this issue
And #1658
Current status: Finn says that these novelty scores are in KG2 but perhaps not in KG2C, so we are asking @amykglen whether they are and, if so, how to access them.
Then it's the reasoning team's responsibility to handle these in the ranker and incorporate them in #1695.
what are the properties containing these scores called in KG2pre? are they subject score and object score (that appear under publications_info)? (@saramsey, @acevedol) if so, those are already in KG2c and are loaded into TRAPI attributes. if not, then they're not yet in KG2c. but could be added with a little more info.
this is an example of a TRAPI edge that has such a subject score and object score filled out (underneath the PMID entry): https://arax.ncats.io/?r=44461
"infores:rtx-kg2:UniProtKB:P05181-biolink:produces-CHEMBL.COMPOUND:CHEMBL112":{
  "attributes":[
    {
      "attribute_source":"infores:rtx-kg2",
      "attribute_type_id":"biolink:aggregator_knowledge_source",
      "attributes":null,
      "description":null,
      "original_attribute_name":null,
      "value":"infores:rtx-kg2",
      "value_type_id":"biolink:InformationResource",
      "value_url":null
    },
    ...
    {
      "attribute_source":"infores:semmeddb",
      "attribute_type_id":"bts:sentence",
      "attributes":null,
      "description":null,
      "original_attribute_name":null,
      "value":{
        "PMID:22224048":{
          "object score":1000,
          "publication date":"2011 Oct",
          "sentence":"CONCLUSION: These results suggest that goldenseal ameliorates APAP-induced ALF and that this protection can likely be attributed to the inhibition of CYP2E1 activity, which generates the highly reactive intermediate of APAP.",
          "subject score":888
        }
      },
      "value_type_id":null,
      "value_url":null
    }
  ],
  "object":"CHEMBL.COMPOUND:CHEMBL112",
  "predicate":"biolink:produces",
  "subject":"UniProtKB:P05181"
}
Hi @amykglen, I see the subject score and object score in the publications_info on edges in KG2pre:
['CHEMBL.COMPOUND:CHEMBL1201290', 'UMLS:C0304475', 'biolink:close_match', 'infores:atc-codes-umls|infores:semmeddb', 'PMID:27413123', "{'PMID:27413123': {'publication date': '2016 Aug', 'sentence': 'CONCLUSIONS: The bioavailability of potassium is as high from potatoes as from potassium gluconate supplements.', 'subject score': 1000, 'object score': 890}}", 'ATC:A12BA---oboFormat:xref---UMLS:C0304475---umls_source:ATC|UMLS:C0032821---SEMMEDDB:same_as---UMLS:C0304475---SEMMEDDB:', '1355\n']
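For anyone who wants to inspect these values programmatically, here is a minimal sketch of pulling the subject score and object score out of a publications_info string like the one in the row above. It assumes the field is a Python-literal dict serialized as a string, as it appears in that row; the helper name `extract_scores` is hypothetical and not part of the KG2 codebase. Whether these fields correspond to SUBJECT_NOVELTY and OBJECT_NOVELTY is a separate question that this sketch does not answer.

```python
import ast

# publications_info string copied from the KG2pre edge row above.
raw_publications_info = (
    "{'PMID:27413123': {'publication date': '2016 Aug', "
    "'sentence': 'CONCLUSIONS: The bioavailability of potassium is as high "
    "from potatoes as from potassium gluconate supplements.', "
    "'subject score': 1000, 'object score': 890}}"
)


def extract_scores(publications_info: str) -> dict:
    """Map each PMID to its (subject score, object score) pair.

    Hypothetical helper: assumes the input is a Python-literal dict
    keyed by PMID, which is how the field looks in the row above.
    """
    parsed = ast.literal_eval(publications_info)
    return {
        pmid: (info.get("subject score"), info.get("object score"))
        for pmid, info in parsed.items()
    }


scores = extract_scores(raw_publications_info)
print(scores)
```

`ast.literal_eval` is used rather than `json.loads` because the string uses single-quoted Python syntax, which is not valid JSON.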
thanks @acevedol. I'm specifically wondering if those subject score and object score properties correspond to the semmeddb "SUBJECT_NOVELTY" and "OBJECT_NOVELTY" scores being requested in this issue?
This page explains what "novelty" means in the modern SemMedDB:
https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/SemMedDB_download.html
See:
The GENERIC_CONCEPT table has been updated in the June 30 2018 and all subsequent releases. Consequently, the SUBJECT_NOVELTY and OBJECT_NOVELTY columns of the PREDICATION table have been updated as follows: If the concept is not in the GENERIC_CONCEPT table, the value is set to 1; otherwise, it is set to 0.
The linked page also explains (sort of) what a GENERIC_CONCEPT is:
A GENERIC_CONCEPT table has been added to the schema. This table contains generic concepts, as indicated by SemRep. The concepts that are not in this table are considered novel.
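To make the quoted rule concrete, here is a small sketch of how the novelty flag would be derived under that definition. The CUIs in `generic_concepts` are made-up placeholders, not actual rows from the GENERIC_CONCEPT table, and the function name `novelty` is hypothetical.

```python
def novelty(cui: str, generic_concepts: set) -> int:
    """Return 1 if the concept is novel (absent from GENERIC_CONCEPT), else 0.

    Mirrors the rule quoted from the SemMedDB download page; this is an
    illustration, not SemMedDB's own code.
    """
    return 0 if cui in generic_concepts else 1


# Placeholder CUIs standing in for the GENERIC_CONCEPT table contents.
generic_concepts = {"C0000001", "C0000002"}

subject_novelty = novelty("C0000001", generic_concepts)  # listed as generic
object_novelty = novelty("C9999999", generic_concepts)   # not listed
print(subject_novelty, object_novelty)
```

Under this definition the flag is binary, so it can only separate generic concepts from everything else; it does not rank concepts by degree of novelty.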
Most subject CUIs are "novel" per the above definition, as you can see here:
mysql> select SUBJECT_NOVELTY, count(PREDICATION_ID) from PREDICATION group by SUBJECT_NOVELTY;
+-----------------+-----------------------+
| SUBJECT_NOVELTY | count(PREDICATION_ID) |
+-----------------+-----------------------+
|               0 |               9407463 |
|               1 |             103388723 |
+-----------------+-----------------------+
And for object novelty:
mysql> select OBJECT_NOVELTY, count(PREDICATION_ID) from PREDICATION group by OBJECT_NOVELTY;
+----------------+-----------------------+
| OBJECT_NOVELTY | count(PREDICATION_ID) |
+----------------+-----------------------+
|              0 |              27313951 |
|              1 |              85482235 |
+----------------+-----------------------+
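As a quick arithmetic check on the two query results above, this sketch computes the fraction of predications whose subject or object is flagged novel. The counts are copied from the MySQL output; the variable and function names are just for illustration.

```python
# Counts copied from the two GROUP BY queries above, keyed by novelty flag.
subject_counts = {0: 9407463, 1: 103388723}
object_counts = {0: 27313951, 1: 85482235}


def novel_fraction(counts: dict) -> float:
    """Fraction of predications with novelty flag 1."""
    return counts[1] / (counts[0] + counts[1])


subject_novel = novel_fraction(subject_counts)
object_novel = novel_fraction(object_counts)

print(f"subject novel: {subject_novel:.1%}")
print(f"object novel:  {object_novel:.1%}")
```

Both queries sum to the same total number of predications (112,796,186), with roughly 92% of subjects and 76% of objects flagged novel. A filter that keeps only novel concepts would therefore remove a minority of predications on either side.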
In light of the above information, how important do folks think this enhancement request is? It would take considerable effort, so I want to get a sense of the potential benefit for ARAX/Expander-Agent or Translator before diving in.
Chris Bizon messaged me today inquiring about novelty. We should see if we can get SemMedDB novelty-based filtering added as a feature for the next release.
Closing as apparently LitCoin is going to replace SemMedDB