Create file for PubChem deposition at every release
It would be great if we could auto-create a file to deposit in PubChem with every stable release of MassBank-data. To discuss: compound information only (relatively easy), mappings with spectral IDs (slightly more info needed), or actual spectra as well (more work on our side). Shall we start with a deposit file for compound information only? Then we need e.g.:
PUBCHEM_EXT_DATASOURCE_REGID <= InChIKey, or any unique identifier our side
PUBCHEM_EXT_DATASOURCE_SMILES <= SMILES
PUBCHEM_EXT_DATASOURCE_CID <= PubChem CID (if available)
PUBCHEM_SUBSTANCE_COMMENT <= here we could e.g. provide accession IDs, collapsed
PUBCHEM_SUBSTANCE_SYNONYM <= any names our side (can have multiple columns, but maybe e.g. max 3 would be sensible)
@meier-rene @sneumann @tsufz what do you think? If yes, who will look after the file? I would contact PubChem to get us a MassBank login for deposition, so credit goes to MassBank(EU) and we can track our submissions.
Hi, we have started to embed Bioschemas information, e.g. Line 60+ in view-source:https://msbi.ipb-halle.de/MassBank/RecordDisplay2?id=PB006301 and I would prefer if PubChem adopts that. The benefit would be that this way they can scrape other Bioschemas-compatible sources too; I would reckon that WikiPathways also embeds such information. Otherwise we end up generating and maintaining mappings for PubChem, ChemSpider, CompTox, ... separately. Yours, Steffen
Let's discuss with Evan and @egonw at Dagstuhl then ...
adding @alasdairgray in to contribute his Bioschemas wisdom ...
@AlasdairGray, I guess a good first step forward is to have that aggregator website you just showed crawl the MassBank website and extract the chemical structures and data record JSON-LD.
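For anyone wanting to try the crawl idea locally, here is a minimal sketch of pulling embedded JSON-LD out of a record page using only the Python standard library. It assumes the markup sits in `<script type="application/ld+json">` elements, which is the usual convention but has not been checked against the MassBank pages here. The names `JsonLdExtractor` and `extract_jsonld` are made up for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed content of every JSON-LD <script> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Assumption: JSON-LD is embedded with type="application/ld+json".
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return a list with one parsed object per JSON-LD block in the page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Fetching the page itself (e.g. with `urllib.request`) is left out on purpose, so the sketch works on any HTML string an aggregator has already downloaded.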
In addition, or as an alternative, to scraping the JSON-LD we can design a REST query that would deliver this. What exactly is needed? CSV? Can we have a pointer to an example deposition file? Yours, Steffen
This is the CSV format I use now:
PUBCHEM_EXT_DATASOURCE_REGID,PUBCHEM_SUBSTANCE_SYNONYM,PUBCHEM_EXT_SUBSTANCE_URL,PUBCHEM_EXT_DATASOURCE_SMILES
Q127900,β-lactose,https://tools.wmflabs.org/scholia/Q127900,C([C@@H]1[C@@H]([C@@H]([C@H]([C@@H](O1)O[C@@H]2[C@H](O[C@H]([C@@H]([C@H]2O)O)O)CO)O)O)O)O
Q128381,harmine,https://tools.wmflabs.org/scholia/Q128381,CC1=NC=CC2=C1NC3=C2C=CC(=C3)OC
Q128540,zirconyl chloride,https://tools.wmflabs.org/scholia/Q128540,[Cl-].[Cl-].[O].[Zr+2]
Q129163,tin(IV) oxide,https://tools.wmflabs.org/scholia/Q129163,O=[Sn]=O
Q130175,butane-1-selenol,https://tools.wmflabs.org/scholia/Q130175,CCCC[SeH]
Q130336,phenol,https://tools.wmflabs.org/scholia/Q130336,C1=CC=C(C=C1)O
Q130365,riboflavin,https://tools.wmflabs.org/scholia/Q130365,CC1=CC2=C(C=C1C)N(C3=NC(=O)NC(=O)C3=N2)C[C@@H]([C@@H]([C@@H](CO)O)O)O
Q131189,propane,https://tools.wmflabs.org/scholia/Q131189,CCC
Q131994,veratrole,https://tools.wmflabs.org/scholia/Q131994,COC1=CC=CC=C1OC
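A file in this layout could be written along the following lines. This is only a sketch: the column order is copied from the header above, while the function names `write_deposit_csv` and `deposit_csv_text` are invented for illustration. Using the `csv` module rather than joining strings by hand means names that contain commas get quoted automatically.

```python
import csv
import io

# Column order copied from the example header above.
COLUMNS = [
    "PUBCHEM_EXT_DATASOURCE_REGID",
    "PUBCHEM_SUBSTANCE_SYNONYM",
    "PUBCHEM_EXT_SUBSTANCE_URL",
    "PUBCHEM_EXT_DATASOURCE_SMILES",
]


def write_deposit_csv(rows, handle):
    """Write dicts keyed by COLUMNS to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


def deposit_csv_text(rows):
    """Convenience wrapper returning the CSV as one string."""
    buffer = io.StringIO()
    write_deposit_csv(rows, buffer)
    return buffer.getvalue()
```

When writing to a real file, open it with `newline=""` and `encoding="utf-8"` so that names such as β-lactose survive and line endings are not doubled on Windows.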
@schymane is there any progress with the implementation in PubChem? I guess you are in contact with Evan anyway. If not, I could contact him to discuss the topic further.
BTW, our schema still lacks the Description tag. See https://github.com/MassBank/MassBank-web/issues/206
We need a consistent deposit file created on the MassBank side, preferably automatically with every release (see subject line). Then Jeff can grab it and update automatically on the PubChem side. If we design it to have DTXSIDs in it as well, this is all CompTox would need. The process is simple, but the file needs to come from the MassBank side. It's basically the deduplicated record summary (i.e. a summary of unique compounds, not records). Could easily be auto-generated ...
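To make "deduplicated record summary" concrete, here is a rough sketch of collapsing per-record data into one row per compound. It assumes the records have already been parsed into dicts with the keys `accession`, `inchikey`, `smiles` and `names`; those keys and the function name `summarize_compounds` are hypothetical and not taken from the MassBank code base. Grouping by InChIKey and capping synonyms at 3 follows the suggestions made earlier in this thread.

```python
MAX_SYNONYMS = 3  # the "max 3" suggested earlier in the thread


def summarize_compounds(records):
    """Collapse spectral records into one summary row per unique InChIKey."""
    compounds = {}
    for rec in records:
        key = rec["inchikey"]
        # The first record seen for a compound supplies its SMILES.
        entry = compounds.setdefault(
            key, {"smiles": rec["smiles"], "accessions": [], "synonyms": []}
        )
        entry["accessions"].append(rec["accession"])
        for name in rec.get("names", []):
            if name not in entry["synonyms"] and len(entry["synonyms"]) < MAX_SYNONYMS:
                entry["synonyms"].append(name)

    rows = []
    for key, entry in compounds.items():
        rows.append({
            "PUBCHEM_EXT_DATASOURCE_REGID": key,
            "PUBCHEM_EXT_DATASOURCE_SMILES": entry["smiles"],
            # Accession IDs collapsed into a single comment field.
            "PUBCHEM_SUBSTANCE_COMMENT": "; ".join(entry["accessions"]),
            # Kept as a list; each item would become its own synonym column.
            "PUBCHEM_SUBSTANCE_SYNONYM": entry["synonyms"],
        })
    return rows
```

A DTXSID or PubChem CID column could be carried through the same way if the records hold them. Whether the InChIKey or some other identifier ends up as the REGID is still open, as noted in the first comment.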